WP2

The primary objective of this work package is to develop several types of multimodal models in order to study their properties and understand their respective strengths and weaknesses.

Initially, we are investigating an architecture that addresses two constraints: encoding and cost functions. Each modality requires its own encoder, yet these encoders must produce modality-independent representations. Moreover, the modalities naturally call for different cost functions, which leads us to consider a single, unified cost function shared across modalities.
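To make this concrete, here is a minimal sketch of the idea: two modality-specific encoders project their inputs into a shared embedding space, and one unified contrastive (InfoNCE-style) objective is applied to every modality pair. All names, dimensions, and the use of single linear maps as encoders are hypothetical simplifications, not the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Normalize each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode(x, weights):
    # Stand-in for a modality-specific encoder: a single linear projection
    # into the shared, modality-independent embedding space.
    return l2_normalize(x @ weights)

def unified_contrastive_loss(za, zb, temperature=0.07):
    # One shared objective applied to paired embeddings from any two
    # modalities: matched pairs (the diagonal) should score higher than
    # mismatched ones.
    logits = za @ zb.T / temperature
    labels = np.arange(len(za))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

# Toy batch: 4 paired samples, "text" features (dim 16) and "audio"
# features (dim 32), each mapped by its own encoder into an 8-dim space.
text = rng.normal(size=(4, 16))
audio = rng.normal(size=(4, 32))
w_text = rng.normal(size=(16, 8))
w_audio = rng.normal(size=(32, 8))

loss = unified_contrastive_loss(encode(text, w_text), encode(audio, w_audio))
```

Because the same loss applies to any modality pair, adding a new modality only requires a new encoder, not a new objective.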

Another solution under consideration for learning multimodal representations, inspired by unsupervised approaches to neural machine translation (Üstün et al., 2021; Artetxe et al., 2018), is to encode a sequence independently of its modality and then reconstruct or generate the other modalities from that encoding. Each modality would thus have its own decoder. Although less effective than supervised translation, this method is studied in parallel with the first, as an alternative.
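The shared-encoder, per-modality-decoder idea can be sketched as follows: inputs are first projected into a common space, passed through one modality-independent encoder, and then a modality-specific decoder generates the target modality, trained with a reconstruction loss. All layer shapes, the per-modality input projections, and the MSE objective are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

def shared_encoder(x, w_enc):
    # Modality-independent encoder: applied identically to every modality
    # once inputs are projected into a common space (hypothetical setup).
    return np.tanh(x @ w_enc)

def decode(z, w_dec):
    # Each modality has its own decoder that reconstructs or generates
    # that modality from the shared representation.
    return z @ w_dec

# Toy paired data: "text" (dim 16) and "audio" (dim 32), common space dim 8.
text = rng.normal(size=(4, 16))
audio = rng.normal(size=(4, 32))
w_in_text = rng.normal(size=(16, 8))
w_enc = rng.normal(size=(8, 8))
w_dec_audio = rng.normal(size=(8, 32))

z = shared_encoder(text @ w_in_text, w_enc)   # encode, modality-independently
audio_hat = decode(z, w_dec_audio)            # generate the other modality
loss = np.mean((audio_hat - audio) ** 2)      # reconstruction objective
```

Training would sum such reconstruction losses over all source/target modality pairs, so the encoder is pushed toward representations that no longer depend on the input modality.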

Throughout the project, we will explore various options to reduce the environmental cost of training our models. Additionally, we plan to distill our large-scale models into smaller ones, in order to meet real-time computational constraints and provide more efficient models.
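As a reminder of what such distillation involves, here is a minimal sketch of the standard objective (Hinton-style knowledge distillation): the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. The temperature value and toy logits are illustrative assumptions, not the project's chosen settings.

```python
import numpy as np

def softmax(x, t=1.0):
    # Temperature-softened softmax, computed stably by shifting the max.
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, t=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # rescaled by t^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * t * t)

rng = np.random.default_rng(2)
teacher_logits = rng.normal(size=(4, 10))   # large-scale teacher outputs
student_logits = rng.normal(size=(4, 10))   # compact student outputs

loss = distillation_loss(teacher_logits, student_logits)
```

In practice this term is typically mixed with the ordinary supervised loss on hard labels; the distilled student then replaces the large model wherever real-time constraints apply.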