Within the frameworks of FlauBERT, LeBenchmark, and Propicto, an intensive study of the available written-text, speech, and pictogram data was conducted. This data makes it possible to identify the genre and domain of the texts, audio, and pictograms, as well as the gender of their authors and speakers. This information is used to select different versions of the training data, varying the gender ratio of authors to avoid amplifying social biases and to study the impact on downstream models and tasks (WP3).
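To make this selection step concrete, the sketch below subsamples a corpus to a target author gender ratio. It is a minimal illustration, not the project's actual pipeline: the `author_gender` field, its 'F'/'M' values, and the 50/50 default target are assumptions introduced here for the example.

```python
import random

def subsample_by_gender(documents, target_female_ratio=0.5, seed=0):
    """Subsample documents so the share of female-authored texts
    matches target_female_ratio, sampling without replacement.

    Assumes each document is a dict with an 'author_gender' field
    ('F' or 'M'); documents with unknown gender are left out here.
    """
    rng = random.Random(seed)
    female = [d for d in documents if d.get("author_gender") == "F"]
    male = [d for d in documents if d.get("author_gender") == "M"]

    # The minority class bounds the size of the rebalanced corpus.
    n_female = min(len(female),
                   int(len(male) * target_female_ratio
                       / (1 - target_female_ratio)))
    n_male = int(n_female * (1 - target_female_ratio) / target_female_ratio)

    sample = rng.sample(female, n_female) + rng.sample(male, n_male)
    rng.shuffle(sample)
    return sample
```

Varying `target_female_ratio` across runs yields the different training-data versions whose downstream effects WP3 would compare.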
WP1 involves collecting synchronized multimodal data to ensure the coherence of the joint latent space and to evaluate the consistency of inferences. Given the limited availability of such data, we will also perform conversions between modalities (e.g., audio to text, text to pictograms) to increase the volume of data and to assess the value each modality contributes.
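As one example of such a conversion, the following sketch turns unannotated recordings into (speech, text) pairs with an off-the-shelf ASR model via the Hugging Face `transformers` pipeline. The specific model name is an assumption; the project would substitute whatever recognizer it adopts (e.g., a system built within LeBenchmark).

```python
from transformers import pipeline

# Assumption: a generic multilingual ASR model stands in for the
# project's own speech recognition system.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def audio_to_text(audio_path: str) -> str:
    """Convert one modality (speech) into another (text) so the
    result can serve as a synchronized multimodal training pair."""
    return asr(audio_path)["text"]

# Example: pair a recording with its automatic transcription.
# pair = ("recording.wav", audio_to_text("recording.wav"))
```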
Objectives:
Collect, filter, and prepare unimodal and multimodal pretraining data, ensuring sufficient size, complete documentation, and the right to redistribute the data, while minimizing harmful biases. Evaluation data is excluded from pretraining to avoid any contamination (a minimal decontamination sketch follows this list).
Develop techniques to generate parallel data. For example, we will complement 14,000 hours of audio signals with transcriptions, use speech synthesis to augment the speech corpora (see the sketch after this list), and explore generative image models to expand the pictogram corpus.
Provide WP2 and WP4 with preprocessed corpora in the formats defined in WP2.
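Regarding the exclusion of evaluation data from pretraining (first objective), a minimal decontamination pass might look like the sketch below, which drops any pretraining document sharing a long character span with the evaluation set. The n-gram length of 50 is an illustrative threshold, not a project-defined value, and the project may settle on a different criterion.

```python
def ngrams(text: str, n: int = 50):
    """Character n-grams; long spans make accidental collisions unlikely."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def decontaminate(pretrain_docs, eval_docs, n: int = 50):
    """Remove pretraining documents that overlap the evaluation data.

    A document is dropped if it shares any length-n character span
    with any evaluation document. This exact-match heuristic is an
    assumption for illustration; near-duplicate detection could be
    substituted without changing the overall flow.
    """
    eval_spans = set()
    for doc in eval_docs:
        eval_spans |= ngrams(doc, n)
    return [doc for doc in pretrain_docs
            if not (ngrams(doc, n) & eval_spans)]
```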
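For the parallel-data objective, speech-synthesis augmentation could follow the pattern sketched here with the Coqui `TTS` library: text-only data is converted into parallel (text, audio) pairs. The French model name is an assumption taken from Coqui's public model zoo; any synthesis system producing such pairs would serve equally well.

```python
import os

from TTS.api import TTS

# Assumption: a pretrained French TTS model from the Coqui model zoo
# stands in for whatever synthesis system the project selects.
tts = TTS(model_name="tts_models/fr/css10/vits")

def synthesize_pairs(sentences, out_dir="synthetic_speech"):
    """Turn text-only sentences into parallel (text, audio) pairs
    that can augment the speech corpora."""
    os.makedirs(out_dir, exist_ok=True)
    pairs = []
    for i, sentence in enumerate(sentences):
        path = os.path.join(out_dir, f"utt_{i:06d}.wav")
        tts.tts_to_file(text=sentence, file_path=path)
        pairs.append((sentence, path))
    return pairs
```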