WP3

Causal models will be evaluated intrinsically through perplexity, while transfer learning, in which the language model is embedded in a larger network for a downstream task, allows for indirect evaluation that can be adapted to the modality or domain of interest (general, medical, news, etc.).
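As an illustration, perplexity over a held-out text can be computed from the average per-token negative log-likelihood of a causal model; the minimal sketch below uses Hugging Face Transformers, with the checkpoint name and evaluation sentence as placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal (autoregressive) checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood per predicted token)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy
        # over the shifted tokens, i.e. the average NLL per token.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("Le modèle est évalué sur un texte de test."))
```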

We will rely on existing evaluation benchmarks such as FLUE (text classification, paraphrase detection, natural language inference, lexical disambiguation, syntactic parsing, morphosyntactic tagging) and LeBenchmark (spoken language understanding, automatic speech recognition, syntactic parsing of speech). We also plan to create new evaluation datasets for tasks involving audiovisual content (TV news topic classification, media event detection, quotation extraction) and for the analysis of political discourse and debate (emotion detection, stance detection towards a topic).
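By way of illustration, downstream evaluation on one of these classification tasks could follow the pattern sketched below; the dataset identifier, configuration name, column names and checkpoint path are assumptions to be adapted to the actual benchmark release.

```python
from datasets import load_dataset
from transformers import pipeline

# Hypothetical identifiers: the benchmark name/config and the fine-tuned
# checkpoint below are placeholders to be replaced by the actual releases.
DATASET_ID, CONFIG = "flue", "CLS"
CHECKPOINT = "path/to/finetuned-classifier"

test_set = load_dataset(DATASET_ID, CONFIG, split="test")
classifier = pipeline("text-classification", model=CHECKPOINT)

correct = 0
for example in test_set:
    pred = classifier(example["text"])[0]["label"]   # e.g. "LABEL_1"
    correct += int(pred.endswith(str(example["label"])))
print(f"accuracy = {correct / len(test_set):.3f}")
```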

In the medical field, a particular effort will be devoted to pooling the set of tasks in which the partners are already involved through the DrBERT and FlauBERT Medical initiatives (biomedical named entity recognition, pathological voice processing, speech-to-pictogram conversion, patient trajectory prediction, etc.).

Text generation tasks such as automatic summarization and text/speech simplification will also be included. Multimodality will be evaluated through tasks that require information from several modalities, such as TV news topic segmentation, which can draw on prosodic cues from the audio as well as semantic cues captured in the transcription.
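As a rough sketch of how such cues could be combined, the example below scores candidate topic boundaries between consecutive news segments by fusing a semantic dissimilarity cue (from transcript embeddings) with a simple prosodic cue (pause duration before the boundary); the features and weighting are illustrative assumptions, not the project's final design.

```python
import numpy as np

def boundary_scores(text_embeddings: np.ndarray,
                    pause_durations: np.ndarray,
                    alpha: float = 0.5) -> np.ndarray:
    """Score each boundary between consecutive segments i and i+1.

    text_embeddings: (n_segments, dim) sentence embeddings of the transcript.
    pause_durations: (n_segments - 1,) silence length (seconds) at each boundary.
    alpha: assumed weight balancing semantic vs. prosodic evidence.
    """
    # Semantic cue: cosine dissimilarity between adjacent transcript segments.
    norm = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    semantic_cue = 1.0 - np.sum(norm[:-1] * norm[1:], axis=1)

    # Prosodic cue: longer pauses suggest a topic change (min-max normalized).
    span = pause_durations.max() - pause_durations.min()
    prosodic_cue = (pause_durations - pause_durations.min()) / (span + 1e-8)

    return alpha * semantic_cue + (1.0 - alpha) * prosodic_cue

# Toy example: 4 segments, hence 3 candidate boundaries.
embeddings = np.random.default_rng(0).normal(size=(4, 384))
pauses = np.array([0.2, 1.5, 0.3])
print(boundary_scores(embeddings, pauses))
```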

We will evaluate the linguistic and factual “knowledge” encapsulated in the models, in particular their handling of negation and their ability to exploit long contexts. The models will also be evaluated on their use of world knowledge through the French version of the Winograd schema dataset.
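For instance, a causal model's use of world knowledge on Winograd-style items can be probed zero-shot by comparing the likelihood it assigns to the sentence under each candidate resolution of the ambiguous pronoun; a minimal sketch follows, with the checkpoint and the (English, toy) item used only for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder causal checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_nll(sentence: str) -> float:
    """Total negative log-likelihood of a sentence under the causal model."""
    enc = tokenizer(sentence, return_tensors="pt")
    n_predicted = enc["input_ids"].shape[1] - 1  # loss is averaged over these
    with torch.no_grad():
        mean_nll = model(**enc, labels=enc["input_ids"]).loss.item()
    return mean_nll * n_predicted

def resolve(template: str, candidates: list[str]) -> str:
    """Pick the candidate whose substitution yields the most likely sentence."""
    return min(candidates, key=lambda c: sentence_nll(template.format(c)))

# Toy Winograd-style item (illustrative only, not drawn from the French dataset).
item = "The trophy does not fit in the suitcase because {} is too big."
print(resolve(item, ["the trophy", "the suitcase"]))
```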

Finally, we will measure the biases of our language models, i.e., systematic differences in model behavior depending on the demographic characteristics of the individuals mentioned in the input and/or output. We will also seek to estimate whether these biases are attenuated or amplified with respect to the training corpora, in order to better understand how suitable our models are for social science research.
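One simple way to surface such differences, sketched below, is to probe how a masked language model fills a slot referring to a person when only the occupational context around it changes; the checkpoint and templates shown are illustrative assumptions rather than the evaluation protocol itself.

```python
from transformers import pipeline

# Placeholder French masked LM; any fill-mask checkpoint can be probed this way.
fill_mask = pipeline("fill-mask", model="camembert-base")

# Illustrative templates: observe how the pronoun slot is filled when
# the occupation mentioned in the context varies.
for occupation in ["ingénieur", "infirmière"]:
    template = f"<mask> est {occupation}."
    predictions = fill_mask(template, top_k=5)
    print(occupation, [(p["token_str"], round(p["score"], 3)) for p in predictions])
```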