BabyBabelLM
A Multilingual Benchmark of Developmentally Plausible Training Data
Overview

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
Figure: Training data distribution by category across languages for all data tiers in BabyBabelLM.
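The data is available through HuggingFace. As a minimal sketch of how one language's training split could be loaded with the `datasets` library; the repository id, configuration name, and column name here are illustrative assumptions, not the actual identifiers (see the BabyBabelLM HuggingFace page for those):

```python
from datasets import load_dataset

# Hypothetical repository and configuration ids; check the BabyBabelLM
# HuggingFace page for the actual dataset identifiers.
data = load_dataset("BabyLM-community/babybabellm", "nld", split="train")

# Assumes a plain-text column named "text"; the real schema may differ.
for example in data.select(range(3)):
    print(example["text"][:100])
```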
Dataset Construction

Contributing

BabyBabelLM is a living resource, created, maintained, and updated by language communities. We welcome community contributions in two different forms:

Baseline Models and Evaluation

We train a set of monolingual and multilingual baseline models and evaluate them on a suite of benchmarks; both the models and the benchmarks are available through HuggingFace as well as on the BabyLM page.
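As a minimal sketch of how one of the baseline models could be loaded and queried with `transformers`; the checkpoint id below is a hypothetical placeholder, so consult HuggingFace or the BabyLM page for the actual model names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id; the actual baseline models are listed on
# HuggingFace and the BabyLM page.
model_id = "BabyLM-community/babybabellm-nld"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation from a prompt.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```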
Contributors

Please reach out to Jaap or Leshem if you have any questions about the paper!
• Jaap Jumelet (University of Groningen)
• Abdellah Fourtassi (Aix Marseille University)
• Akari Haga (Nara Institute of Science and Technology)
• Bastian Bunzeck (Bielefeld University)
• Bhargav Shandilya (University of Colorado Boulder)
• Diana Galvan-Sosa (University of Cambridge, SomosNLP)
• Faiz Ghifari Haznitrama (KAIST)
• Francesca Padovani (University of Groningen)
• Francois Meyer (University of Cape Town)
• Hai Hu (City University of Hong Kong)
• Julen Etxaniz (HiTZ, University of the Basque Country)
• Laurent Prévot (Aix Marseille University)
• Linyang He (Columbia University)
• María Grandury (SomosNLP)
• Mila Marcheva (University of Cambridge)
• Negar Foroutan (EPFL)
• Nikitas Theodoropoulos (Independent Researcher)
• Pouya Sadeghi (University of Waterloo)
• Siyuan Song (University of Texas at Austin)
• Suchir Salhan (University of Cambridge)
• Susana Zhou (SomosNLP)
• Yurii Paniv (Ukrainian Catholic University)
• Ziyin Zhang (Shanghai Jiao Tong University)
• Arianna Bisazza (University of Groningen)
• Alex Warstadt (University of California San Diego)
• Leshem Choshen (MIT, MIT-IBM Watson AI Lab)