BabyLM Challenge
Summary: The BabyLM Challenge will be held again as the shared task for CoNLL 2024, co-located with EMNLP! The overarching goals of the challenge remain the same; however, some of the rules have changed this year. See below for an overview of the rule updates.
→ An updated 100M- and 10M-word text-only dataset, with a higher proportion of child and child-directed speech.
→ A new multimodal dataset with 50M words of paired text-image data and 50M words of text-only data.
• The evaluation pipeline is out here!
Consider joining the BabyLM Slack if you have any questions for the organizers or want to connect with other participants!
Submission guide
Submit Here
To fill out the submission, please prepare these two things:
- A HuggingFace link to your models (one way to obtain this link is sketched below).
- A download link to your results, assembled via the collect_results.py script in babylm/evaluation-pipeline-2024.
Paper submissions follow the CoNLL template (4-8 pages) and must also include a hyperparameter form.
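If your model was trained with the HuggingFace transformers library, one way to produce the required model link is to push the checkpoint to the Hub. The sketch below is only an illustration: it assumes a transformers-compatible masked LM, and the checkpoint path and repository ID are hypothetical placeholders.

```python
# Minimal sketch of uploading a trained model and tokenizer to the HuggingFace
# Hub so the resulting link can be included in the submission form.
# NOTE: the checkpoint path and repo_id below are hypothetical placeholders.
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint_dir = "./checkpoints/my-babylm-10M"   # local path to your trained model (placeholder)
repo_id = "my-username/babylm-10M-submission"    # Hub repository to create or update (placeholder)

model = AutoModelForMaskedLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

# push_to_hub uploads the weights, config, and tokenizer files to the Hub;
# run `huggingface-cli login` beforehand so the calls are authenticated.
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```

If your model uses a causal LM or a custom architecture, the corresponding Auto class (or a manual upload of the checkpoint files) can be substituted.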
Rules Updates for BabyLM Round 2
• Human language learning is inherently multimodal. To encourage more multimodal submissions, we are replacing last year's loose track with a vision-language track. To help teams get started, we are releasing a corpus that is 50% text-only and 50% image-text multimodal data.
• Last year, all competition entrants were required to pretrain on a fixed corpus. This year, we are relaxing this requirement. While we will still provide language-only and multimodal datasets of 100M and 10M words, participants are free to construct their own datasets, provided that they stay within the 100M- or 10M-word budget (a simple word-count check is sketched below this list).
• To encourage contributions that are related to the goals of the challenge, but do not involve direct competition entries, we are introducing a paper-only track. Paper track submissions could include things like novel cognitively-inspired evaluation metrics or in-depth analyses of one particular BabyLM model.
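Participants who build their own datasets will want to verify that the total word count stays within the budget. The following is a minimal sketch that assumes words are counted as whitespace-separated tokens and that the corpus is stored as plain-text files; the directory name, file extension, and exact counting convention are assumptions, not the official rules.

```python
# Minimal sketch: check that a custom pretraining corpus stays within the
# 10M- or 100M-word budget, counting whitespace-separated tokens.
# NOTE: the corpus directory and counting convention are assumptions.
from pathlib import Path

BUDGET = 100_000_000                      # or 10_000_000 for the 10M-word budget
corpus_dir = Path("./my_custom_corpus")   # placeholder location of .txt training files

total_words = 0
for path in sorted(corpus_dir.glob("**/*.txt")):
    with path.open(encoding="utf-8") as f:
        for line in f:
            total_words += len(line.split())

print(f"Total words: {total_words:,}")
print("Within budget" if total_words <= BUDGET else "Over budget!")
```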
Overview
Huge effort has been put into optimizing LM pretraining at massive scales in the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla sees 1.4 trillion words during training, well over 10,000 words for every one word a 13-year-old child has heard in their entire life.
The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining—which is typically thought to be practical only for large industry groups—by drawing attention to open problems that can be addressed on a university budget.
Why <100 Million Words?
Focusing on scaled down pretraining has several potential benefits:
First, small-scale pretraining can be a sandbox for developing novel techniques that improve data efficiency. These techniques could then be scaled up to the larger datasets commonly seen in applied NLP, or used to enhance current approaches to modeling low-resource languages. Second, improving our ability to train LMs on the same kinds and quantities of data that humans learn from will hopefully give us greater access to plausible cognitive models of humans and help us understand what allows humans to acquire language so efficiently.
Organization Team
• Leshem Choshen (IBM Research, MIT)
• Ryan Cotterell (ETH Zurich)
• Michael Hu (NYU)
• Tal Linzen (NYU)
• Aaron Mueller (Northeastern)
• Candace Ross (Meta AI)
• Alex Warstadt (ETH Zurich, UCSD)
• Ethan Wilcox (ETH Zurich, Georgetown)
• Adina Williams (Meta AI)
• Chengxu Zhuang (MIT)
The first BabyLM Challenge was held in 2023, hosted at CoNLL. Information from our website describing the previous rules and timeline can be found here. At the following links, you can find last year's call for papers and the proceedings from last year's challenge.