BabyLM Challenge
Summary: This shared task challenges community members to train a language model from scratch on the same amount of linguistic data available to a child. Submissions should be implemented in Hugging Face's Transformers library and will be evaluated on a shared pipeline. This shared task is co-sponsored by CMCL and CoNLL.
• Models and results due July 22, 2023 (extended from July 15, 2023), 23:59 anywhere on earth (UTC-12). Submit on Dynabench.
• Paper submission due
August 1, 2023 August 2, 2023, 23:59 anywhere on earth (UTC-12). Submit on
OpenReview.
See the guidelines for an overview of submission tracks and pretraining data. See the call for papers for a detailed description of the task setup and data.
Consider joining the BabyLM Slack if you have any questions for the organizers or want to connect with other participants!
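Since the summary above asks for submissions implemented in the Transformers library, here is a minimal, hypothetical sketch of what that usually means in practice: a checkpoint that loads through the standard auto classes. The checkpoint path and the choice of a causal LM are placeholders and assumptions, not requirements of the official evaluation pipeline.

```python
# Minimal sketch, not the official BabyLM evaluation pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/your-babylm-model"  # placeholder path to a saved submission

# A submission saved in the standard Transformers format (save_pretrained)
# can be reloaded with the auto classes:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Quick sanity check: compute the language-modeling loss on one sentence.
inputs = tokenizer("The child heard a new word.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))  # average cross-entropy over tokens
```

Masked language models load analogously via AutoModelForMaskedLM; the point, presumably, is only that the shared pipeline can load submissions through this standard interface.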
Overview
Huge effort has been put into optimizing LM pretraining at massive scales over the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla sees 1.4 trillion words during training: well over 10,000 words for every word a 13-year-old child has heard in their entire life.
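As a rough sanity check on that ratio (assuming, per the data budget below, that a child has heard on the order of 100 million words by age 13):

$$\frac{1.4 \times 10^{12}\ \text{training words}}{10^{8}\ \text{words heard}} = 1.4 \times 10^{4} = 14{,}000$$

i.e., roughly 14,000 training words for every word a child has heard.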
The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development.
Additionally, we hope to democratize research on pretraining, which is typically thought to be practical only for large industry groups, by drawing attention to open problems that can be addressed on a university budget.
Why <100 Million Words?
Focusing on scaled-down pretraining has several potential benefits:
First, small-scale pretraining can be a sandbox for developing novel techniques that improve data efficiency. These techniques could then be scaled up to the larger datasets common in applied NLP, or used to enhance current approaches to modeling low-resource languages.
Second, improving our ability to train LMs on the kinds and quantities of data that humans learn from will, we hope, yield more plausible cognitive models of human language learners and help us understand what allows humans to acquire language so efficiently.
Organization Team
• Leshem Choshen
• Ryan Cotterell
• Kundan Krishna
• Tal Linzen
• Haokun Liu
• Aaron Mueller
• Alex Warstadt
• Ethan Wilcox
• Adina Williams
• Chengxu Zhuang