BabyLM Challenge
Summary: This shared task challenges community members to train a language model from scratch on the same amount of linguistic data available to a child. Submissions should be implemented in Hugging Face's Transformers library and will be evaluated on a shared pipeline. This shared task is co-sponsored by CMCL and CoNLL.
• Models and results due July 22, 2023 (extended from July 15, 2023), 23:59 anywhere on earth (UTC-12). Submit on Dynabench.
• Paper submission due
August 1, 2023 August 2, 2023, 23:59 anywhere on earth (UTC-12). Submit on
OpenReview.
See the guidelines for an overview of submission tracks and pretraining data. See the call for papers for a detailed description of the task setup and data.
Consider joining the BabyLM Slack if you have any questions for the organizers or want to connect with other participants!
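Since the summary above asks for submissions implemented in the Transformers library, here is a minimal, hypothetical sketch of what that usually means in practice: a checkpoint that loads through the standard auto classes. The checkpoint path and the choice of a causal LM are placeholders and assumptions, not requirements of the official evaluation pipeline.

```python
# Minimal sketch, not the official BabyLM evaluation pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/your-babylm-model"  # placeholder path to a saved submission

# A submission saved in the standard Transformers format (save_pretrained)
# can be reloaded with the auto classes:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Quick sanity check: compute the language-modeling loss on one sentence.
inputs = tokenizer("The child heard a new word.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))  # average cross-entropy over tokens
```

Masked language models load analogously via AutoModelForMaskedLM; the point, presumably, is only that the shared pipeline can load submissions through this standard interface.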
Overview
Huge effort has been put into optimizing LM pretraining at massive scales over the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla sees 1.4 trillion words during training: well over 10,000 words for every word a 13-year-old child has heard in their entire life.
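As a rough sanity check on that ratio (assuming, per the data budget below, that a child has heard on the order of 100 million words by age 13):

$$\frac{1.4 \times 10^{12}\ \text{training words}}{10^{8}\ \text{words heard}} = 1.4 \times 10^{4} = 14{,}000$$

i.e., roughly 14,000 training words for every word a child has heard.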
The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modeling to focus their efforts on optimizing pretraining given data limitations inspired by human development.
Additionally, we hope to democratize research on pretraining, which is typically thought to be practical only for large industry groups, by drawing attention to open problems that can be addressed on a university budget.
Why <100 Million Words?
Focusing on scaled-down pretraining has several potential benefits:
First, small-scale pretraining can be a sandbox for developing novel techniques that improve data efficiency. These techniques could then be scaled up to the larger datasets common in applied NLP, or used to enhance current approaches to modeling low-resource languages.
Second, improving our ability to train LMs on the kinds and quantities of data that humans learn from will, we hope, yield more plausible cognitive models of human language learners and help us understand what allows humans to acquire language so efficiently.
Organization Team
• Leshem Choshen
• Ryan Cotterell
• Kundan Krishna
• Tal Linzen
• Haokun Liu
• Aaron Mueller
• Alex Warstadt
• Ethan Wilcox
• Adina Williams
• Chengxu Zhuang