BabyLM Challenge
Sample-efficient pretraining on a developmentally plausible corpus
Can I do cool idea X?
If it is interesting, innovative, or likely to result in important findings, we want you to try it! If you think the rules are holding you back from submitting to the competition, please reach out to the organizers. In the worst (or best) case scenario, it can be an interesting workshop paper.
Why doesn't BabyLM do cool idea X?
Maybe we haven't thought about it; please reach out. We value proactivity.
Can papers be submitted to multiple tracks?
Yes. For example, a single paper can describe models that are submitted separately to the Strict and Interaction tracks. :)
Can I submit a paper about my work?
Yes. All participants should submit a report, which will be published in a proceedings volume. You may also describe any additional experiments beyond those required for the shared-task evaluation.
Can I submit additional evaluation metrics?
Yes. If you wish, you may submit your own evaluation metrics along with your model's performance on them. These will be considered alongside our standardized evaluation results as part of the holistic evaluation that determines outstanding paper awards.
What training regimes are permitted?
Any training objective/regime is permitted as long as the data restrictions are followed. If you use ancillary models, for example for reranking or data augmentation, the training data for these models counts towards your 100M-word budget. This applies to all tracks, including the Interaction track; so, for example, while you can use the permitted external model to produce POS tags, you cannot use an off-the-shelf POS tagger in your pipeline. For evaluation purposes, we require that the model provide a function to score a sequence of words without the need for additional fine-tuning.
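As an illustration of the scoring requirement, here is a minimal sketch assuming a HuggingFace-style causal LM; the checkpoint name is a placeholder, and this is not the official evaluation harness.

```python
# Minimal sketch of a sequence-scoring function, assuming a HuggingFace-style
# causal LM. The checkpoint name is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-babylm-checkpoint")
model = AutoModelForCausalLM.from_pretrained("my-babylm-checkpoint")
model.eval()

def score_sequence(text: str) -> float:
    """Total log-probability the model assigns to `text` (higher = more likely)."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    # Each position predicts the next token: shift, take log-softmax, and sum
    # the log-probabilities of the tokens that actually occur.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_log_probs = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()
```

With a function like this, minimal-pair evaluations reduce to comparing the scores of the acceptable and unacceptable member of each pair, with no fine-tuning involved.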
Are there any limits on hyperparameters or model scale?
No. We may consider parameter efficiency and training efficiency when awarding outstanding paper awards, but we do not impose any hard limits.
Are there any limits on the number of epochs?
This year, yes. Refer to the "Training Duration Limitation" paragraph of Section 4.2 in the Call for Papers for more details.
Can I use external tools?
Yes, but if they are learned on language, the text they learn from counts towards the 100M words. That means you can train a tokenizer, a parser, an LM, and so on, on the same text or on parts of the 100M words, but the total amount of text seen across all of this training cannot exceed the allowance. This raises the question of synthetic data, which is allowed under some restrictions. You may gather your 100M words in any legal way you like (yes, distilling from another model or writing the text yourself is fair: if you figure out what text facilitates learning, that is interesting regardless of how such text is gathered). You may also eventually train on more than 100M words through augmentation, but only in a closed system, i.e., the augmenters' own training data counts toward the limit. For example, training two LMs on half of the words each, having them generate more words, and then training a model on both the original data and the generated data is legitimate (and this was not tested in the previous competition, so even the example itself is interesting); a back-of-the-envelope sketch of the accounting follows below. Note that the Interaction track allows one additional tool: the world to interact with.
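The sketch below walks through the closed-system example; all word counts are made up for illustration, and only the budget rule itself comes from the rules above.

```python
# Illustrative budget accounting for the closed-system augmentation example above.
# All word counts are invented; the point is that every word of non-synthetic text
# used anywhere in the pipeline must come from the same 100M-word allowance.
BUDGET = 100_000_000

corpus = 100_000_000               # the natural-text pool, within the allowance
augmenter_a_words = corpus // 2    # first augmenter LM trains on one half
augmenter_b_words = corpus // 2    # second augmenter LM trains on the other half
synthetic_words = 50_000_000       # text generated by the two augmenters

# Natural text consumed by the augmenters stays within the allowance ...
assert augmenter_a_words + augmenter_b_words <= BUDGET

# ... so the final model may train on the original pool plus the synthetic text,
# because no new natural text enters the pipeline.
final_training_words = corpus + synthetic_words
print(f"Final model trains on {final_training_words:,} words; "
      f"natural text used: {corpus:,} (budget {BUDGET:,}).")
```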
I have different modalities that can help
If it is not linguistic data, prove that it helps: last year's submissions did not gain from non-linguistic grounding, but we encourage such scientific questions. If it is linguistic in nature (e.g., audio), then its words still count towards the overall number of learned words.
Interaction track: Can I get non-verbal cues from the teacher?
Yes. Note, however, that the student’s outputs are limited.
Are we allowed to evaluate our model on outside benchmarks and use these results to select our model's hyperparameters?
Yes.