BabyLM Challenge
Sample-efficient pretraining on a developmentally plausible corpus
Can papers be submitted to multiple tracks?
Yes. For example, a single paper can describe models which are submitted separately to the strict and strict-small tracks. Additionally, in theory, models that qualify for the strict-small and strict tracks also qualify for the multimodal track (as they are trained within the 100M word budget). However, we encourage only models that engage with the multimodal pretraining objective to actually be submitted to the multimodal track :)
Can I submit a paper about my work?
Yes, all participants should submit a report, which will be published in a proceedings volume. You may also describe any additional experiments beyond those required for the shared task evaluation.
Can I submit additional evaluation metrics?
Yes. You are welcome to submit your own evaluation metrics, along with your model's performance on them. These will be considered alongside our standardized evaluation results as part of the holistic evaluation that determines outstanding paper awards.
What training regimes are permitted?
Any kind of training objective/regime is permitted, as long as the data restrictions are followed. Models pretrained on outside data may not be used for any purpose, including reranking or data augmentation. We do, however, require for evaluation purposes that the model provide a function to score a sequence (e.g., log-likelihood for autoregressive models or pseudo-log-likelihood for masked language models) without the need for additional fine-tuning.
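As a rough illustration of the scoring interface we have in mind, here is a minimal sketch assuming a Hugging Face-style autoregressive model. The checkpoint name and the score_sequence helper are hypothetical placeholders; this is not the official evaluation code.

```python
# Minimal sketch of a sequence-scoring function for an autoregressive LM.
# "my-babylm-model" is a placeholder checkpoint, not a real release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-babylm-model")
model = AutoModelForCausalLM.from_pretrained("my-babylm-model")
model.eval()

def score_sequence(text: str) -> float:
    """Return the total log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels provided, the returned loss is the mean negative
        # log-likelihood over the predicted tokens; multiply by the number
        # of predicted tokens to recover the summed log-likelihood.
        out = model(ids, labels=ids)
    n_predicted = ids.size(1) - 1  # the first token is not predicted
    return -out.loss.item() * n_predicted

print(score_sequence("The child is reading a book."))
```

For masked language models, an analogous function would return the pseudo-log-likelihood, i.e., the sum over positions of the log-probability of each token when that token is masked.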
Are there any limits on hyperparameters or model scale?
No. We may consider parameter efficiency and training efficiency when awarding outstanding paper awards, but we do not impose any hard limits.
Are there any limits on the number of epochs?
No. We put no restrictions on the number of epochs, for several reasons: First, from an engineering perspective, training LMs with SGD tends to require multiple epochs at these scales to achieve peak performance. Second, from a cognitive perspective, humans have a memory of linguistic experience, and can continue to access and learn from these memories. Third, we aim not to take a stand on implementation details to allow the most freedom for innovation.
Are we allowed to use annotations from outside systems that rely on expert-annotated data, like POS taggers?
For the strict and strict-small tracks, we do not allow any systems that are trained on outside data, especially if they rely on expert-annotated data. However, a hard-coded inductive bias that (for example) clusters words into n categories in an unsupervised manner, or an off-the-shelf clustering algorithm that is trained only on the strict(-small) dataset, is allowed. In short, we allow systems that annotate (but do not augment) the pretraining dataset, and which are either hard-coded heuristics or trained systems that rely only on the pretraining data we release; a toy sketch of such a system follows.
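To make the boundary concrete, here is a toy sketch of the kind of annotator that would be allowed: an unsupervised word-clustering system induced only from the released pretraining text. The file path, vocabulary size, cluster count, and window size are arbitrary placeholders, not part of the rules.

```python
# Toy illustration of a permitted annotator: unsupervised "POS-like" word
# clusters learned only from the released pretraining corpus.
from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

corpus = open("babylm_10M/train.txt").read().lower().split()  # placeholder path

# Use the 50 most frequent words as context features.
context_words = [w for w, _ in Counter(corpus).most_common(50)]
ctx_index = {w: i for i, w in enumerate(context_words)}

# Build co-occurrence vectors (word x context word) within a +/-2 token window.
vocab = [w for w, _ in Counter(corpus).most_common(5000)]
vecs = defaultdict(lambda: np.zeros(len(context_words)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        if j != i and corpus[j] in ctx_index:
            vecs[w][ctx_index[corpus[j]]] += 1

# Cluster words into 20 unsupervised categories; these labels could then be
# used to annotate (but not augment) the pretraining data.
X = np.array([vecs[w] for w in vocab])
labels = KMeans(n_clusters=20, n_init=10).fit_predict(X)
clusters = dict(zip(vocab, labels))
```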
Are we allowed to evaluate our model on outside benchmarks and use these results to select our model's hyperparameters?
Yes.