BabyLM Challenge
Sample-efficient pretraining on a developmentally plausible corpus
Can papers be submitted to multiple tracks?
Yes. For example, a single paper can describe models that are submitted separately to the strict and strict-small tracks.
Can I submit a paper about my work?
Yes, we encourage all participants to submit their reports, which will be published in the proceedings of CoNLL. You may also describe any additional experiments beyond those required for the shared task evaluation.
Can I submit additional evaluation metrics?
Yes. You may submit your own evaluation metrics along with your model's performance on them. These will be considered alongside our standardized evaluation results as part of the holistic evaluation in the loose track.
What training regimes are permitted?
For the strict and strict-small tracks, any kind of training objective/regime is permitted, as long as the data restrictions are followed. Pretrained models may not be used for any purpose, including reranking or data augmentation. For evaluation purposes, however, we require that the model provide a function to score a sequence (e.g., log-likelihood for autoregressive models or pseudo-log-likelihood for masked language models) without the need for additional fine-tuning.
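As a rough illustration of the two scoring functions named above, the following sketch computes a log-likelihood for an autoregressive model and a pseudo-log-likelihood for a masked model. The callable interfaces (next_token_probs, masked_token_probs) and the toy models are assumptions made for this example only; they are not part of the shared task's evaluation pipeline, and a real submission would wrap its own model instead.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interfaces (assumptions for this sketch):
# - an autoregressive model maps a prefix to a next-token distribution
# - a masked model maps (tokens, position) to a distribution for that position
NextTokenProbs = Callable[[Sequence[str]], Dict[str, float]]
MaskedTokenProbs = Callable[[Sequence[str], int], Dict[str, float]]


def log_likelihood(tokens: Sequence[str], next_token_probs: NextTokenProbs) -> float:
    """Sum over positions i of log p(token_i | tokens before i)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = next_token_probs(tokens[:i])
        total += math.log(probs[tok])
    return total


def pseudo_log_likelihood(tokens: Sequence[str], masked_token_probs: MaskedTokenProbs) -> float:
    """Sum over positions i of log p(token_i | all other tokens), masking one position at a time."""
    total = 0.0
    for i, tok in enumerate(tokens):
        probs = masked_token_probs(tokens, i)
        total += math.log(probs[tok])
    return total


# Toy stand-ins for real models, used only to make the sketch runnable.
VOCAB = ["the", "dog", "barks", "bark"]


def toy_next_token_probs(prefix: Sequence[str]) -> Dict[str, float]:
    # Prefers "barks" after "dog"; uniform otherwise.
    if prefix and prefix[-1] == "dog":
        return {"the": 0.1, "dog": 0.1, "barks": 0.7, "bark": 0.1}
    return {w: 0.25 for w in VOCAB}


def toy_masked_token_probs(tokens: Sequence[str], position: int) -> Dict[str, float]:
    # For simplicity, this toy masked model only looks at the left context.
    return toy_next_token_probs(tokens[:position])


good = ["the", "dog", "barks"]
bad = ["the", "dog", "bark"]
print(log_likelihood(good, toy_next_token_probs) > log_likelihood(bad, toy_next_token_probs))
print(pseudo_log_likelihood(good, toy_masked_token_probs))
```

Both functions return a single scalar per sequence, so two candidate sentences can be compared directly by their scores. The sketch assumes every token is in the model's vocabulary; a real scorer would need to handle subword tokenization and unknown tokens.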
Are there any limits on hyperparameters or model scale?
No. In the loose track, parameter efficiency and training efficiency may be considered along with other factors in ranking submissions, but we do not impose any hard limits.
Are there any limits on the number of epochs?
No. We put no restrictions on the number of epochs, for several reasons. First, from an engineering perspective, training LMs with SGD tends to require multiple epochs at these scales to achieve peak performance. Second, from a cognitive perspective, humans have a memory of linguistic experience, and can continue to access and learn from these memories. Third, we prefer not to take a stand on implementation details, so as to allow the most freedom for innovation.
Are we allowed to use annotations from outside systems that rely on expert-annotated data, like POS taggers?
For the strict and strict-small tracks, we do not allow any systems that are trained on outside data, especially if they rely on expert-annotated data. However, a system is allowed if it is a hard-coded inductive bias that (for example) clusters words into n categories in an unsupervised manner, or if it relies on some outside clustering system trained only on the strict(-small) dataset. In short, we allow systems that annotate (but do not augment) the pretraining dataset, and which are either hard-coded heuristics or trained systems that rely only on the pretraining data we release.
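To make the distinction between annotating and augmenting concrete, here is a minimal sketch of a hard-coded heuristic annotator. The suffix rules and tag names (DET, VERB_LIKE, ADV_LIKE, OTHER) are hypothetical and chosen only for illustration; the sketch uses no learned parameters and no outside data, and it attaches labels to existing tokens without adding any new text.

```python
from typing import List, Tuple


def heuristic_tag(token: str) -> str:
    """Assign a coarse tag using hard-coded rules only (no training, no outside data)."""
    lowered = token.lower()
    if lowered in {"the", "a", "an"}:
        return "DET"
    if lowered.endswith("ing") or lowered.endswith("ed"):
        return "VERB_LIKE"
    if lowered.endswith("ly"):
        return "ADV_LIKE"
    return "OTHER"


def annotate(sentence: str) -> List[Tuple[str, str]]:
    """Pair each whitespace token with its tag; the token sequence itself is unchanged."""
    return [(tok, heuristic_tag(tok)) for tok in sentence.split()]


print(annotate("the dog barked loudly"))
```

The output contains exactly the tokens of the input sentence, each paired with a label, so the amount of pretraining text stays the same. Whether a particular annotation scheme is acceptable for a given track remains a question for the rules stated above, not for this sketch.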
Are we allowed to evaluate our model on outside benchmarks and use these results to select our model's hyperparameters?