Data Scorer

This model scores data for data selection, as described in the paper Data Selection via Optimal Control for Language Models. To use the model, follow the instructions here.

NOTE: You may need to download the fairseq-125M model to ${PATH_TO_DATA_SELECTION_REPO}/checkpoints/fairseq/125M so that the tokenizer and config.json of the base model are available.
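
The download step in the note could be scripted roughly as follows. This is a minimal sketch, not the repository's own tooling: it assumes the base model is hosted on the Hugging Face Hub and that `huggingface_hub` is installed. The helper names `base_model_dir` and `download_base_model` are hypothetical, and the Hub repo id is not stated in this card, so it is left as a parameter you must supply.

```python
from pathlib import Path


def base_model_dir(repo_root):
    """Build the directory the note expects: <repo_root>/checkpoints/fairseq/125M."""
    return Path(repo_root) / "checkpoints" / "fairseq" / "125M"


def download_base_model(repo_id, repo_root):
    """Download a Hub repository into the expected base-model directory.

    ASSUMPTION: `repo_id` is the Hub id of the fairseq-125M base model.
    This card does not give it, so the caller has to provide it.
    """
    # Imported lazily so base_model_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = base_model_dir(repo_root)
    target.mkdir(parents=True, exist_ok=True)
    # Fetches all files of the repository (tokenizer files, config.json, weights).
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

A call would look like `download_base_model(repo_id, os.environ["PATH_TO_DATA_SELECTION_REPO"])`, where `repo_id` is whatever the linked instructions name as the fairseq-125M checkpoint.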

Citation

@article{gu2024data,
  title={Data Selection via Optimal Control for Language Models},
  author={Gu, Yuxian and Dong, Li and Wang, Hongning and Hao, Yaru and Dong, Qingxiu and Wei, Furu and Huang, Minlie},
  journal={arXiv preprint arXiv:2410.07064},
  year={2024}
}