Data Scorer
This model scores data for data selection, as described in the paper Data Selection via Optimal Control for Language Models. To use the model, follow the instructions here.
NOTE: You may need to download the fairseq-125M model to ${PATH_TO_DATA_SELECTION_REPO}/checkpoints/fairseq/125M
so that the tokenizer and config.json of the base model are available.
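One way to fetch those files is sketched below. It is a minimal example, not the repository's official setup script. It assumes that "fairseq-125M" refers to the base model listed on this card (KoboldAI/fairseq-dense-125M), that the `huggingface_hub` package is installed, and that `PATH_TO_DATA_SELECTION_REPO` is set as an environment variable. The helper names `base_model_dir` and `download_base_model` are hypothetical and exist only for illustration.

```python
import os
from pathlib import Path


def base_model_dir(repo_root: str) -> Path:
    """Return the checkpoint directory named in the note above."""
    return Path(repo_root) / "checkpoints" / "fairseq" / "125M"


def download_base_model(repo_root: str,
                        repo_id: str = "KoboldAI/fairseq-dense-125M") -> Path:
    """Download the base model files (tokenizer, config.json, weights).

    Assumption: repo_id is the base model shown on this model card.
    """
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = base_model_dir(repo_root)
    target.mkdir(parents=True, exist_ok=True)
    # local_dir places the files directly in the target directory.
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    repo_root = os.environ.get("PATH_TO_DATA_SELECTION_REPO")
    if repo_root is None:
        # No repository path configured: only show the expected layout.
        print("Set PATH_TO_DATA_SELECTION_REPO; files would go to",
              base_model_dir("<repo>"))
    else:
        print("Downloaded to", download_base_model(repo_root))
```

This downloads the whole model repository rather than only the tokenizer and config.json, since the exact file names needed are not listed here.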
Citation
@article{gu2024data,
title={Data Selection via Optimal Control for Language Models},
author={Gu, Yuxian and Dong, Li and Wang, Hongning and Hao, Yaru and Dong, Qingxiu and Wei, Furu and Huang, Minlie},
journal={arXiv preprint arXiv:2410.07064},
year={2024}
}
Model tree for Data-Selection/data_scorer
Base model
KoboldAI/fairseq-dense-125M