---
license: cc
language:
- ja
library_name: transformers
---

### Summary

This is a text classifier for assigning a [JLPT level](https://www.jlpt.jp/e/about/levelsummary.html) to Japanese text. It operates at the sentence level: a pre-trained [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) is fine-tuned on ~5,000 labeled sentences collected from language-learning websites.

Performance on held-out data from the same distribution is good:

```
              precision    recall  f1-score   support

          N5       0.88      0.88      0.88        25
          N4       0.90      0.89      0.90        53
          N3       0.78      0.90      0.84        62
          N2       0.71      0.79      0.75        47
          N1       0.95      0.77      0.85        73

    accuracy                           0.84       260
   macro avg       0.84      0.84      0.84       260
weighted avg       0.85      0.84      0.84       260
```

On a test set consisting of official JLPT material, however, performance drops considerably:

```
              precision    recall  f1-score   support

          N5       0.62      0.66      0.64       145
          N4       0.34      0.36      0.35       143
          N3       0.33      0.67      0.45       197
          N2       0.26      0.20      0.23       192
          N1       0.59      0.08      0.15       202

    accuracy                           0.38       879
   macro avg       0.43      0.39      0.36       879
weighted avg       0.42      0.38      0.34       879
```

Still, the model can give a ballpark estimate of sentence difficulty, even if it is not very precise.

### Cite

```
@inproceedings{benedetti-etal-2024-automatically,
    title = "Automatically Suggesting Diverse Example Sentences for {L}2 {J}apanese Learners Using Pre-Trained Language Models",
    author = "Benedetti, Enrico and
      Aizawa, Akiko and
      Boudin, Florian",
    editor = "Fu, Xiyan and
      Fleisig, Eve",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-srw.11",
    pages = "114--131"
}
```
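### Usage

The card does not include inference code, so below is a minimal sketch using the `transformers` pipeline API. The repository id placeholder and the `N5`–`N1` label names are assumptions; substitute this model's actual Hub id, and note that the label strings depend on the `id2label` mapping saved with the fine-tuned model.

```python
from transformers import pipeline

# Hypothetical placeholder -- replace with this repository's actual Hub id.
MODEL_ID = "<this-repo-id>"

# The cl-tohoku Japanese BERT tokenizer relies on MeCab, so
# `pip install fugashi unidic-lite` may be needed alongside transformers.
classifier = pipeline("text-classification", model=MODEL_ID)

# Classify a single sentence; the pipeline returns the top label and score.
# Label names here assume id2label maps the five classes to N5..N1.
result = classifier("猫が好きです。")  # "I like cats."
print(result)  # e.g. [{'label': 'N5', 'score': 0.97}]
```

Since the classifier was trained at the sentence level, longer passages should be split into sentences and classified individually rather than passed in whole.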