---
license: cc
language:
- ja
library_name: transformers
---
### Summary
This is a sentence-level text classifier that assigns a [JLPT level](https://www.jlpt.jp/e/about/levelsummary.html) (N5, easiest, through N1, hardest) to Japanese sentences.
A pre-trained [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) model is fine-tuned on ~5,000 labeled sentences collected from language-learning websites.
On held-out data from the same distribution, performance is good:
```
              precision    recall  f1-score   support

          N5       0.88      0.88      0.88        25
          N4       0.90      0.89      0.90        53
          N3       0.78      0.90      0.84        62
          N2       0.71      0.79      0.75        47
          N1       0.95      0.77      0.85        73

    accuracy                           0.84       260
   macro avg       0.84      0.84      0.84       260
weighted avg       0.85      0.84      0.84       260
```
On test data consisting of official JLPT material, however, performance drops considerably:
```
              precision    recall  f1-score   support

          N5       0.62      0.66      0.64       145
          N4       0.34      0.36      0.35       143
          N3       0.33      0.67      0.45       197
          N2       0.26      0.20      0.23       192
          N1       0.59      0.08      0.15       202

    accuracy                           0.38       879
   macro avg       0.43      0.39      0.36       879
weighted avg       0.42      0.38      0.34       879
```
Still, the model can give a rough, ballpark estimate of sentence difficulty.
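For inference, a minimal sketch using the standard transformers text-classification pipeline may look like the following. The `"<repo-id>"` placeholder and the assumption that the labels are the strings `N1`–`N5` should be verified against this repository's `config.json`; the helper `level_from_scores` is illustrative, not part of the model.

```python
def level_from_scores(scores):
    """Return the label of the highest-scoring JLPT level.

    `scores` is a list of {"label": ..., "score": ...} dicts, as produced
    by a transformers text-classification pipeline with top_k=None.
    """
    return max(scores, key=lambda s: s["score"])["label"]


if __name__ == "__main__":
    from transformers import pipeline

    # Replace "<repo-id>" with this model's Hub identifier.
    clf = pipeline("text-classification", model="<repo-id>", top_k=None)
    sentence = "私は毎日日本語を勉強しています。"  # "I study Japanese every day."
    scores = clf(sentence)  # one score per JLPT label
    print(level_from_scores(scores))
```

Since scores for all five levels are returned, the full distribution can also be inspected when a single hard label is too coarse for the intended use.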
### Cite
```
@inproceedings{benedetti-etal-2024-automatically,
title = "Automatically Suggesting Diverse Example Sentences for {L}2 {J}apanese Learners Using Pre-Trained Language Models",
author = "Benedetti, Enrico and
Aizawa, Akiko and
Boudin, Florian",
editor = "Fu, Xiyan and
Fleisig, Eve",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-srw.11",
pages = "114--131"
}
```