---
license: cc
language:
- ja
library_name: transformers
---

### Summary

This is a text classifier that assigns a [JLPT level](https://www.jlpt.jp/e/about/levelsummary.html) to Japanese text. It was trained at the sentence level.

A pre-trained [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) model was fine-tuned on approximately 5,000 labeled sentences collected from language-learning websites.
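The actual training script is not reproduced here; the block below is only a minimal sketch of how such a fine-tuning run could look with the `transformers` `Trainer`. The file names, label encoding, and hyperparameters are illustrative assumptions, not the values actually used.

```python
# Minimal fine-tuning sketch (illustrative, not the authors' exact script).
# Assumptions: CSV files with "text" and "label" columns, labels in {N5..N1};
# hyperparameters are placeholders. The Japanese tokenizer needs fugashi + unidic-lite.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["N5", "N4", "N3", "N2", "N1"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")
model = AutoModelForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese-v3",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

# "train.csv" / "dev.csv" are hypothetical file names, not files shipped with this repo.
ds = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
ds = ds.map(lambda ex: {"label": label2id[ex["label"]]})  # map N5..N1 to integer ids
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="jlpt-sentence-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```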

Performance on held-out data from the same distribution is good:

```
              precision    recall  f1-score   support

          N5       0.88      0.88      0.88        25
          N4       0.90      0.89      0.90        53
          N3       0.78      0.90      0.84        62
          N2       0.71      0.79      0.75        47
          N1       0.95      0.77      0.85        73

    accuracy                           0.84       260
   macro avg       0.84      0.84      0.84       260
weighted avg       0.85      0.84      0.84       260
```

On a test set consisting of official JLPT material, however, performance is considerably worse:

```
              precision    recall  f1-score   support

          N5       0.62      0.66      0.64       145
          N4       0.34      0.36      0.35       143
          N3       0.33      0.67      0.45       197
          N2       0.26      0.20      0.23       192
          N1       0.59      0.08      0.15       202

    accuracy                           0.38       879
   macro avg       0.43      0.39      0.36       879
weighted avg       0.42      0.38      0.34       879
```
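
The two tables above follow the format of scikit-learn's `classification_report`. For reference, here is a minimal sketch of how such a report can be computed from gold labels and model predictions; the values shown are toy data, not the actual evaluation sets.

```python
# Toy sketch of producing a classification report like the ones above;
# y_true / y_pred would come from the evaluation data and the model's predictions.
from sklearn.metrics import classification_report

y_true = ["N5", "N4", "N3", "N2", "N1", "N3"]  # gold JLPT levels (toy values)
y_pred = ["N5", "N4", "N2", "N2", "N1", "N3"]  # predicted JLPT levels (toy values)
print(classification_report(y_true, y_pred, labels=["N5", "N4", "N3", "N2", "N1"]))
```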

Still, the model can provide a rough, ballpark estimate of sentence difficulty.
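
For quick experimentation, a minimal inference sketch using the `transformers` pipeline API is shown below. The model id is a placeholder for this repository's Hub id, the example sentences are arbitrary, and the exact label strings returned depend on the label mapping stored in the model config.

```python
# Minimal inference sketch. Replace "<this-repo-id>" with this model's Hugging Face
# Hub id. The Japanese tokenizer requires fugashi + unidic-lite to be installed.
from transformers import pipeline

classifier = pipeline("text-classification", model="<this-repo-id>")

sentences = [
    "私は学生です。",                          # a short, easy sentence
    "彼の提案は現実を踏まえたものとは言い難い。",  # a harder sentence
]
for sentence in sentences:
    print(sentence, classifier(sentence))
```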

### Cite

```
@inproceedings{benedetti-etal-2024-automatically,
    title = "Automatically Suggesting Diverse Example Sentences for {L}2 {J}apanese Learners Using Pre-Trained Language Models",
    author = "Benedetti, Enrico and
      Aizawa, Akiko and
      Boudin, Florian",
    editor = "Fu, Xiyan and
      Fleisig, Eve",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-srw.11",
    pages = "114--131"
}
```