EGD DistilBERT (Multilingual Cased)

Model Overview

This model is based on distilbert-base-multilingual-cased and has been fine-tuned on English, Hungarian, and German data to classify European Parliament speeches into rhetorical categories.

The model classifies text into three categories:

  • 0 - Other (text that does not fit into moralist or realist categories)
  • 1 - Moralist (arguments emphasizing moral reasoning)
  • 2 - Realist (arguments applying pragmatic or realist reasoning)

This model is useful for analyzing political discourse and rhetorical styles in multiple languages.
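The three class IDs above can be mapped to human-readable names with a small lookup. A minimal sketch (the `ID2LABEL` dict and `label_name` helper are illustrative, not part of the model's published config):

```python
# Hypothetical mapping mirroring the three categories listed above
ID2LABEL = {0: "Other", 1: "Moralist", 2: "Realist"}

def label_name(class_id: int) -> str:
    """Return the category name for a predicted class ID."""
    return ID2LABEL.get(class_id, "Unknown")
```

This keeps downstream analysis code readable when working with the raw integer predictions.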


Evaluation Results

The model was evaluated on a test set of 938 sentences, with the following results:

Label         Precision  Recall  F1-score  Support
0 - Other          0.91    0.92      0.92      783
1 - Moralist       0.49    0.40      0.44       65
2 - Realist        0.43    0.44      0.44       90
  • Overall accuracy: 0.84
  • Macro average F1-score: 0.60
  • Weighted average F1-score: 0.84

The model reliably distinguishes the general (other) class from moralist and realist arguments, though performance on the minority classes (1 and 2) is lower.
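The macro and weighted averages follow directly from the per-class F1 scores and supports in the table. A quick sketch verifying the arithmetic (values copied from the table above):

```python
# Per-class F1 scores and test-set supports from the evaluation table
f1 = {"Other": 0.92, "Moralist": 0.44, "Realist": 0.44}
support = {"Other": 783, "Moralist": 65, "Realist": 90}

# Macro average: unweighted mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean weighted by class support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"Macro F1: {macro_f1:.2f}")       # 0.60
print(f"Weighted F1: {weighted_f1:.2f}")  # 0.84
```

The gap between the two averages reflects the class imbalance: the dominant "Other" class (783 of 938 sentences) pulls the weighted score up.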


Usage

This model can be used with the Hugging Face Transformers library:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "uvegesistvan/EGD_distilbert-base-multilingual-cased"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify an example text
text = "The European Union has a responsibility towards future generations."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Get predicted class (0 = Other, 1 = Moralist, 2 = Realist)
predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted class: {predicted_class}")
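If a confidence score is needed rather than just the argmax, the logits can be converted to probabilities with a softmax. A self-contained sketch using pure Python (the example logit values are made up for illustration; in practice they come from `outputs.logits`):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the three classes: Other, Moralist, Realist
probs = softmax([2.1, -0.3, 0.4])
print([round(p, 3) for p in probs])
```

In a torch pipeline the equivalent one-liner is `torch.softmax(logits, dim=-1)`.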