ymoslem committed
Commit 4d7b3a7 · verified · 1 Parent(s): c04bed3

Update README.md

Files changed (1):
  1. README.md +157 -12
README.md CHANGED
@@ -37,32 +37,54 @@ datasets:
  - ymoslem/wmt-da-human-evaluation-long-context
  model-index:
  - name: Quality Estimation for Machine Translation
- results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # Quality Estimation for Machine Translation

  This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the ymoslem/wmt-da-human-evaluation-long-context dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.0214

  ## Model description

- More information needed
-
- ## Intended uses & limitations
-
- More information needed

  ## Training and evaluation data

- More information needed

  ## Training procedure

  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -72,7 +94,7 @@ The following hyperparameters were used during training:
  - seed: 42
  - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  - lr_scheduler_type: linear
- - training_steps: 60000

  ### Training results

@@ -146,3 +168,126 @@ The following hyperparameters were used during training:
  - Pytorch 2.4.1+cu124
  - Datasets 3.2.0
  - Tokenizers 0.21.0
@@ -37,32 +37,54 @@ datasets:
  - ymoslem/wmt-da-human-evaluation-long-context
  model-index:
  - name: Quality Estimation for Machine Translation
+   results:
+   - task:
+       type: regression
+     dataset:
+       name: ymoslem/wmt-da-human-evaluation-long-context
+       type: QE
+     metrics:
+     - name: Pearson Correlation
+       type: Pearson
+       value: 0.5013
+     - name: Mean Absolute Error
+       type: MAE
+       value: 0.1024
+     - name: Root Mean Squared Error
+       type: RMSE
+       value: 0.1464
+     - name: R-Squared
+       type: R2
+       value: 0.251
+ metrics:
+ - pearsonr
+ - mae
+ - r_squared
  ---

  # Quality Estimation for Machine Translation

  This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the ymoslem/wmt-da-human-evaluation-long-context dataset.
  It achieves the following results on the evaluation set:
+ - Last checkpoint: Loss 0.0214
+ - Best checkpoint (this one): Loss 0.0214

  ## Model description

+ This model is intended for reference-free quality estimation (QE) of machine translation (MT): given a source text and its machine translation, it predicts a quality score without requiring a reference translation.

  ## Training and evaluation data

+ The model is trained on the long-context dataset [ymoslem/wmt-da-human-evaluation-long-context](https://huggingface.co/datasets/ymoslem/wmt-da-human-evaluation-long-context).
+
+ * Training: 7.65 million long-context texts
+ * Test: 59,235 long-context texts
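
For orientation, the following is a minimal sketch of how this dataset could be loaded and inspected. The split names ("train"/"test") and column names (e.g. `src`, `mt`, and the score field) are assumptions to verify against the printed features, not guarantees from this card.

```python
from datasets import load_dataset

# Load the long-context QE dataset (splits are assumed to be "train" and "test")
dataset = load_dataset("ymoslem/wmt-da-human-evaluation-long-context")

# Inspect split sizes and column names before relying on them
print(dataset)
print(dataset["train"].features)
```
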

  ## Training procedure

+ - tokenizer.model_max_length: 8192 (full context length)
+ - attn_implementation: flash_attention_2
+
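The training script itself is not part of this card, so the following is only a sketch of how the two settings above could be applied when preparing ModernBERT-base for regression fine-tuning; `num_labels=1` and `problem_type="regression"` are assumptions inferred from the model's behaviour at inference time, not the author's code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_model = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.model_max_length = 8192  # use the full 8192-token context length

model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    num_labels=1,                             # single regression output (assumption)
    problem_type="regression",                # regression head with MSE loss (assumption)
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # as listed above
)
```
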
  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -72,7 +94,7 @@ The following hyperparameters were used during training:
  - seed: 42
  - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  - lr_scheduler_type: linear
+ - training_steps: 60000 (approx. 1 epoch)

  ### Training results

@@ -146,3 +168,126 @@ The following hyperparameters were used during training:
  - Pytorch 2.4.1+cu124
  - Datasets 3.2.0
  - Tokenizers 0.21.0
+
+ ## Inference
+
+ 1. Install the required libraries.
+
+ ```bash
+ pip3 install --upgrade datasets accelerate transformers
+ pip3 install --upgrade flash_attn triton
+ ```
+
+ 2. Load the test dataset.
+
+ ```python
+ from datasets import load_dataset
+
+ test_dataset = load_dataset("ymoslem/wmt-da-human-evaluation",
+                             split="test",
+                             trust_remote_code=True
+                             )
+ print(test_dataset)
+ ```
+
+ 3. Load the model and tokenizer.
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ import torch
+
+ # Load the fine-tuned model and tokenizer
+ model_name = "ymoslem/ModernBERT-base-long-context-qe-v1"
+ model = AutoModelForSequenceClassification.from_pretrained(
+     model_name,
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ # Move the model to the GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+ model.eval()
+ ```
+
+ 4. Prepare the dataset. Each input concatenates the source segment `src` and the target segment `tgt`, separated by the `sep_token`, which is `'</s>'` for ModernBERT.
+
+ ```python
+ sep_token = tokenizer.sep_token
+ input_test_texts = [f"{src} {sep_token} {tgt}" for src, tgt in zip(test_dataset["src"], test_dataset["mt"])]
+ ```
+
+ 5. Generate predictions.
+
+ Although `model.config.problem_type` is `regression`, you can still use the "text-classification" pipeline (cf. [pipeline documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TextClassificationPipeline)):
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification",
+                       model=model_name,
+                       tokenizer=tokenizer,
+                       device=0,
+                       )
+
+ predictions = classifier(input_test_texts,
+                          batch_size=128,
+                          truncation=True,
+                          padding="max_length",
+                          max_length=tokenizer.model_max_length,
+                          )
+ predictions = [prediction["score"] for prediction in predictions]
+ ```
+
+ Alternatively, you can use a more elaborate version of the code, which is slightly faster and gives you more control over tokenization and batching.
+
+ ```python
+ from torch.utils.data import DataLoader
+ import torch
+ from tqdm.auto import tqdm
+
+ # Tokenization function
+ def process_batch(batch, tokenizer, device):
+     sep_token = tokenizer.sep_token
+     input_texts = [f"{src} {sep_token} {tgt}" for src, tgt in zip(batch["src"], batch["mt"])]
+     tokens = tokenizer(input_texts,
+                        truncation=True,
+                        padding="max_length",
+                        max_length=tokenizer.model_max_length,
+                        return_tensors="pt",
+                        ).to(device)
+     return tokens
+
+
+ # Create a DataLoader for batching
+ test_dataloader = DataLoader(test_dataset,
+                              batch_size=128,  # Adjust batch size as needed
+                              shuffle=False)
+
+ # List to store all predictions
+ predictions = []
+
+ with torch.no_grad():
+     for batch in tqdm(test_dataloader, desc="Inference Progress", unit="batch"):
+
+         tokens = process_batch(batch, tokenizer, device)
+
+         # Forward pass: generate the model's logits
+         outputs = model(**tokens)
+
+         # Get logits (predictions)
+         logits = outputs.logits
+
+         # Extract the regression predicted values
+         batch_predictions = logits.squeeze()
+
+         # Extend the list with the predictions
+         predictions.extend(batch_predictions.tolist())
+ ```
+
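
If you want to reproduce card-style metrics (Pearson correlation, MAE, RMSE, R²) from these predictions, you can compare them with the human scores in the test set. The snippet below is a sketch, assuming the gold scores live in a column named `score` (check the dataset features first) and that `scipy` and `scikit-learn` are installed:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumes the human quality scores are stored in a column named "score"
gold = np.array(test_dataset["score"], dtype=np.float32)
preds = np.array(predictions, dtype=np.float32)

print("Pearson r:", pearsonr(preds, gold)[0])
print("MAE:", mean_absolute_error(gold, preds))
print("RMSE:", np.sqrt(mean_squared_error(gold, preds)))
print("R^2:", r2_score(gold, preds))
```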