bert-base-cased-finetuned

This model is a fine-tuned version of bert-base-cased on the usa_news_en dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0900
  • Accuracy: 0.9800

Model description

The fine-tuned model is a binary classifier that determines whether two English news headlines are related or not related. More details can be found in the paper News Gathering: Leveraging Transformers to Rank News. To use the fine-tuned model, you can follow the steps outlined below:

import numpy as np
import evaluate

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer

### 1. Load the model:
model_name = "cmunhozc/news-ranking-ft-bert"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

### 2. Dataset:
def preprocess_fctn(examples):
  return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)
...
encoded_dataset = dataset.map(preprocess_fctn, batched=True, load_from_cache_file=False)
...

### 3. Evaluation:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return metric.compute(predictions=predictions, references=labels)

trainer_hf = Trainer(model,
                     eval_dataset    = encoded_dataset['validation'],
                     tokenizer       = tokenizer,
                     compute_metrics = compute_metrics)

trainer_hf.evaluate()

predictions = trainer_hf.predict(encoded_dataset["validation"])
acc_val = metric.compute(predictions=np.argmax(predictions.predictions, axis=1).tolist(),
                         references=predictions.label_ids)['accuracy']
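
As a quick check outside the Trainer API, the model can also score a single headline pair directly. The sketch below is illustrative: the two headlines are made-up examples, and treating label index 1 as the "related" class is an assumption, since the label mapping is not documented in this card.

import torch

# Example headlines (hypothetical).
headline_a = "Fed raises interest rates by a quarter point"
headline_b = "US central bank lifts rates amid inflation fears"

inputs = tokenizer(headline_a, headline_b, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
# Assumes label index 1 corresponds to the "related" class.
print(f"related probability: {probs[0, 1].item():.4f}")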

Finally, with the classification model above, you can follow the steps below to generate the news ranking:

  • For each news article in the google_news_en dataset positioned as the first element in a pair, retrieve all corresponding pairs from the dataset.
  • Using the pair encoders, rank the news articles that occupy the second position in each pair according to their relevance to the first article.
  • Sort each list generated by the encoders by the probability obtained for the relevance class, as sketched in the example after this list.
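
A minimal sketch of this ranking step is shown below. It reuses the model and tokenizer loaded in step 1 and assumes that label index 1 is the relevance ("related") class, which is not documented in this card; rank_candidates and its arguments are illustrative names.

import torch

def rank_candidates(anchor, candidates):
    # Score every (anchor, candidate) pair in one batch.
    inputs = tokenizer([anchor] * len(candidates), candidates,
                       truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes label index 1 is the relevance class.
    rel_probs = torch.softmax(logits, dim=-1)[:, 1].tolist()
    # Sort candidates by descending relevance probability.
    order = sorted(range(len(candidates)), key=lambda i: rel_probs[i], reverse=True)
    return [(candidates[i], rel_probs[i]) for i in order]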

Intended uses & limitations

More information needed

Training, evaluation and test data

The training data is sourced from the train split of the usa_news_en dataset, and the same procedure is applied for the validation set. For testing, the data for the text classification model is drawn from the test_1 and test_2 splits, while the ranking model is evaluated on the test split of the google_news_en dataset.
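
For reference, the splits could be loaded as sketched below; the hub ids cmunhozc/usa_news_en and cmunhozc/google_news_en are assumptions inferred from the dataset names in this card.

from datasets import load_dataset

# Hub ids and split names are assumed; adjust them to the actual dataset repositories.
news_pairs = load_dataset("cmunhozc/usa_news_en")
train_ds, val_ds = news_pairs["train"], news_pairs["validation"]
test_1_ds, test_2_ds = news_pairs["test_1"], news_pairs["test_2"]  # classification test splits

ranking_test_ds = load_dataset("cmunhozc/google_news_en", split="test")  # ranking test set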

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
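
Expressed as transformers TrainingArguments, these settings correspond roughly to the sketch below; output_dir and evaluation_strategy are assumptions not stated in this card.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-cased-finetuned",   # assumed output directory
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    evaluation_strategy="epoch",              # assumed; per-epoch results are reported below
)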

Training results

Training Loss   Epoch   Step    Validation Loss   Accuracy
0.0967          1.0     3526    0.0651            0.9771
0.0439          2.0     7052    0.0820            0.9776
0.0231          3.0     10578   0.0900            0.9800

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.0