DistilBERT Token Classification Model for Unit Conversion

Model Overview

This model is a fine-tuned version of distilbert/distilbert-base-uncased for token classification on unit conversion-related text. It is designed to recognize unit values and conversion entities, facilitating automatic extraction of unit-related data.

Dataset

The model is trained on the maliknaik/natural_unit_conversion dataset, which contains:

  • Training set: 583,863 examples
  • Validation set: 100,091 examples
  • Test set: 150,137 examples

Each example consists of:

  • text: The input sentence containing unit-related phrases.
  • entities: The labeled entities specifying unit values and types.

Dataset URL: https://huggingface.co/datasets/maliknaik/natural_unit_conversion

Labels

The model classifies tokens into the following categories:

  • B-FROM_UNIT: Beginning of the source unit
  • I-FROM_UNIT: Inside the source unit
  • B-TO_UNIT: Beginning of the target unit
  • I-TO_UNIT: Inside the target unit
  • B-FEET_VALUE: Beginning of feet value
  • I-FEET_VALUE: Inside feet value
  • B-INCH_VALUE: Beginning of inch value
  • I-INCH_VALUE: Inside inch value
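
As a rough illustration of how these BIO labels map to class indices, the sketch below builds id2label and label2id dictionaries of the kind a token classification head uses. The "O" (outside) label and the index order are assumptions, since the card does not list them; the mapping the model was actually trained with can be read from model.config.id2label.

```python
# Entity types listed in the Labels section of this card.
ENTITY_TYPES = ["FROM_UNIT", "TO_UNIT", "FEET_VALUE", "INCH_VALUE"]

# Assumption: an "O" (outside) label at index 0, followed by B-/I- pairs.
# The real index order used in training may differ.
labels = ["O"] + [
    f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")
]

id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(labels)
```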

Training Details

  • Base Model: distilbert/distilbert-base-uncased
  • Tokenization: AutoTokenizer from Hugging Face Transformers
  • Training Framework: Hugging Face Trainer
  • Data Collator: DataCollatorForTokenClassification
  • Loss Function: CrossEntropyLoss
  • Batch Size: 64
  • Epochs: 10
  • GPU: 1x NVIDIA Tesla P4 (8GB GDDR5)
  • CPU: 56 vCPUs
  • RAM: 283GB
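
To give a sense of the training length these settings imply, the following back-of-the-envelope sketch counts optimizer steps. It assumes the batch size of 64 applies to the single GPU, that no gradient accumulation is used, and that the final partial batch of each epoch is kept; none of these is stated in the card.

```python
import math

train_examples = 583_863  # training set size from the Dataset section
batch_size = 64
epochs = 10

# Assumes one device, no gradient accumulation, and a kept final partial batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```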

Usage

To use this model for inference:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'maliknaik/distilbert-natural-unit-conversion'

model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = 'How many miles are there in 50 kilometers?'

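# aggregation_strategy='simple' merges subword tokens into whole entities,
# which yields the 'entity_group' keys shown in the output below.
unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')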
print(unit_pipeline(text))

Output:

[{'entity_group': 'TO_UNIT',
  'score': np.float32(0.9999982),
  'word': 'miles',
  'start': 9,
  'end': 14},
 {'entity_group': 'FROM_UNIT',
  'score': np.float32(0.9999473),
  'word': 'kilometers',
  'start': 31,
  'end': 41}]
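
The pipeline result is a flat list, so downstream code will usually want it grouped by entity type. A minimal sketch of that post-processing step is below; group_entities is a hypothetical helper, not part of Transformers, and it works on the sample output copied from above (scores omitted) so it runs without downloading the model.

```python
def group_entities(text, entities):
    """Map each entity group to the text spans the pipeline reported for it."""
    grouped = {}
    for ent in entities:
        # Slice the original text by character offsets to keep its casing.
        span = text[ent["start"]:ent["end"]]
        grouped.setdefault(ent["entity_group"], []).append(span)
    return grouped


text = "How many miles are there in 50 kilometers?"
# Sample pipeline output from this card, with scores omitted.
entities = [
    {"entity_group": "TO_UNIT", "word": "miles", "start": 9, "end": 14},
    {"entity_group": "FROM_UNIT", "word": "kilometers", "start": 31, "end": 41},
]

result = group_entities(text, entities)
print(result)
```

Slicing by start and end rather than reading the word field keeps the original casing, since the uncased tokenizer lowercases its input.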

Performance

The model achieves a high F1 score at identifying unit values and conversion entities. The F1 scores on the validation and test sets are expected to improve with further optimization.
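
The card does not say how F1 was computed. One common convention for NER is entity-level exact-match F1, where a predicted span counts only if its label and boundaries both match the reference. The sketch below illustrates that convention on hypothetical word-level tags; it is an assumption about the metric, not the author's evaluation code.

```python
def extract_spans(tags):
    """Return a set of (label, start, end) spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues:
            if label is not None:
                spans.add((label, start, i))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                # "O" or a stray I- tag: no span is open.
                start, label = None, None
    return spans


def entity_f1(true_tags, pred_tags):
    """Entity-level exact-match F1 for one tagged sequence."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags for "How many miles are there in 50 kilometers ?"
true_tags = ["O", "O", "B-TO_UNIT", "O", "O", "O", "O", "B-FROM_UNIT", "O"]
pred_tags = ["O", "O", "B-TO_UNIT", "O", "O", "O", "O", "O", "O"]

f1 = entity_f1(true_tags, pred_tags)
print(round(f1, 4))
```

Here the prediction finds one of the two reference entities, giving precision 1.0 and recall 0.5.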

Dataset Usage

The maliknaik/natural_unit_conversion dataset can also be used to train other named entity recognition (NER) models, especially for tasks related to unit conversion and natural language understanding.

License

This model is available under the CC0-1.0 license. It is free to use for any purpose without any restrictions.

Contributions

Developed by Malik N. Mohammed, leveraging DistilBERT for efficient NLP token classification.

Citation

If you use this model in your work, please cite it as follows:

@misc{unit-conversion-dataset,
  author = {Malik N. Mohammed},
  title = {Natural Language Unit Conversion Model for Named-Entity Recognition},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
}