Shiksha Collection: Technical Domain focused Translation Dataset and Model for Indian Languages (4 items)
Use the code below to get started with the model.
import torch
from peft import AutoPeftModelForSeq2SeqLM
from transformers import NllbTokenizerFast
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the adapter model from the Hugging Face Hub, and the tokenizer from the NLLB base model
model = AutoPeftModelForSeq2SeqLM.from_pretrained("SPRINGLab/shiksha-MT-nllb-3.3B", device_map=device)
tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-3.3B")
input_text = "Welcome back to the lecture series in Cell Culture."
# Lang codes: https://github.com/facebookresearch/flores/tree/main/flores200
tgt_lang = "hin_Deva"
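# Tokenize and move all tensors (input IDs and attention mask) to the model's device
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
# Force the first generated token to be the target-language code
output = model.generate(**inputs, max_new_tokens=256, forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang))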
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text[0])
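To translate several sentences at once, the same calls can be wrapped in a small batching helper. The following is a minimal sketch, not part of the released code: the names `chunked` and `translate_batch` and the `batch_size` parameter are hypothetical, and it assumes `model` and `tokenizer` were loaded as in the snippet above. It also assumes that `convert_tokens_to_ids` returns the tokenizer's unknown-token ID for an unrecognized language code, and rejects such codes rather than generating with them.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(model, tokenizer, texts: Sequence[str], tgt_lang: str,
                    device: str = "cpu", batch_size: int = 8,
                    max_new_tokens: int = 256) -> List[str]:
    """Translate `texts` into `tgt_lang`, one padded batch at a time.

    `model` and `tokenizer` are the objects loaded in the snippet above
    (hypothetical helper, not part of the released code).
    """
    bos_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    # Assumption: an unrecognized language code maps to the unknown-token ID.
    if bos_id == tokenizer.unk_token_id:
        raise ValueError(f"Unknown target language code: {tgt_lang!r}")

    translations: List[str] = []
    for batch in chunked(list(texts), batch_size):
        # padding=True pads to the longest sentence in the batch; the
        # attention mask returned by the tokenizer tells the model to ignore it.
        inputs = tokenizer(list(batch), return_tensors="pt",
                           padding=True, truncation=True).to(device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                forced_bos_token_id=bos_id)
        translations.extend(
            tokenizer.batch_decode(output, skip_special_tokens=True))
    return translations
```

With the model loaded as above, a call might look like `translate_batch(model, tokenizer, ["Welcome back.", "Let us begin."], "hin_Deva", device=device)`. A smaller `batch_size` trades throughput for lower GPU memory use.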
We used the following datasets for training this adapter:
Shiksha: https://huggingface.co/datasets/SPRINGLab/shiksha
BPCC-cleaned: https://huggingface.co/datasets/SPRINGLab/BPCC_cleaned
We used 8 x A100 40GB GPUs for training this adapter. We would like to thank CDAC for providing the compute resources.
If you use this model in your work, please cite us:
BibTeX:
@misc{joglekar2024shikshatechnicaldomainfocused,
      title={Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages},
      author={Advait Joglekar and Srinivasan Umesh},
      year={2024},
      eprint={2412.09025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.09025},
}
Base model: facebook/nllb-200-3.3B