arxiv:2211.11418

L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages

Published on Nov 21, 2022
Abstract

The monolingual Hindi BERT models currently available on the model hub do not perform better than the multi-lingual models on downstream tasks. We present L3Cube-HindBERT, a Hindi BERT model pre-trained on a Hindi monolingual corpus. Further, since the Indic languages Hindi and Marathi share the Devanagari script, we train a single model for both languages. We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets. We evaluate these models on downstream Hindi and Marathi text classification and named entity recognition tasks. The HindBERT- and DevBERT-based models show significant improvements over the multi-lingual MuRIL, IndicBERT, and XLM-R. Based on these observations, we also release monolingual BERT models for other Indic languages: Kannada, Telugu, Malayalam, Tamil, Gujarati, Assamese, Odia, Bengali, and Punjabi. These models are shared at https://huggingface.co/l3cube-pune.
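Since the checkpoints are released on the Hugging Face hub, they can be loaded with the standard transformers API. Below is a minimal sketch; the repository IDs (`l3cube-pune/hindi-bert-v2` for HindBERT, `l3cube-pune/hindi-marathi-dev-bert` for DevBERT) are assumptions based on the l3cube-pune organization's naming and should be verified at https://huggingface.co/l3cube-pune before use.

```python
# Minimal sketch: loading the released checkpoints with Hugging Face
# transformers. The repo IDs below are assumptions -- check the
# l3cube-pune organization page for the exact names.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

HINDBERT_ID = "l3cube-pune/hindi-bert-v2"          # assumed HindBERT repo id
DEVBERT_ID = "l3cube-pune/hindi-marathi-dev-bert"  # assumed DevBERT repo id

# Load the tokenizer and the masked-LM head for HindBERT.
tokenizer = AutoTokenizer.from_pretrained(HINDBERT_ID)
model = AutoModelForMaskedLM.from_pretrained(HINDBERT_ID)

# Quick sanity check with the fill-mask pipeline on a Hindi sentence
# ("I [MASK] reading books.").
fill_mask = pipeline("fill-mask", model=HINDBERT_ID)
sentence = f"मुझे किताबें पढ़ना {fill_mask.tokenizer.mask_token} है।"
for pred in fill_mask(sentence):
    print(pred["token_str"], round(pred["score"], 3))
```

For the downstream tasks reported in the paper (text classification and NER), the same checkpoints would instead be loaded through `AutoModelForSequenceClassification` or `AutoModelForTokenClassification` and fine-tuned on the task data.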
