Commit cb383ec · Parent: 646e144 · Update README.md

README.md CHANGED
@@ -19,13 +19,21 @@ datasets:
   - oscar-corpus/OSCAR-2301
 ---
 
+<img align="right" src="https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1/resolve/main/TweetyTatar.png?download=true" alt="Tweety-Tatar-7B: A Tatar Large Language Model" width="20%">
+
 # Tweety Tatar / Hydra-Base 7b / 2024-v1
 
 ## Model description
-This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language),
+This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language), converted from the [TowerInstruct-7b-v0.1](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.1) model trained by Unbabel.
 Hydra LLMs are trans-tokenized language models finetuned to produce output in a particular language, while accepting input encoded using either their own tokenizer, the one of their base model, or a mix of both.
 This enables them to receive code-switched input in both their native language and other languages, which is an ideal setup for translation tasks, or retrieval-augmented generation (RAG) in cross-lingual scenarios.
 
+- **Developed by:** [François Remy](https://huggingface.co/FremyCompany) (UGent), [Alfiya Khabibullina](https://huggingface.co/justalphie) (BeCode), [et al.](#citation)
+- **Funded by:** IDLab / GPULab
+- **Model type:** Foundation model using the Mistral architecture
+- **Language(s) (NLP):** Tatar
+- **License:** Creative Commons Attribution Non-Commercial 4.0
+
 ## In-scope usage
 This model can be used as-is to answer questions in Tatar based on a cross-lingual context, or finetuned into a machine translation system from one of the 10 languages supported by TowerInstruct into the Tatar language.
 This list of languages notably includes English and Russian.

@@ -147,4 +155,18 @@ def translate_english_text(english_text: str) -> str:
     return (main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:]))
 
 translate_english_text("The city of Paris is very pretty.") # Париж шәһәре бик матур.
 ```
+
+## Citation
+
+If you use this model, please cite our work as:
+
+```
+@article{tweeties2024,
+    title  = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
+    author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
+    url    = {https://huggingface.co/Tweeties},
+    year   = {2024},
+    note   = {Under review at COLM 2024}
+}
+```
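The `translate_english_text` helper in the diff decodes only the freshly generated tokens by slicing the model output at the prompt length (`model_outputs[0][input_ids.shape[1]:]`), since `generate` returns the prompt ids followed by the continuation. A minimal sketch of that indexing idiom, with plain Python lists standing in for tensors and made-up token ids:

```python
# Toy stand-ins for tensors: generate() returns the prompt ids
# followed by the newly generated ids, in one sequence.
prompt_ids = [101, 42, 7, 9]
model_output = prompt_ids + [55, 66, 77]

# Mirrors main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:]):
# drop the first len(prompt) positions to keep only the continuation.
generated_only = model_output[len(prompt_ids):]
assert generated_only == [55, 66, 77]
```

Decoding only this slice is what lets the helper return just the Tatar translation, rather than echoing the English prompt back.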
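The model card says Hydra LLMs accept input encoded with their own tokenizer, the base model's tokenizer, or a mix of both. As a purely illustrative toy (the ids and the offset scheme below are assumptions for illustration, not the model's actual vocabulary layout, which is described in the cited paper), one way mixed input can live in a single id space is to keep the base tokenizer's ids as-is and place the second tokenizer's ids past them:

```python
# Toy vocabularies; real tokenizers have tens of thousands of entries.
BASE_VOCAB_SIZE = 6
base_ids = {"the": 0, "city": 1, "of": 2, "Paris": 3, "is": 4, "pretty": 5}
tatar_ids = {"шәһәре": 0, "матур": 1, "бик": 2}

def encode_mixed(tokens):
    """Map base-vocab tokens to their own ids, and Tatar tokens
    to ids offset past the base range, in one shared sequence."""
    return [base_ids[t] if t in base_ids
            else BASE_VOCAB_SIZE + tatar_ids[t]
            for t in tokens]

# Code-switched input: an English word followed by Tatar words.
mixed = encode_mixed(["Paris", "шәһәре", "бик", "матур"])
assert mixed == [3, 6, 8, 7]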