Commit cb383ec · Parent: 646e144 · Update README.md

README.md CHANGED
@@ -19,13 +19,21 @@ datasets:
   - oscar-corpus/OSCAR-2301
 ---
 
+<img align="right" src="https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1/resolve/main/TweetyTatar.png?download=true" alt="Tweety-Tatar-7B: A Tatar Large Language Model" width="20%">
+
 # Tweety Tatar / Hydra-Base 7b / 2024-v1
 
 ## Model description
-This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language),
+This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language), converted from the [TowerInstruct-7b-v0.1](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.1) model trained by Unbabel.
 Hydra LLMs are trans-tokenized language models finetuned to produce output in a particular language, while accepting input encoded using either their own tokenizer, the one of their base model, or a mix of both.
 This enables them to receive code-switched input in both their native language and other languages, which is an ideal setup for translation tasks, or retrieval-augmented generation (RAG) in cross-lingual scenarios.
 
+- **Developed by:** [François Remy](https://huggingface.co/FremyCompany) (UGent), [Alfiya Khabibullina](https://huggingface.co/justalphie) (BeCode), [et al.](#citation)
+- **Funded by:** IDLab / GPULab
+- **Model type:** Foundation model using the Mistral architecture
+- **Language(s) (NLP):** Tatar
+- **License:** Creative Commons Attribution Non-Commercial 4.0
+
 ## In-scope usage
 This model can be used as-is to answer questions in Tatar based on a cross-lingual context, or finetuned into a machine translation system from one of the 10 languages supported by TowerInstruct into the Tatar language.
 This list of languages notably includes English and Russian.

@@ -147,4 +155,18 @@ def translate_english_text(english_text: str) -> str:
     return (main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:]))
 
 translate_english_text("The city of Paris is very pretty.") # Париж шәһәре бик матур.
 ```
+
+## Citation
+
+If you use this model, please cite our work as:
+
+```
+@article{tweeties2024,
+    title  = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
+    author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
+    url    = {https://huggingface.co/Tweeties},
+    year   = {2024},
+    note   = {Under review at COLM 2024}
+}
+```
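The `translate_english_text` helper in the diff decodes only the freshly generated tokens by slicing the model output at the prompt length (`model_outputs[0][input_ids.shape[1]:]`), since `generate` returns the prompt ids followed by the continuation. A minimal sketch of that indexing idiom, with plain Python lists standing in for tensors and made-up token ids:

```python
# Toy stand-ins for tensors: generate() returns the prompt ids
# followed by the newly generated ids, in one sequence.
prompt_ids = [101, 42, 7, 9]
model_output = prompt_ids + [55, 66, 77]

# Mirrors main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:]):
# drop the first len(prompt) positions to keep only the continuation.
generated_only = model_output[len(prompt_ids):]
assert generated_only == [55, 66, 77]
```

Decoding only this slice is what lets the helper return just the Tatar translation, rather than echoing the English prompt back.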
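The model card says Hydra LLMs accept input encoded with their own tokenizer, the base model's tokenizer, or a mix of both. As a purely illustrative toy (the ids and the offset scheme below are assumptions for illustration, not the model's actual vocabulary layout, which is described in the cited paper), one way mixed input can live in a single id space is to keep the base tokenizer's ids as-is and place the second tokenizer's ids past them:

```python
# Toy vocabularies; real tokenizers have tens of thousands of entries.
BASE_VOCAB_SIZE = 6
base_ids = {"the": 0, "city": 1, "of": 2, "Paris": 3, "is": 4, "pretty": 5}
tatar_ids = {"шәһәре": 0, "матур": 1, "бик": 2}

def encode_mixed(tokens):
    """Map base-vocab tokens to their own ids, and Tatar tokens
    to ids offset past the base range, in one shared sequence."""
    return [base_ids[t] if t in base_ids
            else BASE_VOCAB_SIZE + tatar_ids[t]
            for t in tokens]

# Code-switched input: an English word followed by Tatar words.
mixed = encode_mixed(["Paris", "шәһәре", "бик", "матур"])
assert mixed == [3, 6, 8, 7]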