FremyCompany committed on
Commit cb383ec · 1 Parent(s): 646e144

Update README.md

Files changed (1): README.md (+23 -1)

README.md CHANGED
@@ -19,13 +19,21 @@ datasets:
   - oscar-corpus/OSCAR-2301
 ---
 
+<img align="right" src="https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1/resolve/main/TweetyTatar.png?download=true" alt="Tweety-Tatar-7B: A Tatar Large Language Model" width="20%">
+
 # Tweety Tatar / Hydra-Base 7b / 2024-v1
 
 ## Model description
-This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language), finetuned from the [TowerInstruct-7b-v0.1](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.1) model trained by Unbabel.
+This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language), converted from the [TowerInstruct-7b-v0.1](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.1) model trained by Unbabel.
 Hydra LLMs are trans-tokenized language models finetuned to produce output in a particular language, while accepting input encoded using either their own tokenizer, the one of their base model, or a mix of both.
 This enables them to receive code-switched input in both their native language and other languages, which is an ideal setup for translation tasks, or retrieval-augmented generation (RAG) in cross-lingual scenarios.
 
+- **Developed by:** [François Remy](https://huggingface.co/FremyCompany) (UGent), [Alfiya Khabibullina](https://huggingface.co/justalphie) (BeCode), [et al.](#citation)
+- **Funded by:** IDLab / GPULab
+- **Model type:** Foundation model using the mistral architecture
+- **Language(s) (NLP):** Tatar
+- **License:** Creative Commons Attribution Non Commercial 4.0
+
 ## In-scope usage
 This model can be used as-is to answer questions in Tatar based on a cross-lingual context, or finetuned into a machine translation system from one of the 10 languages supported by TowerInstruct into the Tatar language.
 This list of languages notably includes English and Russian.
@@ -147,4 +155,18 @@ def translate_english_text(english_text: str) -> str:
     return (main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:]))
 
 translate_english_text("The city of Paris is very pretty.") # Париж шәһәре бик матур.
+```
+
+## Citation
+
+If you use this model, please cite our work as:
+
+```
+@article{tweeties2024,
+    title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
+    author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
+    url = {https://huggingface.co/Tweeties},
+    year = {2024},
+    note = {Under review at COLM 2024}
+}
 ```
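The "trans-tokenized" description in the diff above can be illustrated with a toy sketch. This is a minimal illustration of the general idea only, not the authors' actual code: a token in the new Tatar vocabulary gets its embedding initialized from the base model's embeddings of aligned tokens. Every name, token, and weight below is made up for illustration.

```python
# Toy sketch (pure Python, illustrative assumption): a new-vocabulary token's
# embedding is a weighted mix of aligned base-vocabulary token embeddings.
old_emb = {  # base-model embeddings, dimension 3, toy values
    "pa": [1.0, 0.0, 2.0],
    "ris": [0.0, 2.0, 0.0],
}
alignment = {"париж": {"pa": 0.5, "ris": 0.5}}  # hypothetical alignment weights

def init_embedding(token: str) -> list[float]:
    """Initialize a new token's embedding as a weighted sum of aligned old ones."""
    pairs = alignment[token]
    dim = len(next(iter(old_emb.values())))
    return [sum(w * old_emb[t][d] for t, w in pairs.items()) for d in range(dim)]

print(init_embedding("париж"))  # [0.5, 1.0, 1.0]
```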
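A note on the final line of the README's snippet: `main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:])` works because `generate()` in decoder-only models returns the prompt tokens followed by the newly generated ones. A minimal sketch with toy token ids (no real model or tokenizer):

```python
# generate() echoes the prompt ids before the continuation, so decoding only the
# new tokens means slicing off the first input_ids.shape[1] positions.
prompt_ids = [101, 7592, 2088]                  # toy ids for the encoded prompt
model_output = prompt_ids + [2023, 2003, 102]   # toy result: prompt + continuation

new_token_ids = model_output[len(prompt_ids):]  # mirrors model_outputs[0][input_ids.shape[1]:]
print(new_token_ids)  # [2023, 2003, 102]
```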