---
base_model: Unbabel/TowerInstruct-7B-v0.1
license: cc-by-nc-4.0
language:
- tt
- en
- de
- fr
- zh
- pt
- nl
- ru
- ko
- it
- es
tags:
- tweety
datasets:
- oscar-corpus/OSCAR-2301
---
<img align="right" src="https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1/resolve/main/TweetyTatar.png?download=true" alt="Tweety-Tatar-7B: A Tatar Large Language Model" width="20%">
# Tweety Tatar / Hydra-Base 7b / 2024-v1
## Model description
This model is our Hydra LLM for the [Tatar language](https://en.wikipedia.org/wiki/Tatar_language), converted from the [TowerInstruct-7b-v0.1](https://huggingface.co/Unbabel/TowerInstruct-7B-v0.1) model trained by Unbabel.
Hydra LLMs are trans-tokenized language models finetuned to produce output in a particular language, while accepting input encoded with either their own tokenizer, that of their base model, or a mix of both.
This enables them to receive code-switched input mixing their native language with other languages, an ideal setup for translation tasks or for retrieval-augmented generation (RAG) in cross-lingual scenarios.
- **Developed by:** [François Remy](https://huggingface.co/FremyCompany) (UGent), [Alfiya Khabibullina](https://huggingface.co/justalphie) (BeCode), [et al.](#citation)
- **Funded by:** IDLab / GPULab
- **Model type:** Foundation model using the Mistral architecture
- **Language(s) (NLP):** Tatar
- **License:** Creative Commons Attribution Non Commercial 4.0
## In-scope usage
This model can be used as-is to answer questions in Tatar based on a cross-lingual context, or finetuned into a machine translation system translating any of the 10 languages supported by TowerInstruct into Tatar.
This list of languages notably includes English and Russian.
The model performs best when translating sentences or small paragraphs, and is not suited for document translation tasks.
This model should not be used in the reverse direction, to translate Tatar into English.
When the model has not been finetuned, enabling beam search is recommended for best results.
We also provide a model [finetuned for translation](https://huggingface.co/Tweeties/tweety-tatar-hydra-trans-7b-2024-v1), but take note of the non-commercial license imposed by Unbabel on the base model.
## Usage instructions
Using this model usually requires building prompts by mixing tokens from two tokenizers: the original TowerInstruct tokenizer for input in the source language, and the new Tatar tokenizer for the prompt and output, as described in the examples below:
```py
import re
import torch
import transformers

MODEL_NAME = "Tweeties/tweety-tatar-hydra-base-7b-2024-v1"
MAIN_TOKENIZER_NAME = "Tweeties/tweety-tatar-hydra-base-7b-2024-v1"
UTIL_TOKENIZER_NAME = "Unbabel/TowerInstruct-7B-v0.1"

# load the model and both tokenizers
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
main_tokenizer = transformers.LlamaTokenizerFast.from_pretrained(MAIN_TOKENIZER_NAME)  # new Tatar tokenizer
util_tokenizer = transformers.LlamaTokenizerFast.from_pretrained(UTIL_TOKENIZER_NAME)  # original TowerInstruct tokenizer
main_tokenizer_len = len(main_tokenizer)  # offset to apply to TowerInstruct token IDs
```
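Both examples below rely on the same trick: TowerInstruct token IDs are offset by `main_tokenizer_len`, so that the two vocabularies can coexist in a single input sequence. As a minimal sketch of the idea (the helper name `encode_mixed` is ours, not part of the model's API):

```py
def encode_mixed(tatar_prefix: str, foreign_text: str, tatar_suffix: str) -> torch.Tensor:
    """Build a prompt mixing Tatar text (main vocabulary) with foreign text (shifted TowerInstruct vocabulary)."""
    return torch.concat([
        main_tokenizer.encode(tatar_prefix, return_tensors='pt'),  # Tatar token IDs, used as-is
        util_tokenizer.encode(foreign_text, add_special_tokens=False, return_tensors='pt') + main_tokenizer_len,  # shifted IDs
        main_tokenizer.encode(tatar_suffix, add_special_tokens=False, return_tensors='pt'),
    ], axis=1)
```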
### Cross-lingual question answering
```py
def answer_english_question(english_text: str) -> str:

    # craft the input: Tatar instruction, English question (with shifted token IDs), Tatar answer cue
    input_ids = torch.concat([
        main_tokenizer.encode("Татар телендә түбәндәге сорауга җавап бирегез:\n", return_tensors='pt'),  # "Answer the following question in Tatar:"
        util_tokenizer.encode(english_text, add_special_tokens=False, return_tensors='pt') + torch.tensor([main_tokenizer_len]),
        main_tokenizer.encode("\n\nҗавап:\n", add_special_tokens=False, return_tensors='pt')  # "Answer:"
    ], axis=1)

    # prevent the model from repeating the prompt ("Түбәндәге" = "The following", "Текстны" = "The text")
    prompt_starts = [
        main_tokenizer.encode("Түбәндәге"),
        main_tokenizer.encode("\nТүбәндәге")[2:],
        main_tokenizer.encode("Текстны"),
        main_tokenizer.encode("\nТекстны")[2:]
    ]

    # prevent the model from repeating the first word of the English text (with or without a leading newline, in original or upper case)
    english_starts = [
        main_tokenizer.encode(re.sub(r'[ ].*', '', english_text)),
        main_tokenizer.encode('\n'+re.sub(r'[ ].*', '', english_text))[2:],
        main_tokenizer.encode(re.sub(r'[ ].*', '', english_text.upper())),
        main_tokenizer.encode('\n'+re.sub(r'[ ].*', '', english_text.upper()))[2:],
    ]

    # generate the output with beam search, stopping at the first newline
    model_inputs = {'input_ids': input_ids.to(model.device)}
    model_outputs = model.generate(
        **model_inputs,
        max_new_tokens=5,
        num_beams=8,
        no_repeat_ngram_size=6,
        early_stopping=False,
        pad_token_id=main_tokenizer.eos_token_id,
        eos_token_id=main_tokenizer.convert_tokens_to_ids(['<0x0A>', '</s>']),
        bad_words_ids=english_starts + prompt_starts
    )

    # decode only the newly generated tokens
    return main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:])

answer_english_question("Is Paris located in France?\n")  # Әйе, Парижда ("Yes, in Paris")
```
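Note that `max_new_tokens=5` deliberately caps the answer at a handful of tokens, which suits short factual answers; increase it for longer ones. The `bad_words_ids` lists keep the beam search from merely echoing the prompt or the question, and the newline token (`<0x0A>`) acts as an end-of-sequence marker.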
### Machine Translation (see [finetuned model](https://huggingface.co/Tweeties/tweety-tatar-hydra-trans-7b-2024-v1))
```py
def translate_english_text(english_text: str) -> str:

    # craft the input: Tatar instruction, English source text (with shifted token IDs), Tatar translation cue
    input_ids = torch.concat([
        main_tokenizer.encode("Түбәндәге текстны инглиз теленнән татар теленә тәрҗемә итегез:\n", return_tensors='pt'),  # "Translate the following text from English into Tatar:"
        util_tokenizer.encode(english_text, add_special_tokens=False, return_tensors='pt') + torch.tensor([main_tokenizer_len]),
        main_tokenizer.encode("\nТекстны татар теленә тәрҗемә итү:\n", add_special_tokens=False, return_tensors='pt')  # "Translation of the text into Tatar:"
    ], axis=1)

    # prevent the model from repeating the prompt ("Түбәндәге" = "The following", "Текстны" = "The text")
    prompt_starts = [
        main_tokenizer.encode("Түбәндәге"),
        main_tokenizer.encode("\nТүбәндәге")[2:],
        main_tokenizer.encode("Текстны"),
        main_tokenizer.encode("\nТекстны")[2:]
    ]

    # prevent the model from repeating the first word of the English text (with or without a leading newline, in original or upper case)
    english_starts = [
        main_tokenizer.encode(re.sub(r'[ ].*', '', english_text)),
        main_tokenizer.encode('\n'+re.sub(r'[ ].*', '', english_text))[2:],
        main_tokenizer.encode(re.sub(r'[ ].*', '', english_text.upper())),
        main_tokenizer.encode('\n'+re.sub(r'[ ].*', '', english_text.upper()))[2:],
    ]

    # generate the output with beam search, stopping at the first newline
    model_inputs = {'input_ids': input_ids.to(model.device)}
    model_outputs = model.generate(
        **model_inputs,
        max_new_tokens=128,
        num_beams=8,
        no_repeat_ngram_size=6,
        early_stopping=False,
        pad_token_id=main_tokenizer.eos_token_id,
        eos_token_id=main_tokenizer.convert_tokens_to_ids(['<0x0A>', '</s>']),
        bad_words_ids=english_starts + prompt_starts
    )

    # decode only the newly generated tokens
    return main_tokenizer.decode(model_outputs[0][input_ids.shape[1]:])

translate_english_text("The city of Paris is very pretty.")  # Париж шәһәре бик матур.
```
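Since generation stops at the first newline token, this function translates one sentence or small paragraph at a time, in line with the in-scope usage described above.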
## Citation
If you use this model, please cite our work as:
```
@article{tweeties2024,
title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
url = {https://arxiv.org/abs/2408.04303},
year = {2024},
note = {Accepted at COLM 2024}
}
``` |