🚨 PLEASE USE THE OFFICIAL QUANTIZED VERSIONS: GGUF OR REQUEST A SPECIFIC ONE 🚨
🚨 Third-party quantizations are not guaranteed to use the latest improved version of the model, as we have updated the model's weights 🚨
Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language
Following the release of Meltemi-7B on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs. Krikri is built on top of Llama-3.1-8B, extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Instruct, along with the base model, Llama-Krikri-8B-Base.
Model Information
Base Model
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- Continual pretraining of Llama-3.1-8B on a large training corpus to add proficiency in Greek.
  - The corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  - Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  - The training corpus also contains 7.8 billion math and code tokens.
  - The corpus has been processed, filtered, and deduplicated to ensure data quality, and is outlined below:
Sub-corpus | # Tokens | Percentage |
---|---|---|
Greek | 56.7 B | 62.3 % |
English | 21.0 B | 23.1 % |
Parallel | 5.5 B | 6.0 % |
Math/Code | 7.8 B | 8.6 % |
Total | 91.0 B | 100 % |
Selected subsets of the 91-billion-token corpus were upsampled, resulting in a final training size of 110 billion tokens.
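As a quick sanity check of the figures above, the percentages in the table follow directly from the token counts, and the upsampling corresponds to an overall factor of roughly 1.21. The sketch below is illustrative only: the variable names are made up for this example, and since per-subset upsampling rates are not stated here, only the aggregate factor is computed.

```python
# Token counts in billions, copied from the table above.
# The variable names are illustrative, not part of any released tooling.
corpus_billion_tokens = {
    "Greek": 56.7,
    "English": 21.0,
    "Parallel": 5.5,
    "Math/Code": 7.8,
}

total = sum(corpus_billion_tokens.values())
shares = {name: 100 * count / total for name, count in corpus_billion_tokens.items()}

for name, share in shares.items():
    print(f"{name:<10} {corpus_billion_tokens[name]:>5.1f} B  {share:4.1f} %")
print(f"{'Total':<10} {total:>5.1f} B")

# Aggregate upsampling factor: 110 B tokens after upsampling vs. 91 B before.
upsampling_factor = 110 / total
print(f"Overall upsampling factor: {upsampling_factor:.2f}")
```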
Instruct Model
Llama-Krikri-8B-Instruct is the result of post-training Llama-Krikri-8B-Base and features:
- Enhanced chat capabilities and instruction-following in both Greek and English.
- Document translation from Greek to English, French, German, Italian, Portuguese, Spanish and vice versa.
- Great performance on generation, comprehension, and editing tasks, such as summarization, creative content creation, text modification, entity recognition, sentiment analysis, etc.
- Domain-specific expertise for specialized legal, financial, medical, and scientific applications.
- Retrieval-Augmented Generation (RAG) utilizing multiple documents with 128k context length.
- Improved coding and agentic capabilities with correct formatting and tool use.
- Conversion or structured extraction (e.g., XML, JSON) in data-to-text & text-to-data settings.
- Analytical thinking and Chain-of-Thought (CoT) reasoning for problem-solving.
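To make the 128k-token context concrete for multi-document RAG, here is a rough sketch of budgeting documents by word count. It assumes about 1.6 tokens per Greek word, a ratio derived only from the "128k context length (approximately 80,000 Greek words)" figure above. The function name, the reserved allowance, and the greedy strategy are illustrative assumptions; a real pipeline should count tokens with the model's tokenizer.

```python
def select_documents(documents, context_tokens=128_000, reserved_tokens=4_096,
                     tokens_per_word=1.6):
    """Greedily keep documents, in order, while the estimated token count fits.

    tokens_per_word=1.6 is an estimate (128,000 tokens / 80,000 Greek words),
    and reserved_tokens is a hypothetical allowance for the prompt and the answer.
    """
    budget = context_tokens - reserved_tokens
    selected, used = [], 0
    for doc in documents:
        # Estimate the token count from the whitespace-separated word count.
        estimated = int(len(doc.split()) * tokens_per_word) + 1
        if used + estimated > budget:
            break
        selected.append(doc)
        used += estimated
    return selected, used


# Three synthetic documents of 30,000 words each; only some of them fit.
docs = ["λέξη " * 30_000, "λέξη " * 30_000, "λέξη " * 30_000]
kept, used = select_documents(docs)
print(len(kept), used)
```

Stopping at the first document that does not fit keeps the retrieval order intact, which matters when documents are ranked by relevance.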
Post-training Methodology
We built Llama-Krikri-8B-Instruct with a multi-stage process that includes:
- 2-stage Supervised Fine-Tuning (SFT) with a combination of Greek & English instruction-response pairs (& multi-turn conversations)
  - Stage 1: 856,946 instruction-response pairs (371,379 Greek + 485,567 English)
  - Stage 2: 638,408 instruction-response pairs (279,948 Greek + 358,460 English)
- Alignment with a combination of Greek & English preference triplets (Instruction - Chosen Response - Rejected Response)
  - Length Normalized DPO: 92,394 preference triplets (47,132 Greek + 45,262 English)
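The exact alignment objective is not spelled out here. As a point of reference, the sketch below computes a length-normalized DPO loss for a single preference triplet in the commonly used form where each policy/reference log-ratio is divided by the response length in tokens. The function name, the argument names, and the `beta` value are assumptions for illustration, not the training configuration.

```python
import math


def length_normalized_dpo_loss(chosen_logp, chosen_ref_logp, chosen_len,
                               rejected_logp, rejected_ref_logp, rejected_len,
                               beta=5.0):
    """Loss for one triplet; *_logp are summed token log-probabilities.

    chosen_logp / rejected_logp come from the policy being trained,
    *_ref_logp from the frozen reference model. beta is a hypothetical value.
    """
    # Per-token (length-normalized) log-ratios of policy vs. reference.
    chosen_reward = (chosen_logp - chosen_ref_logp) / chosen_len
    rejected_reward = (rejected_logp - rejected_ref_logp) / rejected_len
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


loss = length_normalized_dpo_loss(
    chosen_logp=-40.0, chosen_ref_logp=-50.0, chosen_len=20,
    rejected_logp=-90.0, rejected_ref_logp=-80.0, rejected_len=40,
)
print(f"{loss:.4f}")
```

Dividing by length keeps long responses from dominating the margin simply because they accumulate more log-probability mass.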
Post-training Data Construction
To build the SFT & DPO data, we utilized various methodologies including:
- Collecting existing high-quality datasets such as Tulu 3, SmolTalk, MAGPIE Ultra, Orca Agent Instruct, IFEval Like Data, UltraFeedback, NVIDIA HelpSteer2, Intel Orca, UltraMedical, and other datasets focused on safety, truthfulness, and instruction-following.
- Translating various data into Greek using an in-house translation tool.
- Regenerating translated data and contrasting the translated with the regenerated responses (i.e., for creating preference triplets).
- Distilling (with the MAGPIE methodology) models which exhibit strong performance in Greek, such as Gemma 2 27B IT.
- Scoring data with the Skywork Reward Gemma 2 27B v0.2 Reward Model and filtering using rule-based filters.
- Creating data for sentence and document translation using high-quality parallel corpora mainly from ELRC-SHARE.
- Synthetically extracting question-answer pairs and multi-turn dialogues from diverse sources such as Wikipedia, EUR-LEX, Greek School Books, and Kallipos.
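The "regenerate, contrast, score, and filter" steps in the list above can be sketched as follows. This is a simplified illustration, not the actual pipeline: `score_fn` stands in for a reward model such as the one named above, and the function name, the `min_margin` threshold, and the toy scorer are all hypothetical.

```python
from typing import Callable, Optional


def build_preference_triplet(
    instruction: str,
    translated_response: str,
    regenerated_response: str,
    score_fn: Callable[[str, str], float],
    min_margin: float = 0.5,
) -> Optional[dict]:
    """Return an (instruction, chosen, rejected) triplet, or None if filtered out."""
    translated_score = score_fn(instruction, translated_response)
    regenerated_score = score_fn(instruction, regenerated_response)

    # Rule-based filter (hypothetical): drop pairs whose scores are too close
    # to provide a clear preference signal.
    if abs(translated_score - regenerated_score) < min_margin:
        return None

    if regenerated_score > translated_score:
        chosen, rejected = regenerated_response, translated_response
    else:
        chosen, rejected = translated_response, regenerated_response
    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}


def toy_score(instruction: str, response: str) -> float:
    """Toy stand-in for a reward model: responses with more words score higher."""
    return float(len(response.split()))


triplet = build_preference_triplet(
    "Τι είναι το κρικρί;",  # "What is a kri-kri?"
    "Αγριοκάτσικο.",  # translated response: "Wild goat."
    "Το κρικρί είναι αγριοκάτσικο που ζει στην Κρήτη.",  # regenerated response
    toy_score,
)
print(triplet["chosen"])
```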
How to use
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the instruction-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
model.to(device)

# System prompt (English: "You are Krikri, a highly advanced AI model for Greek,
# and you were trained by ILSP of the "Athena" Research Center.")
system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικά και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
# User prompt (English: "How does a kri-kri differ from a llama?")
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Render the conversation with the model's chat template, then tokenize it
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors='pt').to(device)

# Sample up to 256 new tokens; unpacking passes input_ids and attention_mask
outputs = model.generate(**input_prompt, max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```
With an OpenAI-compatible server via vLLM
```shell
vllm serve ilsp/Llama-Krikri-8B-Instruct \
    --enforce-eager \
    --dtype 'bfloat16' \
    --api-key token-abc123
```
The model can then be queried from Python:
```python
from openai import OpenAI

# Must match the --api-key value and the address of the vLLM server started above
api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# System prompt (English: "You are an advanced translation system that answers with
# Python lists. You write nothing in your answers other than the translated lists.")
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python. Δεν γράφεις τίποτα άλλο στις απαντήσεις σου πέρα από τις μεταφρασμένες λίστες."
# User prompt (English: "Give me the following list with each of its strings translated into Greek: [...]")
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=8192,
    stream=False,
)

print(response.choices[0].message.content)
# ['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική', 'Συνεπειοκρατική ηθική', 'Ωφελιμιστική ηθική', 'Δεοντολογική ηθική', 'Ηθική αρετών', 'Σχετικιστική ηθική']
```
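Because the system prompt asks for replies formatted as Python lists, the returned text can be converted into an actual list. Below is a minimal sketch using the standard library's `ast.literal_eval`, which parses literals without executing code. The helper name is hypothetical, and raising `ValueError` on malformed replies is just one possible way to handle them.

```python
import ast


def parse_list_reply(reply: str) -> list[str]:
    """Parse a model reply that should contain a Python list of strings."""
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Reply is not a valid Python literal: {reply!r}") from exc
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ValueError("Reply is not a list of strings")
    return value


# A shortened reply in the same format as the example output.
reply = "['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική']"
translations = parse_list_reply(reply)
print(translations)
```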
Evaluation
In the table below, we report the scores for our chat evaluation suite which includes:
- Greek IFEval (strict average)
- English IFEval (strict average)
- Greek MT-Bench using gpt-4o-2024-08-06 as the judge model.
- English MT-Bench using gpt-4o-2024-08-06 as the judge model.
Llama-Krikri-8B-Instruct exhibits the strongest instruction-following performance in both Greek and English among all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by +21.7 and +7.3 percentage points on the Greek and English IFEval respectively. It also exhibits the strongest chat capabilities on the Greek MT-Bench benchmark (+0.28 compared to Aya Expanse 8B), while remaining competitive on the English variant of MT-Bench.
Model | IFEval EL (strict avg) | IFEval EN (strict avg) | MT-Bench EL | MT-Bench EN |
---|---|---|---|---|
Qwen 2.5 7B Instruct | 46.2% | 74.8% | 5.83 | 7.87 |
EuroLLM 9B Instruct | 51.3% | 64.5% | 5.98 | 6.27 |
Aya Expanse 8B | 50.4% | 62.2% | 7.68 | 6.92 |
Meltemi 7B v1.5 Instruct | 32.7% | 41.2% | 6.25 | 5.46 |
Llama-3.1-8B Instruct | 45.8% | 75.1% | 6.46 | 7.25 |
Llama-Krikri-8B Instruct | 67.5% | 82.4% | 7.96 | 7.21 |
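The margins quoted above can be recomputed from the table. This is a small sketch with the relevant rows copied in by hand; the dictionary layout and variable names are illustrative.

```python
# Scores copied from the table: (IFEval EL %, IFEval EN %, MT-Bench EL, MT-Bench EN)
scores = {
    "Aya Expanse 8B": (50.4, 62.2, 7.68, 6.92),
    "Llama-3.1-8B Instruct": (45.8, 75.1, 6.46, 7.25),
    "Llama-Krikri-8B Instruct": (67.5, 82.4, 7.96, 7.21),
}

krikri = scores["Llama-Krikri-8B Instruct"]
llama = scores["Llama-3.1-8B Instruct"]
aya = scores["Aya Expanse 8B"]

# IFEval differences are in percentage points; MT-Bench differences are in score points.
ifeval_el_gain = round(krikri[0] - llama[0], 1)
ifeval_en_gain = round(krikri[1] - llama[1], 1)
mt_bench_el_gain = round(krikri[2] - aya[2], 2)

print(ifeval_el_gain, ifeval_en_gain, mt_bench_el_gain)
```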
We also used the Arena-Hard-Auto automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available here. We report 2 scores for Arena-Hard-Auto:
- No Style Control: The original version of the benchmark.
- With Style Control: The benchmark with style control methods for Markdown elements. You can read more about the methodology and technical background in this blog post.
Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models. The scores were determined using gpt-4o-2024-08-06 as the judge model and gpt-4o-mini-2024-07-18 as the baseline model (i.e., it scores 50% by default).
Llama-Krikri-8B Instruct exhibits very strong chat capabilities, scoring higher than models over 8 times its size (such as Llama-3.1-70B Instruct), and is also competitive with closed-source (e.g., GPT-4o-Mini) and highly performant open-source models (e.g., Gemma 2 27B IT & Aya Expanse 32B).
Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology by using gpt-4-1106-preview as the judge model and gpt-4-0314 as the baseline model.
Llama-Krikri-8B Instruct also performs very well in the English variant of Arena-Hard-Auto: it is competitive with significantly larger previous-generation LLMs (such as Qwen 2 72B Instruct) and improves upon Llama-3.1-8B Instruct by +24.5% / +16% (No style control / With style control).
*Please note that judge models are biased towards student models trained on distilled data from them. You can read more here.
🚨 More information on post-training, methodology, and evaluation coming soon. 🚨
Acknowledgements
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.