---
base_model: unsloth/meta-llama-3.1-8b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- grpo
license: apache-2.0
language:
- en
- tr
datasets:
- umarigan/OpenThoughts-43k-TR
---

# Uploaded model

- **Developed by:** umarigan
- **License:** apache-2.0
- **Finetuned from model:** unsloth/meta-llama-3.1-8b-instruct-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)


## Eval results

- arc-tr: 57.68%
- truthful_qa-tr: 43.45% (MC1), 22.15% (MC2)

The following code reproduces these results:

```python
import re

import torch
from datasets import load_dataset
from transformers import pipeline

model_id = "umarigan/llama-3.2-8B-R1-Tr"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


# ARC-TR benchmark
ds = load_dataset("mukayese/arc-tr", split="test")

def extract_answer(text):
    """Extract first occurring A-D label from generated text"""
    match = re.search(r'\b([A-D])\b', text, re.IGNORECASE)
    return match.group(1).upper() if match else None

total = 0
correct = 0

for example in ds:
    # Format the question and choices
    question = example["question"]
    choices = "\n".join([f"{label}) {text}" for label, text in 
                       zip(example["choices"]["label"], example["choices"]["text"])])
    
    # Create prompt with explicit instruction
    prompt = f"""Answer this multiple-choice question by providing ONLY the letter corresponding to the correct answer (A, B, C, or D). Do not include any explanation.

    Question: {question}
    Options:
    {choices}
    Answer:"""
        
    # Generate response
    messages = [{"role": "user", "content": prompt}]
    try:
        outputs = pipe(
            messages,
            max_new_tokens=5,  # Limit response length to get just the answer
            do_sample=False    # Disable sampling for more deterministic answers
        )
        response = outputs[0]["generated_text"][-1]['content']
        predicted = extract_answer(response)
        answer = example["answerKey"]
        
        # Update counters
        total += 1
        if predicted == answer:
            correct += 1
            
    except Exception as e:
        print(f"Error processing example: {e}")
        continue

# Print results
print(f"\nBenchmark Results:")
print(f"Total questions processed: {total}")
print(f"Correct answers: {correct}")
print(f"Accuracy: {correct/total:.2%}" if total > 0 else "No questions processed")
#output
#Benchmark Results:
#Total questions processed: 1172
#Correct answers: 676
#Accuracy: 57.68%


# TruthfulQA-TR benchmark
ds2 = load_dataset("mukayese/truthful_qa-tr", split="validation")
def evaluate_mc(example, targets_key="mc1_targets"):
    """Evaluate a single multiple-choice example with variable choices"""
    question = example["question"]
    choices = example[targets_key]["choices"]
    labels = example[targets_key]["labels"]
    
    # Generate option labels dynamically (A, B, C, ..., G)
    option_labels = [chr(65 + i) for i in range(len(choices))]
    
    # Create prompt with explicit instruction
    options_text = "\n".join([f"{label}) {text}" for label, text in zip(option_labels, choices)])
    prompt = f"""Answer this multiple-choice question by selecting the most correct option. Provide only the letter corresponding to your choice ({', '.join(option_labels)}).

  Question: {question}
  Options:
  {options_text}
  Answer:"""
    
    # Generate response
    messages = [{"role": "user", "content": prompt}]
    try:
        outputs = pipe(
            messages,
            max_new_tokens=5,  # Limit response length to get just the answer
            do_sample=False    # Disable sampling for more deterministic answers
        )
        response = outputs[0]["generated_text"][-1]['content']
        
        # Extract predicted label
        predicted = extract_answer_mc(response, option_labels)
        if predicted is None:
            return 0  # Count as incorrect if no valid answer
        
        # Get correct answer
        correct_idx = labels.index(1)
        correct_label = option_labels[correct_idx]
        
        return int(predicted == correct_label)
    
    except Exception as e:
        print(f"Error processing example: {e}")
        return 0

def extract_answer_mc(text, valid_labels):
    """Extract the first occurring valid label from generated text."""
    # Build a regex that matches any of the valid labels as a whole word
    pattern = r'\b(' + '|'.join(valid_labels) + r')\b'
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1).upper() if match else None

# Evaluate on both mc1 and mc2 targets
mc1_scores = []
mc2_scores = []

for example in ds2:
    mc1_scores.append(evaluate_mc(example, "mc1_targets"))
    mc2_scores.append(evaluate_mc(example, "mc2_targets"))

# Calculate metrics
def calculate_metrics(scores):
    total = len(scores)
    correct = sum(scores)
    accuracy = correct / total if total > 0 else 0
    return total, correct, accuracy

mc1_total, mc1_correct, mc1_accuracy = calculate_metrics(mc1_scores)
mc2_total, mc2_correct, mc2_accuracy = calculate_metrics(mc2_scores)

# Print results
print("\nBenchmark Results:")
print(f"MC1 Targets:")
print(f"Total questions: {mc1_total}")
print(f"Correct answers: {mc1_correct}")
print(f"Accuracy: {mc1_accuracy:.2%}")
print(f"\nMC2 Targets:")
print(f"Total questions: {mc2_total}")
print(f"Correct answers: {mc2_correct}")
print(f"Accuracy: {mc2_accuracy:.2%}")

#output
#MC1 Targets:
#Total questions: 817
#Correct answers: 355
#Accuracy: 43.45%

#MC2 Targets:
#Total questions: 817
#Correct answers: 181
#Accuracy: 22.15%
```