Llama-3.2-3B-LongCoT

A small language model with long chain-of-thought (LongCoT) reasoning capability.

Features

  • Fine-tuned on high-quality synthetic data.
  • Adaptively decides whether to use LongCoT based on the complexity of the question (see the quick check below).
  • Strong at mathematics and reasoning.
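
A quick way to probe the adaptive behavior is to compare a trivial prompt against a harder one. The snippet below is a minimal sketch that assumes the stream_chat helper defined in the Inference section further down; the exact outputs vary between runs.

# Sketch: probe adaptive LongCoT with prompts of different difficulty.
# Assumes `stream_chat` from the Inference section below is already defined.
simple = [{"role": "user", "content": "What is 17 + 25?"}]
hard = [{"role": "user", "content": "Prove that the product of two consecutive integers is even."}]

stream_chat(simple)  # typically answered directly, without a long trace
stream_chat(hard)    # typically triggers an extended chain-of-thought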

Benchmark

Benchmark   Llama-3.2-3B-Instruct   Llama-3.2-3B-LongCoT
MATH        35.5                    52.0
GSM8K       77.3                    82.3
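
The card does not say how these scores were produced. One hedged way to check the GSM8K number is EleutherAI's lm-evaluation-harness; the settings below (task name, few-shot count, batch size) are assumptions and may not match the original evaluation.

# Sketch: re-run GSM8K with lm-evaluation-harness (pip install lm-eval).
# The evaluation settings here are assumptions, not the configuration
# used for the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Kadins/Llama-3.2-3B-LongCoT,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])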

Inference

Example of streaming inference:

import time
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TextStreamer,
)

# Model ID from Hugging Face
model_id = "Kadins/Llama-3.2-3B-LongCoT"

# Load the pre-trained model with appropriate data type and device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for optimized performance
    device_map="auto",           # Automatically map the model to available devices
)
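
# Optional (an assumption, not part of the original card): on GPUs with
# limited memory, the checkpoint can instead be loaded in 4-bit precision
# via bitsandbytes, e.g.:
#
#   from transformers import BitsAndBytesConfig
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id,
#       quantization_config=BitsAndBytesConfig(
#           load_in_4bit=True,
#           bnb_4bit_compute_dtype=torch.bfloat16,
#       ),
#       device_map="auto",
#   )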

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)

def stream_chat(messages, max_new_tokens=8192, top_p=0.95, temperature=0.6):
    """
    Generates a response using streaming inference.

    Args:
        messages (list): A list of dictionaries containing the conversation prompt.
        max_new_tokens (int): Maximum number of tokens to generate.
        top_p (float): Nucleus sampling parameter for controlling diversity.
        temperature (float): Sampling temperature to control response creativity.
    """
    # Prepare the input by applying the chat template and tokenizing
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # Ensure the output is a dictionary
    ).to(model.device)  # Move the inputs to the same device as the model

    # Initialize the TextStreamer for real-time output
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Record the start time for performance measurement
    start_time = time.time()

    # Generate the response using the model's generate method with streaming
    model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        repetition_penalty=1.1,
        top_p=top_p,
        temperature=temperature,
        streamer=streamer,  # Enable streaming of the generated tokens
    )

    # Calculate and print the total response time
    total_time = time.time() - start_time
    print(f"\n--- Response finished in {total_time:.2f} seconds ---")

def chat_loop():
    """
    Initiates an interactive chat session with the model.
    Continuously reads user input and generates model responses until the user exits.
    """
    while True:
        # Start each turn from a fresh conversation containing only the system
        # message; no multi-turn history is kept (see the note below)
        messages = [
            {"role": "system", "content": "You are a reasoning expert and helpful assistant."},
        ]

        # Prompt the user for input
        user_input = input("\nUser: ")
        if user_input.strip().lower() in ["exit", "quit"]:
            print("Exiting chat...")
            break

        # Append the user's message to the conversation history
        messages.append({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)

        # Generate and stream the assistant's response
        stream_chat(messages)

        # Note: the assistant's reply is streamed directly to the console and is
        # not stored, so each turn starts fresh. To keep a multi-turn history,
        # see the stream_chat_with_history sketch after this script.

if __name__ == "__main__":
    # Start the interactive chat loop when the script is executed
    chat_loop()
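
To keep a multi-turn history, the assistant's reply has to be captured as it streams. Below is a minimal sketch, assuming the model and tokenizer loaded above; the helper name stream_chat_with_history is illustrative. It uses transformers' TextIteratorStreamer, which yields decoded text chunks instead of printing them, so the reply can be printed and accumulated at the same time.

from threading import Thread
from transformers import TextIteratorStreamer

def stream_chat_with_history(messages, max_new_tokens=8192, top_p=0.95, temperature=0.6):
    """Streams a reply to the console and appends it to `messages`."""
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    # Unlike TextStreamer, TextIteratorStreamer exposes the decoded chunks
    # as an iterator rather than printing them itself.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Run generation in a background thread so this thread can consume the stream.
    thread = Thread(
        target=model.generate,
        kwargs=dict(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            repetition_penalty=1.1,
            top_p=top_p,
            temperature=temperature,
            streamer=streamer,
        ),
    )
    thread.start()

    reply = ""
    for chunk in streamer:
        print(chunk, end="", flush=True)
        reply += chunk
    thread.join()

    # Store the reply so the next turn sees the full conversation.
    messages.append({"role": "assistant", "content": reply})
    return reply

With this helper, chat_loop can initialize messages once, outside the while loop, so the history grows across turns.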