---
license: cc-by-nc-4.0
language:
- ru
- en
base_model:
- d0rj/rut5-base-summ
pipeline_tag: summarization
tags:
- summarization
- natural-language-processing
- text-summarization
- machine-learning
- deep-learning
- transformer
- artificial-intelligence
- text-analysis
- sequence-to-sequence
- pytorch
- tensorflow
- safetensors
- t5
library_name: transformers
---

![Official LaciaSUM Logo](https://huggingface.co/LaciaStudio/Lacia_sum_small_v1/resolve/main/LaciaSUM.png)

# Russian Text Summarization Model - LaciaSUM V1 (small)

This model is a fine-tuned version of d0rj/rut5-base-summ designed for automatic text summarization. It has been adapted specifically for processing Russian texts and fine-tuned on a custom CSV dataset containing original texts and their corresponding summaries.

# Key Features

* Objective: automatic abstractive summarization of texts.
* Base model: d0rj/rut5-base-summ.
* Dataset: a custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
* Preprocessing: before tokenization, the prefix `summarize:` is added to the original text, which helps the model focus on the summarization task.

# Training Settings

* Number of epochs: 9.
* Batch size: 4 per device.
* Warmup steps: 1000.
* FP16 training enabled (if CUDA is available).
* Hardware: training was performed on an RTX 3070 (approximately 40 minutes).

# Description

The model was fine-tuned using the Transformers library along with the Seq2SeqTrainer from Hugging Face. The training script includes:

* Custom dataset: the SummarizationDataset class reads the CSV file (ensuring the correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
* Token processing: to improve loss computation, padding tokens in the target text are replaced with -100.
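The card names the SummarizationDataset class and the -100 label masking but does not include the training code itself. Below is a minimal sketch of how such a dataset class might look; the column names `Text` and `Summarize`, the `summarize:` prefix, and the -100 masking come from the card, while the maximum lengths, CSV options, and method details are assumptions:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class SummarizationDataset(Dataset):
    """Reads a CSV with `Text` and `Summarize` columns and prepares T5 inputs."""

    def __init__(self, csv_path, tokenizer, max_source_len=512, max_target_len=150):
        df = pd.read_csv(csv_path, encoding="utf-8")
        df.columns = df.columns.str.strip()  # trim extra spaces from column names
        self.texts = df["Text"].astype(str).tolist()
        self.summaries = df["Summarize"].astype(str).tolist()
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Add the task prefix before tokenizing the source text
        source = self.tokenizer(
            "summarize: " + self.texts[idx],
            max_length=self.max_source_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        target = self.tokenizer(
            self.summaries[idx],
            max_length=self.max_target_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        labels = target["input_ids"].squeeze(0)
        # Replace padding tokens with -100 so the loss ignores them
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": source["input_ids"].squeeze(0),
            "attention_mask": source["attention_mask"].squeeze(0),
            "labels": labels,
        }
```

An instance of this dataset would then be passed to Seq2SeqTrainer, with Seq2SeqTrainingArguments carrying the hyperparameters listed above (9 epochs, batch size 4 per device, 1000 warmup steps, FP16 when CUDA is available).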
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats. **The model also supports English, but this support has not been tested.**

# Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")

text = "Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим новые перспективы в различных областях."

# Add the "summarize: " prefix expected by the model
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```

# Example of Summarization

**RU**

Main text:
```
Современные технологии оказывают значительное влияние на нашу повседневную жизнь и рабочие процессы. Искусственный интеллект становится важным инструментом, помогающим оптимизировать задачи и открывающим новые перспективы в различных областях.
```
Summarized text:
```
Современные технологии оказывают значительное влияние на повседневную жизнь и рабочие процессы, включая искусственный интеллект, который помогает оптимизировать задачи и открывать новые перспективы.
```

**EN**

Main text:
```
Modern technologies have a significant impact on our daily lives and work processes. Artificial intelligence is becoming an important tool that helps optimize tasks and opens up new opportunities in various fields.
```
Summarized text:
```
Matern technologies have a controration on our daily lives and work processes.
Artificial intelligence is becoming an important tool and helps and opens up new opportunities.
```

**Finetuned by LaciaStudio | LaciaAI**