---
library_name: transformers
tags:
- slang
- korea
- profanity
- translator
- Korean
license: mit
language:
- ko
base_model:
- hyunwoongko/kobart
pipeline_tag: translation
---

KoBART-based Korean Slang Translator

This model is a translator that converts Korean slang into standard Korean, built by fine-tuning KoBART.

Base Model

SKT KoBART (Hub checkpoint: hyunwoongko/kobart)

Dataset

We used the Age-Specific Characteristic Utterances (slang, jargon, etc.) dataset from AI Hub. Training was conducted on slang and trending expressions from speakers in their teens, 20s, and 30s.
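
Before tokenization, the AI Hub files have to be converted into (slang, standard) sentence pairs. The following is a minimal preprocessing sketch, not the author's actual pipeline: the JSON Lines layout and the field names slang and standard are assumptions, since the card does not document the exact file format.

from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

# Hypothetical layout: JSON Lines with a "slang" field (input) and a "standard" field (target).
raw = load_dataset("json", data_files={"train": "train.jsonl", "validation": "valid.jsonl"})

tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

def preprocess(batch):
    # Slang sentences become encoder inputs; standard-language sentences become the labels.
    model_inputs = tokenizer(batch["slang"], max_length=64, truncation=True)
    labels = tokenizer(text_target=batch["standard"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=["slang", "standard"])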

Usage Example

Input Text: 아 롤하는데 한타에서 졌어
Generated Text: 아 리그 오브 레전드하는데 대규모 교전에서 졌어

Training Details

Hyperparameters

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="your dir",
    evaluation_strategy="steps",
    eval_steps=10000,
    save_strategy="steps",
    save_steps=10000,
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=8,
    logging_dir="your dir",
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
    report_to="none",
    logging_steps=1000,
    warmup_steps=500,
    lr_scheduler_type="linear",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
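
These arguments feed the standard Trainer API. The wiring below is a hedged sketch rather than the author's exact script: tokenized refers to the preprocessed dataset from the sketch above, and DataCollatorForSeq2Seq handles dynamic padding of inputs and labels.

from transformers import BartForConditionalGeneration, DataCollatorForSeq2Seq, Trainer

model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")

# Pads inputs and labels per batch; padded label positions are set to -100
# so they are ignored by the cross-entropy loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)
trainer.train()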

Training Environment

  • GPU: NVIDIA RTX A5000
  • Training Time: 8 hours

Training Results

| Step    | Training Loss | Validation Loss |
|---------|---------------|-----------------|
| 100000  | 0.0591000     | 0.047132        |
| 200000  | 0.0303000     | 0.024423        |
| 300000  | 0.0208000     | 0.017365        |
| 400000  | 0.0159000     | 0.013130        |
| 500000  | 0.0129000     | 0.011025        |
| …       | …             | …               |
| 5900000 | 0.0002000     | 0.007907        |
| 6000000 | 0.0002000     | 0.007920        |
| 6100000 | 0.0002000     | 0.007869        |

How to Use

from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

MODEL_NAME = "hongggggggggggg/korea-slang-translator-kobart"
tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

# Test input data
input_text = "아 롤하는데 한타에서 졌어"

# Tokenize input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Model inference
output_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)

# Decode the generated text (generate returns a batch of sequences, so take the first)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Print results
print("Input Text:", input_text)
print("Generated Text:", output_text)