---
library_name: transformers
tags:
- slang
- korea
- profanity
- translator
- Korean
license: mit
language:
- ko
base_model:
- hyunwoongko/kobart
pipeline_tag: translation
---
# KoBART-based Korean Slang Translator

This model is a translator that converts Korean slang into standard Korean, built by fine-tuning KoBART.
## Base Model

- hyunwoongko/kobart

## Dataset
We used the Age-specific Characteristic Utterances (Slang, Jargon, etc.) dataset from AI Hub. Training used trend-word and slang data from speakers in their teens, twenties, and thirties.
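Before fine-tuning, a corpus like this has to be flattened into (slang, standard) source/target pairs for the seq2seq objective. A minimal sketch with synthetic records; the field names and sentences below are illustrative assumptions, not the actual AI Hub schema:

```python
# Synthetic records in the spirit of the AI Hub corpus
# (field names and sentences are illustrative, not the real schema).
records = [
    {"slang": "아 롤하는데 한타에서 졌어",
     "standard": "아 리그 오브 레전드하는데 대규모 교전에서 졌어"},
    {"slang": "그 영화 존잼이야",
     "standard": "그 영화 정말 재미있어"},
]

def to_pairs(records):
    """Flatten records into (source, target) tuples for seq2seq fine-tuning."""
    return [(r["slang"], r["standard"]) for r in records]

pairs = to_pairs(records)
print(len(pairs))  # 2
```

Each pair then becomes one tokenized encoder input (slang) and decoder label (standard) for KoBART.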
## Usage Example

Input Text: 아 롤하는데 한타에서 졌어
Generated Text: 아 리그 오브 레전드하는데 대규모 교전에서 졌어

The model expands the slang terms "롤" and "한타" into their standard forms: "리그 오브 레전드" (League of Legends) and "대규모 교전" (large-scale team fight).
## Training Details

### Hyperparameters

```python
training_args = TrainingArguments(
    output_dir="your dir",
    evaluation_strategy="steps",
    eval_steps=10000,
    save_strategy="steps",
    save_steps=10000,
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=8,
    logging_dir="your dir",
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
    report_to="none",
    logging_steps=1000,
    warmup_steps=500,
    lr_scheduler_type="linear",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```
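With `warmup_steps=500` and `lr_scheduler_type="linear"`, the learning rate ramps linearly from 0 to 2e-5 over the first 500 steps and then decays linearly toward 0 by the final step. A minimal sketch of that schedule; the total step count here is a placeholder, not a value taken from this run:

```python
def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_steps=500):
    """Linear warmup then linear decay, as lr_scheduler_type='linear' does."""
    if step < warmup_steps:
        # Warmup: ramp from 0 to base_lr over the first warmup_steps steps.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr back to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL = 100_000  # placeholder; the real total depends on dataset size and epochs
print(linear_schedule_lr(250, TOTAL))  # mid-warmup: 1e-05
print(linear_schedule_lr(500, TOTAL))  # warmup end, peak LR: 2e-05
```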
## Training Environment

- GPU: NVIDIA RTX A5000
- Training Time: 8 hours
## Training Results

| Step | Training Loss | Validation Loss |
|---|---|---|
| 100000 | 0.059100 | 0.047132 |
| 200000 | 0.030300 | 0.024423 |
| 300000 | 0.020800 | 0.017365 |
| 400000 | 0.015900 | 0.013130 |
| 500000 | 0.012900 | 0.011025 |
| 5900000 | 0.000200 | 0.007907 |
| 6000000 | 0.000200 | 0.007920 |
| 6100000 | 0.000200 | 0.007869 |
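Since `load_best_model_at_end=True` with `metric_for_best_model="eval_loss"` keeps the checkpoint with the lowest validation loss, the selected step can be read directly off the table; a small sketch using the rows above:

```python
# (step, training_loss, validation_loss) rows from the table above.
rows = [
    (100000, 0.0591, 0.047132),
    (200000, 0.0303, 0.024423),
    (300000, 0.0208, 0.017365),
    (400000, 0.0159, 0.013130),
    (500000, 0.0129, 0.011025),
    (5900000, 0.0002, 0.007907),
    (6000000, 0.0002, 0.007920),
    (6100000, 0.0002, 0.007869),
]

# Mirror metric_for_best_model="eval_loss": pick the row with minimal validation loss.
best = min(rows, key=lambda r: r[2])
print(best[0])  # 6100000
```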
## How to Use

```python
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

MODEL_NAME = "hongggggggggggg/korea-slang-translator-kobert"

tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

# Test input data
input_text = "아 롤하는데 한타에서 졌어"

# Tokenize the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Model inference
output_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)

# Decode the generated text (generate returns a batch; take the first sequence)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Print results
print("Input Text:", input_text)
print("Generated Text:", output_text)
```