Japanese to Korean translator

A Japanese-to-Korean translation model built on EncoderDecoderModel, with bert-japanese as the encoder and kogpt2 as the decoder.

Usage

Demo

Please visit https://huggingface.co/spaces/sappho192/aihub-ja-ko-translator-demo

Dependencies (PyPI)

  • torch
  • transformers
  • fugashi
  • unidic-lite
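
Before running the inference code, it can help to confirm that these packages are importable. The sketch below is one way to do that with the standard library. The helper name `missing_packages` is made up for illustration, and it assumes `unidic-lite` is imported as `unidic_lite` (the PyPI name uses a hyphen, the module name an underscore).

```python
from importlib.util import find_spec

# Import names for the PyPI packages listed above.
# Note: the "unidic-lite" distribution is imported as "unidic_lite".
REQUIRED_MODULES = ["torch", "transformers", "fugashi", "unidic_lite"]


def missing_packages(module_names):
    """Return the names in module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

Anything reported as missing can then be installed from PyPI under the package names in the list above.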

Inference

from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)

import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)

model = EncoderDecoderModel.from_pretrained("sappho192/aihub-ja-ko-translator")

text = "初めまして。よろしくお願いします。"

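def translate(text_src):
    # Tokenize the Japanese input; only input_ids are passed to the model.
    inputs = src_tokenizer(text_src, return_attention_mask=False, return_token_type_ids=False, return_tensors='pt')
    # Generate Korean token ids, then drop the first and last token (typically the start and end markers).
    output = model.generate(**inputs, max_length=500)[0, 1:-1]
    # Decode the remaining ids back into text with the decoder's tokenizer.
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg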

print(translate(text))
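
To translate several sentences, the `translate` function above can be called in a loop. The wrapper below is a minimal sketch and not part of the original model card: `translate_many` is a hypothetical helper, and it takes the translation function as an argument so that it can be tried with a stand-in before the model is loaded.

```python
def translate_many(texts, translate_fn):
    """Translate each non-blank string in texts, keeping input order.

    translate_fn is any callable mapping one source string to one
    translated string, e.g. the translate() function defined above.
    Blank or whitespace-only entries are skipped rather than sent
    to the model.
    """
    results = []
    for text in texts:
        cleaned = text.strip()
        if not cleaned:
            continue
        results.append((cleaned, translate_fn(cleaned)))
    return results


if __name__ == "__main__":
    # Stand-in translator so the sketch runs without downloading the model.
    fake_translate = lambda s: f"<ko:{s}>"
    for src, trg in translate_many(["こんにちは。", "  ", "ありがとう。"], fake_translate):
        print(src, "->", trg)
```

With the model loaded, `translate_many(lines, translate)` would return `(source, translation)` pairs. Sentences are processed one at a time, which matches the single-sentence `translate` function but is likely slower than batched generation.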

Dataset

This model was trained on datasets from 'The Open AI Dataset Project (AI-Hub, South Korea)'.
Details on all of the data are available from 'AI-Hub (aihub.or.kr)'.
(For a corporation, organization, or individual located outside of Korea to use the AI data, a separate agreement is required with the performing organization and the Korea National Information Society Agency (NIA). To export the AI data outside the country, a separate agreement is likewise required with the performing organization and the NIA. Link)

This model is research carried out using datasets built with the support of the National Information Society Agency (NIA), funded by the Ministry of Science and ICT of Korea.

Dataset list

The training dataset was built by merging the following sub-datasets:

    1. 일상생활 및 구어체 한-중, 한-일 번역 병렬 말뭉치 데이터 [Link]
    2. 한국어-다국어(영어 제외) 번역 말뭉치(기술과학) [Link]
    3. 한국어-다국어 번역 말뭉치(기초과학) [Link]
    4. 한국어-다국어 번역 말뭉치 (인문학) [Link]
    5. 한국어-일본어 번역 말뭉치 [Link]

To reproduce the merged dataset, use the code at the link below:
https://github.com/sappho192/aihub-translation-dataset
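
The linked repository holds the actual preprocessing code. Purely as an illustration of what merging several parallel corpora involves, the sketch below concatenates lists of (Japanese, Korean) sentence pairs and drops empty and duplicate pairs. The function name, the pair format, and the de-duplication step are assumptions made for this example and may differ from what the repository does.

```python
def merge_parallel_corpora(*corpora):
    """Merge corpora of (ja, ko) pairs into one list.

    Hypothetical helper: strips whitespace, skips pairs with an empty
    side, and keeps only the first occurrence of each pair.
    """
    seen = set()
    merged = []
    for corpus in corpora:
        for ja, ko in corpus:
            pair = (ja.strip(), ko.strip())
            if not pair[0] or not pair[1]:
                continue
            if pair in seen:
                continue
            seen.add(pair)
            merged.append(pair)
    return merged


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the AI-Hub datasets.
    daily = [("おはよう。", "안녕."), ("ありがとう。", "고마워.")]
    science = [("ありがとう。", "고마워."), ("", "빈 문장")]
    print(merge_parallel_corpora(daily, science))
```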
