title: Hindi BPE Tokenizer
emoji: ๐ŸŒ
colorFrom: yellow
colorTo: indigo
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
short_description: Hindi BPE tokenizer

Hindi BPE Tokenizer

This Python script preprocesses Hindi text and trains a Byte Pair Encoding (BPE) tokenizer for Hindi. It automatically downloads and processes a portion of the IndicCorp Hindi dataset.

Key Features

  • Intelligent Dataset Management:

    • Downloads the initial 10GB of the IndicCorp Hindi dataset
    • Capable of resuming interrupted downloads
    • Samples 2 million lines from the first 3 million available
    • Includes progress indicators for both downloading and processing
  • Text Preprocessing:

    • Keeps only Devanagari characters (Unicode range: \u0900-\u097F), whitespace, and basic punctuation
    • Removes digits (both ASCII and Devanagari)
    • Normalizes punctuation (converts the Hindi full stop '।' to '.')
    • Cleans up whitespace
  • BPE Tokenizer Training:

    • Faster training using NumPy's vectorized operations
    • Processes data in batches for improved efficiency
    • Configurable vocabulary size: 5000 tokens
    • Special tokens included: <pad>, <unk>, <s>, </s>
    • Minimum token frequency set to 2
    • Tracks progress with compression ratios
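
The sampling step listed under dataset management (2 million lines drawn from the first 3 million) could be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `sample_lines`, its parameters, and the fixed seed are assumptions.

```python
import random
from itertools import islice


def sample_lines(path, sample_size, pool_size, seed=0):
    """Pick sample_size lines at random from the first pool_size lines of path.

    Hypothetical helper: the real script may sample differently.
    """
    with open(path, encoding="utf-8") as handle:
        # Read only the head of the file; the rest is never touched.
        pool = [line.rstrip("\n") for line in islice(handle, pool_size)]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(pool, min(sample_size, len(pool)))
```

For the numbers quoted above, the call would be `sample_lines("raw_hindi_dataset.txt", 2_000_000, 3_000_000)`. Holding the 3 million lines in memory keeps the sketch short; a streaming approach would use less memory.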

Prerequisites

To install the necessary packages, run:

pip install numpy requests tqdm matplotlib

Getting Started

  1. Run the tokenizer training script:
python hindi_tokenizer.py
  2. Run the interactive encoder/decoder:
python use_tokenizer.py

Directory Layout

.
├── hindi_tokenizer.py # Primary training script
├── use_tokenizer.py # Tool for interactive encoding/decoding
├── raw_hindi_dataset.txt # Downloaded dataset (10GB)
└── output/
    ├── preprocessed_hindi.txt # Cleaned text output
    └── hindi_encoder.json # Configuration for the tokenizer

Dataset Information

Example Usage

Training the Tokenizer

from hindi_tokenizer import main
# Train and retrieve the tokenizer
tokenizer = main()

Utilizing the Trained Tokenizer

from hindi_tokenizer import load_tokenizer, encode_text, decode_text
# Load the pre-existing tokenizer
tokenizer = load_tokenizer("output/hindi_encoder.json")
# Encode a sample text
text = "नमस्ते भारत!"
token_ids, tokens = encode_text(tokenizer, text)
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
# Decode back to the original text
decoded_text = decode_text(tokenizer, token_ids)
print(f"Decoded: {decoded_text}")

Technical Insights

Preprocessing Steps

  1. Character filtering: [^\u0900-\u097F\s।,.!?\-]
  2. Removal of digits: [0-9०-९]
  3. Normalization of punctuation: । → .
  4. Whitespace normalization

Tokenizer Settings

  • Model Type: Byte Pair Encoding (BPE)
  • Vocabulary Size: 5000
  • Number of Special Tokens: 4
  • Batch Size for Training: 1,000
  • Interval for Statistics Tracking: 500
  • Uses NumPy for vectorized operations

Performance Enhancements

  • NumPy-based vectorized operations
  • Batch processing for merge operations
  • Optimized memory usage
  • Sliding window technique for pair counting
  • Pre-allocated arrays for enhanced speed
  • Updates to statistics in batches
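
As an illustration of the sliding-window pair counting mentioned above, the following NumPy sketch counts adjacent token-ID pairs without a Python loop. The function name `count_pairs` and the trick of packing each pair into one integer key are assumptions made for this example; the script may do this differently.

```python
import numpy as np


def count_pairs(ids, vocab_size):
    """Return the most frequent adjacent ID pair and its count (illustrative)."""
    left, right = ids[:-1], ids[1:]  # two shifted views form the width-2 window
    # Pack each (left, right) pair into a single integer key.
    keys = left.astype(np.int64) * vocab_size + right
    unique_keys, counts = np.unique(keys, return_counts=True)
    best = int(np.argmax(counts))
    best_pair = divmod(int(unique_keys[best]), vocab_size)
    return best_pair, int(counts[best])


ids = np.array([1, 2, 1, 2, 3])
print(count_pairs(ids, vocab_size=10))
```

The packing only works when every ID is smaller than `vocab_size`, which holds for a vocabulary capped at 5000.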

Error Management

The script includes error handling for:

  • Network-related issues during downloads
  • Resuming partial downloads
  • File input/output operations
  • Processing of the dataset
  • Verification of compression ratios

BPE Tokenizer Training Logs

(temporary) ➜  erav3-s11-hindi-tokenizer git:(master) ✗ python hindi_tokenizer.py
Sufficient dataset already exists, skipping download.
Step 2: Preprocessing dataset...
Reading and preparing dataset...
Reading lines: 2000005it [00:01, 1093427.18it/s]
Cleaning and normalizing text...
100%|██████████| 2000000/2000000 [00:17<00:00, 114213.87it/s]
Initializing vocabulary...
Computing initial frequencies...
Training BPE:  10%|█         | 500/4887 [05:05<14:23,  5.08it/s]
Iteration 613
Created token: 'रं' (merged 77,383 times)
Current vocabulary size: 613
Current data size: 266,508,022
Current compression ratio: 1.68
--------------------------------------------------------------------------------
Training BPE:  20%|██        | 1000/4887 [06:42<12:09,  5.33it/s]
Iteration 1,113
Created token: 'ह,' (merged 14,825 times)
Current vocabulary size: 1,113
Current data size: 266,508,022
Current compression ratio: 1.74
--------------------------------------------------------------------------------
Training BPE:  31%|███       | 1500/4887 [09:55<06:43,  8.40it/s]
Iteration 1,613
Created token: 'ो ह' (merged 45,509 times)
Current vocabulary size: 1,613
Current data size: 266,508,022
Current compression ratio: 2.24
--------------------------------------------------------------------------------
Training BPE:  41%|████      | 2000/4887 [10:51<05:14,  9.18it/s]
Iteration 2,113
Created token: 'पर्' (merged 26,421 times)
Current vocabulary size: 2,113
Current data size: 266,508,022
Current compression ratio: 2.39
--------------------------------------------------------------------------------
Training BPE:  51%|█████     | 2499/4887 [13:17<03:45, 10.61it/s]
Iteration 2,613
Created token: 'हार ' (merged 15,505 times)
Current vocabulary size: 2,613
Current data size: 266,508,022
Current compression ratio: 2.66
--------------------------------------------------------------------------------
Training BPE:  61%|██████    | 2999/4887 [14:02<02:48, 11.22it/s]
Iteration 3,113
Created token: 'िले ' (merged 11,115 times)
Current vocabulary size: 3,113
Current data size: 266,508,022
Current compression ratio: 2.79
--------------------------------------------------------------------------------
Training BPE:  72%|███████   | 3500/4887 [16:13<01:57, 11.83it/s]
Iteration 3,613
Created token: 'ठाक' (merged 7,706 times)
Current vocabulary size: 3,613
Current data size: 266,508,022
Current compression ratio: 2.93
--------------------------------------------------------------------------------
Training BPE:  82%|████████  | 4000/4887 [16:54<01:11, 12.48it/s]
Iteration 4,113
Created token: 'ंगठ' (merged 6,185 times)
Current vocabulary size: 4,113
Current data size: 266,508,022
Current compression ratio: 3.03
--------------------------------------------------------------------------------
Training BPE:  92%|█████████ | 4499/4887 [18:52<00:30, 12.78it/s]
Iteration 4,613
Created token: 'बेहद' (merged 4,949 times)
Current vocabulary size: 4,613
Current data size: 266,508,022
Current compression ratio: 3.13
--------------------------------------------------------------------------------
Training BPE: 100%|██████████| 4887/4887 [19:21<00:00,  4.21it/s]

Training completed. Final vocabulary size: 5000
Final compression ratio: 3.22

Tokenizer Test:
--------------------------------------------------
Original Text: फिर पानी भी कम मात्रा में

Tokens: ['फिर', 'पा', 'नी', 'भी', 'कम', 'मा', 'त्र', 'ा', 'में']
Token IDs: [4947, 215, 225, 210, 450, 172, 1314, 70, 1163]

Decoded Text: फिर पा नी भी कम मा त्र ा में
(temporary) ➜  erav3-s11-hindi-tokenizer git:(master) ✗
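
The log above reports a compression ratio but this README does not define it. One common definition, assumed here, is the length of the original character sequence divided by the number of tokens after merging. The helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(original_length, token_count):
    """Characters per token under the assumed definition (illustrative only)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return original_length / token_count


# Example: 12 characters encoded as 4 tokens.
print(compression_ratio(12, 4))
```

Under this definition, a higher ratio means each token covers more characters on average, which matches the upward trend in the log as the vocabulary grows.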

BPE Tokenizer Sample Usage Logs

(temporary) ➜  erav3-s11-hindi-tokenizer git:(master) ✗ python use_tokenizer.py
Loaded vocabulary size: 5000
Max token ID: 4999
Sample tokens: [(0, '<pad>'), (1, '<unk>'), (2, '<s>'), (3, '</s>'), (4, ' ')]
Hindi Text Encoder/Decoder (type 'quit' to exit)
--------------------------------------------------

Enter Hindi text to encode/decode: शब्दकोश एक बड़ी सूची या किताब होती है

Encoding:
Tokens: ['शब्द', 'को', 'श', 'एक', 'बड़', 'ी', 'सूच', 'ी', 'या', 'कि', 'ताब', 'होत', 'ी', 'है']
Token IDs: [3645, 150, 63, 259, 1767, 72, 3922, 72, 134, 151, 2092, 1484, 72, 132]

Decoding:
Text: शब्द को श एक बड़ ी सूच ी या कि ताब होत ी है

Enter Hindi text to encode/decode: quit
(temporary) ➜  erav3-s11-hindi-tokenizer git:(master) ✗

Contributions

Issue reports and pull requests with improvements are welcome.

License

MIT License