MilindChawre committed on
Commit da971a5 · 1 Parent(s): 02c1a7d

Adding hindi BPE tokenizer

Files changed (6)
  1. README.md +239 -2
  2. app.py +177 -0
  3. hindi_tokenizer.py +594 -0
  4. output/hindi_encoder.json +0 -0
  5. requirements.txt +5 -0
  6. use_tokenizer.py +40 -0
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Erav3 S11 Hindi Tokenizer
3
  emoji: 🌍
4
  colorFrom: yellow
5
  colorTo: indigo
@@ -10,4 +10,241 @@ pinned: false
10
  short_description: Hindi BPE tokenizer
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: Hindi BPE Tokenizer
3
  emoji: 🌍
4
  colorFrom: yellow
5
  colorTo: indigo
 
10
  short_description: Hindi BPE tokenizer
11
  ---
12
 
13
+ # Hindi BPE Tokenizer
14
+
15
+ This project preprocesses Hindi text and trains a Byte Pair Encoding (BPE) tokenizer tailored to the Hindi language. It automatically fetches and processes a segment of the IndicCorp Hindi dataset.
16
+
17
+ ## Key Features
18
+
19
+ - **Intelligent Dataset Management**:
20
+ - Downloads the initial 10GB of the IndicCorp Hindi dataset
21
+ - Capable of resuming interrupted downloads
22
+ - Samples 2 million lines from the first 3 million available
23
+ - Includes progress indicators for both downloading and processing
24
+
25
+ - **Text Preprocessing**:
26
+ - Filters to retain Devanagari characters (Unicode range: \u0900-\u097F) along with whitespace and basic punctuation
27
+ - Eliminates digits (both English and Devanagari)
28
+ - Normalizes punctuation (converts Hindi full stops '।' to '.')
29
+ - Cleans up whitespace
30
+
31
+ - **BPE Tokenizer Training** (a minimal merge-loop sketch follows this list):
32
+ - Enhanced training using numpy's vectorized operations
33
+ - Processes data in batches for improved efficiency
34
+ - Configurable vocabulary size: 5000 tokens
35
+ - Special tokens included: `<pad>`, `<unk>`, `<s>`, `</s>`
36
+ - Minimum token frequency set to 2
37
+ - Tracks progress with compression ratios
38
+
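+ A minimal sketch of the merge loop behind these features is shown below. It is an illustrative plain-Python version (function and variable names are made up for the example); the actual trainer in `hindi_tokenizer.py` batches merges and counts pairs with numpy:
+
+ ```
+ from collections import Counter
+
+ def bpe_merge_step(ids, vocab):
+     """Merge the most frequent adjacent pair of token ids into one new token."""
+     pair_counts = Counter(zip(ids, ids[1:]))
+     if not pair_counts:
+         return ids, vocab
+     (a, b), _ = pair_counts.most_common(1)[0]
+     new_id = len(vocab)
+     vocab[new_id] = vocab[a] + vocab[b]  # e.g. 'न' + 'म' -> 'नम'
+     merged, i = [], 0
+     while i < len(ids):
+         if i < len(ids) - 1 and ids[i] == a and ids[i + 1] == b:
+             merged.append(new_id)  # replace the pair with the new token
+             i += 2
+         else:
+             merged.append(ids[i])
+             i += 1
+     return merged, vocab
+ ```
+
+ Repeating this step until the vocabulary reaches 5000 entries is what produces the compression ratios reported in the training log.
+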
39
+ ## Prerequisites
40
+
41
+ To install the necessary packages, run:
42
+ ```
43
+ pip install numpy requests tqdm matplotlib
44
+ ```
45
+
46
+ ## Getting Started
47
+
48
+ 1. Execute the tokenizer training script:
49
+ ```
50
+ python hindi_tokenizer.py
51
+ ```
52
+
53
+ 2. Utilize the interactive encoder/decoder:
54
+ ```
55
+ python use_tokenizer.py
56
+ ```
57
+
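+ 3. (Optional) Launch the Streamlit web interface defined in `app.py`. `streamlit` is not listed in `requirements.txt`, so install it separately when running locally:
+ ```
+ pip install streamlit
+ streamlit run app.py
+ ```
+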
58
+ ## Directory Layout
59
+ ```
60
+ .
61
+ ├── hindi_tokenizer.py # Primary training script
62
+ ├── use_tokenizer.py # Tool for interactive encoding/decoding
63
+ ├── raw_hindi_dataset.txt # Downloaded dataset (10GB)
64
+ └── output/
65
+ ├── preprocessed_hindi.txt # Cleaned text output
66
+ └── hindi_encoder.json # Configuration for the tokenizer
67
+ ```
68
+
69
+ ## Dataset Information
70
+
71
+ - **Source**: IndicCorp Hindi Collection
72
+ - **URL**: https://objectstore.e2enetworks.net/ai4b-public-nlu-nlg/v1-indiccorp/hi.txt
73
+ - **Download Size**: First 10GB of a ~20GB file (see the resumable-download sketch below)
74
+ - **Training Sample**: 2,000,000 lines from the initial 3 million lines
75
+
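+ The download helper in `hindi_tokenizer.py` streams the file and resumes interrupted downloads with an HTTP `Range` header. A condensed sketch of that logic (the function name is illustrative; the cap matches the 10GB limit above):
+
+ ```
+ import requests
+ from pathlib import Path
+
+ URL = "https://objectstore.e2enetworks.net/ai4b-public-nlu-nlg/v1-indiccorp/hi.txt"
+
+ def fetch_partial(url: str, dest: Path, max_bytes: int = 10 * 1024**3) -> None:
+     """Stream up to max_bytes, resuming from an existing partial file."""
+     start = dest.stat().st_size if dest.exists() else 0
+     if start >= max_bytes:
+         return  # already have enough data
+     headers = {"Range": f"bytes={start}-"} if start else {}
+     with requests.get(url, stream=True, headers=headers) as response:
+         response.raise_for_status()
+         with open(dest, "ab" if start else "wb") as f:
+             for chunk in response.iter_content(chunk_size=8192):
+                 f.write(chunk)
+                 if f.tell() >= max_bytes:  # stop at the size cap
+                     break
+ ```
+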
76
+ ## Example Usage
77
+
78
+ ### Training the Tokenizer
79
+ ```
80
+ from hindi_tokenizer import main
81
+ # Train and retrieve the tokenizer
82
+ tokenizer = main()
83
+ ```
84
+
85
+ ### Utilizing the Trained Tokenizer
86
+ ```
87
+ from hindi_tokenizer import load_tokenizer, encode_text, decode_text
88
+ # Load the pre-existing tokenizer
89
+ tokenizer = load_tokenizer("output/hindi_encoder.json")
90
+ # Encode a sample text
91
+ text = "नमस्ते भारत!"
92
+ token_ids, tokens = encode_text(tokenizer, text)
93
+ print(f"Tokens: {tokens}")
94
+ print(f"Token IDs: {token_ids}")
95
+ # Decode back to the original text
96
+ decoded_text = decode_text(tokenizer, token_ids)
97
+ print(f"Decoded: {decoded_text}")
98
+ ```
99
+
100
+ ## Technical Insights
101
+
102
+ ### Preprocessing Steps
103
+ 1. Character filtering: `[^\u0900-\u097F\s।,.!?\-]`
104
+ 2. Removal of digits: `[0-9०-९]`
105
+ 3. Normalization of punctuation: `।` → `.`
106
+ 4. Whitespace normalization (the sketch below combines all four steps)
107
+
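+ Put together, these steps are a handful of regular expressions. The sketch below mirrors `preprocess_hindi_text` from `hindi_tokenizer.py` (the function name here is illustrative):
+
+ ```
+ import re
+
+ def clean_hindi(text: str) -> str:
+     text = re.sub(r"[^\u0900-\u097F\s।,.!?\-]", "", text)  # keep Devanagari, whitespace, basic punctuation
+     text = re.sub(r"[0-9०-९]", "", text)  # drop English and Devanagari digits
+     text = re.sub(r"।", ".", text)  # danda to full stop
+     return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
+ ```
+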
108
+ ### Tokenizer Settings
109
+ - Model Type: Byte Pair Encoding (BPE)
110
+ - Vocabulary Size: 5000
111
+ - Number of Special Tokens: 4
112
+ - Batch Size for Training: 1,000
113
+ - Interval for Statistics Tracking: 500
114
+ - Utilizes numpy for vectorized operations (these settings map onto the training call sketched below)
115
+
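+ These settings correspond to the constructor and training call in `hindi_tokenizer.py`. In the sketch below, `preprocessed_lines` is a placeholder for your list of cleaned Hindi strings:
+
+ ```
+ from hindi_tokenizer import BPETokenizer
+
+ tokenizer = BPETokenizer(vocab_size=5000)  # the 4 special tokens are added internally
+ tokenizer.train(preprocessed_lines, min_frequency=2, print_interval=500)
+ tokenizer.save("output/hindi_encoder.json")
+ ```
+
+ Note that `min_frequency` is accepted by `train()` but, as of this commit, is not referenced inside the merge loop.
+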
116
+ ### Performance Enhancements
117
+ - Vectorized operations based on Numpy
118
+ - Batch processing for merge operations
119
+ - Optimized memory usage
120
+ - Sliding window technique for pair counting (see the sketch after this list)
121
+ - Pre-allocated arrays for enhanced speed
122
+ - Updates to statistics in batches
123
+
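+ The pair counting mentioned above relies on numpy's sliding window view over the encoded id sequence. A condensed excerpt of the approach used inside `train()`, shown here on toy data:
+
+ ```
+ import numpy as np
+ from collections import Counter
+
+ data = np.array([5, 9, 5, 9, 7], dtype=np.int32)  # toy encoded sequence
+ pair_view = np.lib.stride_tricks.sliding_window_view(data, 2)  # all adjacent pairs
+ pair_counts = Counter(tuple(pair) for pair in pair_view)
+ best_pair, freq = pair_counts.most_common(1)[0]  # here: (5, 9), seen 2 times
+ ```
+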
124
+ ## Error Management
125
+
126
+ The script incorporates thorough error handling for:
127
+ - Network-related issues during downloads
128
+ - Resuming partial downloads
129
+ - File input/output operations
130
+ - Processing of the dataset
131
+ - Verification of compression ratios
132
+
133
+ ## BPE Tokenizer Training Logs
134
+ ```
135
+ (temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗ python hindi_tokenizer.py
136
+ Sufficient dataset already exists, skipping download.
137
+ Step 2: Preprocessing dataset...
138
+ Reading and preparing dataset...
139
+ Reading lines: 2000005it [00:01, 1093427.18it/s]
140
+ Cleaning and normalizing text...
141
+ 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000000/2000000 [00:17<00:00, 114213.87it/s]
142
+ Initializing vocabulary...
143
+ Computing initial frequencies...
144
+ Training BPE: 10%|███████████████████▌ | 500/4887 [05:05<14:23, 5.08it/s]
145
+ Iteration 613
146
+ Created token: 'रं' (merged 77,383 times)
147
+ Current vocabulary size: 613
148
+ Current data size: 266,508,022
149
+ Current compression ratio: 1.68
150
+ --------------------------------------------------------------------------------
151
+ Training BPE: 20%|██████████████████████████████████████▉ | 1000/4887 [06:42<12:09, 5.33it/s]
152
+ Iteration 1,113
153
+ Created token: 'ह,' (merged 14,825 times)
154
+ Current vocabulary size: 1,113
155
+ Current data size: 266,508,022
156
+ Current compression ratio: 1.74
157
+ --------------------------------------------------------------------------------
158
+ Training BPE: 31%|██████████████████████████████████████████████████████████▎ | 1500/4887 [09:55<06:43, 8.40it/s]
159
+ Iteration 1,613
160
+ Created token: 'ो ह' (merged 45,509 times)
161
+ Current vocabulary size: 1,613
162
+ Current data size: 266,508,022
163
+ Current compression ratio: 2.24
164
+ --------------------------------------------------------------------------------
165
+ Training BPE: 41%|█████████████████████████████████████████████████████████████████████████████▊ | 2000/4887 [10:51<05:14, 9.18it/s]
166
+ Iteration 2,113
167
+ Created token: 'पर्' (merged 26,421 times)
168
+ Current vocabulary size: 2,113
169
+ Current data size: 266,508,022
170
+ Current compression ratio: 2.39
171
+ --------------------------------------------------------------------------------
172
+ Training BPE: 51%|█████████████████████████████████████████████████████████████████████████████████████████████████▏ | 2499/4887 [13:17<03:45, 10.61it/s]
173
+ Iteration 2,613
174
+ Created token: 'हार ' (merged 15,505 times)
175
+ Current vocabulary size: 2,613
176
+ Current data size: 266,508,022
177
+ Current compression ratio: 2.66
178
+ --------------------------------------------------------------------------------
179
+ Training BPE: 61%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 2999/4887 [14:02<02:48, 11.22it/s]
180
+ Iteration 3,113
181
+ Created token: 'िले ' (merged 11,115 times)
182
+ Current vocabulary size: 3,113
183
+ Current data size: 266,508,022
184
+ Current compression ratio: 2.79
185
+ --------------------------------------------------------------------------------
186
+ Training BPE: 72%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 3500/4887 [16:13<01:57, 11.83it/s]
187
+ Iteration 3,613
188
+ Created token: 'ठाक' (merged 7,706 times)
189
+ Current vocabulary size: 3,613
190
+ Current data size: 266,508,022
191
+ Current compression ratio: 2.93
192
+ --------------------------------------------------------------------------------
193
+ Training BPE: 82%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 4000/4887 [16:54<01:11, 12.48it/s]
194
+ Iteration 4,113
195
+ Created token: 'ंग��' (merged 6,185 times)
196
+ Current vocabulary size: 4,113
197
+ Current data size: 266,508,022
198
+ Current compression ratio: 3.03
199
+ --------------------------------------------------------------------------------
200
+ Training BPE: 92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 4499/4887 [18:52<00:30, 12.78it/s]
201
+ Iteration 4,613
202
+ Created token: 'बेहद' (merged 4,949 times)
203
+ Current vocabulary size: 4,613
204
+ Current data size: 266,508,022
205
+ Current compression ratio: 3.13
206
+ --------------------------------------------------------------------------------
207
+ Training BPE: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4887/4887 [19:21<00:00, 4.21it/s]
208
+
209
+ Training completed. Final vocabulary size: 5000
210
+ Final compression ratio: 3.22
211
+
212
+ Tokenizer Test:
213
+ --------------------------------------------------
214
+ Original Text: फिर पानी भी कम मात्रा में
215
+
216
+ Tokens: ['फिर', 'पा', 'नी', 'भी', 'कम', 'मा', 'त्र', 'ा', 'में']
217
+ Token IDs: [4947, 215, 225, 210, 450, 172, 1314, 70, 1163]
218
+
219
+ Decoded Text: फिर पा नी भी कम मा त्र ा में
220
+ (temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗
221
+ ```
222
+
223
+ ## BPE Tokenizer Sample Usage Logs
224
+ ```
225
+ (temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗ python use_tokenizer.py
226
+ Loaded vocabulary size: 5000
227
+ Max token ID: 4999
228
+ Sample tokens: [(0, '<pad>'), (1, '<unk>'), (2, '<s>'), (3, '</s>'), (4, ' ')]
229
+ Hindi Text Encoder/Decoder (type 'quit' to exit)
230
+ --------------------------------------------------
231
+
232
+ Enter Hindi text to encode/decode: शब्दकोश एक बड़ी सूची या किताब होती है
233
+
234
+ Encoding:
235
+ Tokens: ['शब्द', 'को', 'श', 'एक', 'बड़', 'ी', 'सूच', 'ी', 'या', 'कि', 'ताब', 'होत', 'ी', 'है']
236
+ Token IDs: [3645, 150, 63, 259, 1767, 72, 3922, 72, 134, 151, 2092, 1484, 72, 132]
237
+
238
+ Decoding:
239
+ Text: शब्द को श एक बड़ ी सूच ी या कि ताब होत ी है
240
+
241
+ Enter Hindi text to encode/decode: quit
242
+ (temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗
243
+ ```
244
+
245
+ ## Contributions
246
+
247
+ We welcome you to report issues or submit pull requests for enhancements.
248
+
249
+ ## License
250
+ MIT License
app.py ADDED
@@ -0,0 +1,177 @@
1
+ import streamlit as st
2
+ from pathlib import Path
3
+ from hindi_tokenizer import load_tokenizer, encode_text, decode_text
4
+
5
+ def load_hindi_tokenizer():
6
+ """Load the trained Hindi BPE tokenizer"""
7
+ output_dir = Path("output")
8
+ config_path = output_dir / "hindi_encoder.json"
9
+
10
+ if not config_path.exists():
11
+ st.error("Error: Tokenizer configuration not found! Please train the tokenizer first.")
12
+ st.stop()
13
+
14
+ return load_tokenizer(str(config_path))
15
+
16
+ def main():
17
+ st.set_page_config(
18
+ page_title="Hindi BPE Tokenizer",
19
+ page_icon="🇮🇳",
20
+ layout="wide",
21
+ initial_sidebar_state="expanded"
22
+ )
23
+
24
+ # Set custom CSS for styling
25
+ st.markdown(
26
+ """
27
+ <style>
28
+ .stApp {
29
+ background-color: #2E2E2E; /* Dark background */
30
+ color: #FFFFFF; /* White text for better contrast */
31
+ }
32
+ .stButton {
33
+ background-color: #4CAF50; /* Green button */
34
+ color: white;
35
+ border-radius: 8px; /* Rounded corners */
36
+ padding: 10px 20px; /* Padding for buttons */
37
+ font-size: 16px; /* Larger font size */
38
+ }
39
+ .stButton:hover {
40
+ background-color: #45a049; /* Darker green on hover */
41
+ }
42
+ .stTextInput, .stTextArea {
43
+ background-color: #ffffff; /* White input fields */
44
+ border: 2px solid #4CAF50; /* Green border */
45
+ border-radius: 8px; /* Rounded corners */
46
+ padding: 10px; /* Padding for input fields */
47
+ font-size: 16px; /* Larger font size */
48
+ }
49
+ .stHeader {
50
+ color: #FFD700; /* Gold color for header text */
51
+ font-size: 28px; /* Larger header font size */
52
+ font-weight: bold; /* Bold header */
53
+ }
54
+ .stMarkdown {
55
+ color: #FFFFFF; /* White markdown text */
56
+ font-size: 16px; /* Larger markdown font size */
57
+ }
58
+ .stTextInput:focus, .stTextArea:focus {
59
+ border-color: #45a049; /* Change border color on focus */
60
+ box-shadow: 0 0 5px rgba(76, 175, 80, 0.5); /* Add shadow on focus */
61
+ }
62
+ </style>
63
+ """,
64
+ unsafe_allow_html=True
65
+ )
66
+
67
+ st.title("Hindi BPE Tokenizer")
68
+ st.markdown("A web interface for encoding and decoding Hindi text using BPE tokenization")
69
+
70
+ # Load tokenizer
71
+ try:
72
+ tokenizer = load_hindi_tokenizer()
73
+ except Exception as e:
74
+ st.error(f"Error loading tokenizer: {e}")
75
+ st.stop()
76
+
77
+ # Create two columns
78
+ encode_col, decode_col = st.columns(2)
79
+
80
+ # Encoding Section
81
+ with encode_col:
82
+ st.header("Encode Hindi Text")
83
+ st.markdown("Convert Hindi text into token IDs")
84
+
85
+ input_text = st.text_area(
86
+ "Enter Hindi Text",
87
+ placeholder="यहाँ हिंदी टेक्स्ट लिखें...",
88
+ height=150,
89
+ key="encode_input"
90
+ )
91
+
92
+ if st.button("Encode", key="encode_button"):
93
+ if input_text.strip():
94
+ try:
95
+ token_ids, tokens = encode_text(tokenizer, input_text)
96
+
97
+ st.subheader("Results:")
98
+ st.markdown("**Tokens:**")
99
+ st.write(tokens)
100
+
101
+ st.markdown("**Token IDs:**")
102
+ st.write(token_ids)
103
+
104
+ # Display as comma-separated string for easy copying
105
+ st.markdown("**Token IDs (comma-separated):**")
106
+ st.code(", ".join(map(str, token_ids)))
107
+
108
+ except Exception as e:
109
+ st.error(f"Error during encoding: {e}")
110
+ else:
111
+ st.warning("Please enter some text to encode")
112
+
113
+ # Decoding Section
114
+ with decode_col:
115
+ st.header("Decode Token IDs")
116
+ st.markdown("Convert token IDs back to Hindi text")
117
+
118
+ input_ids = st.text_area(
119
+ "Enter Token IDs (comma-separated)",
120
+ placeholder="2197, 1024, 402, 7, 924...",
121
+ height=150,
122
+ key="decode_input"
123
+ )
124
+
125
+ if st.button("Decode", key="decode_button"):
126
+ if input_ids.strip():
127
+ try:
128
+ # Convert string of IDs to list of integers
129
+ token_ids = [int(id.strip()) for id in input_ids.split(",")]
130
+
131
+ decoded_text = decode_text(tokenizer, token_ids)
132
+
133
+ st.subheader("Results:")
134
+ st.markdown("**Decoded Text:**")
135
+ st.write(decoded_text)
136
+
137
+ # Display in a box for better visibility
138
+ st.text_area(
139
+ "Decoded Text (copyable)",
140
+ value=decoded_text,
141
+ height=100,
142
+ key="decoded_output"
143
+ )
144
+
145
+ except ValueError:
146
+ st.error("Invalid input format. Please enter comma-separated numbers.")
147
+ except Exception as e:
148
+ st.error(f"Error during decoding: {e}")
149
+ else:
150
+ st.warning("Please enter token IDs to decode")
151
+
152
+ # Add information section at the bottom
153
+ st.markdown("---")
154
+ st.markdown("### About the Tokenizer")
155
+
156
+ info_col1, info_col2 = st.columns(2)
157
+
158
+ with info_col1:
159
+ st.markdown("""
160
+ **Tokenizer Details:**
161
+ - Type: Byte Pair Encoding (BPE)
162
+ - Vocabulary Size: 5000 tokens
163
+ - Special Tokens: `<pad>`, `<unk>`, `<s>`, `</s>`
164
+ - Minimum Token Frequency: 2
165
+ """)
166
+
167
+ with info_col2:
168
+ st.markdown("""
169
+ **Preprocessing:**
170
+ - Retains Hindi Unicode (\\u0900-\\u097F)
171
+ - Removes digits and special characters
172
+ - Normalizes punctuation
173
+ - Cleans whitespace
174
+ """)
175
+
176
+ if __name__ == "__main__":
177
+ main()
hindi_tokenizer.py ADDED
@@ -0,0 +1,594 @@
1
+ import re
2
+ import requests
3
+ from pathlib import Path
4
+ from collections import defaultdict, Counter
5
+ from tqdm import tqdm
6
+ import matplotlib.pyplot as plt
7
+ import json
8
+ import numpy as np
9
+
10
+ class TrieNode:
11
+ """Node in the prefix tree (trie) for fast token matching"""
12
+ def __init__(self):
13
+ self.children = {}
14
+ self.is_token = False
15
+ self.token = None
16
+
17
+ class BPETokenizer:
18
+ def __init__(self, vocab_size=5000):
19
+ self.vocab_size = vocab_size
20
+ self.chars = [] # List of unique characters
21
+ self.stoi = {} # String to index mapping
22
+ self.itos = {} # Index to string mapping
23
+ self.data = [] # Encoded text data
24
+ self.special_tokens = ["<pad>", "<unk>", "<s>", "</s>"]
25
+
26
+ # Statistics tracking
27
+ self.stats = {
28
+ "vocab_sizes": [],
29
+ "data_sizes": [],
30
+ "compression_ratios": [],
31
+ "merge_counts": [],
32
+ "tokens_created": [],
33
+ "max_token_lengths": [1],
34
+ }
35
+
36
+ self.original_length = 0
37
+ self.max_token_length = 1
38
+
39
+ def initialize_vocab(self, text):
40
+ """Initialize vocabulary from characters in text"""
41
+ # Preprocess text first
42
+ text = preprocess_hindi_text(text)
43
+
44
+ # Get unique characters and add special tokens
45
+ chars = sorted(list(set(text)))
46
+ all_tokens = self.special_tokens + chars
47
+
48
+ # Create mappings
49
+ self.stoi = {ch: i for i, ch in enumerate(all_tokens)}
50
+ self.itos = {i: ch for i, ch in enumerate(all_tokens)}
51
+
52
+ # Initial encoding
53
+ self.data = [self.stoi[c] for c in text]
54
+ self.original_length = len(self.data)
55
+
56
+ # Initialize stats
57
+ self.stats["vocab_sizes"].append(len(self.stoi))
58
+ self.stats["data_sizes"].append(len(self.data))
59
+ self.stats["compression_ratios"].append(1.0)
60
+
61
+ def get_digram_stats(self):
62
+ """Optimized digram counting using Counter"""
63
+ # Pre-compute pairs for all data at once
64
+ pairs = zip(self.data, self.data[1:])
65
+ return Counter((int(pair[0]), int(pair[1])) for pair in pairs)
66
+
67
+ def replace_byte_pair_in_data(self, pair, new_token):
68
+ """Optimized pair replacement using numpy"""
69
+ """Replace occurrences of a token pair with a new token (sequential reference implementation; train() uses a faster vectorized path)"""
70
+ i = 0
71
+ result = []
72
+
73
+ # Plain Python scan over the array (no numpy vectorization happens in this loop)
74
+ while i < len(data) - 1:
75
+ if data[i] == pair[0] and data[i + 1] == pair[1]:
76
+ result.append(new_token)
77
+ i += 2
78
+ else:
79
+ result.append(data[i])
80
+ i += 1
81
+
82
+ if i == len(data) - 1:
83
+ result.append(data[-1])
84
+
85
+ return result
86
+
87
+ def encode_pair(self, pair):
88
+ """Add a new token to vocabulary from pair"""
89
+ pair_str = self.itos[pair[0]] + self.itos[pair[1]]
90
+ next_idx = len(self.itos)
91
+ self.stoi[pair_str] = next_idx
92
+ self.itos[next_idx] = pair_str
93
+
94
+ # Update max token length
95
+ self.max_token_length = max(self.max_token_length, len(pair_str))
96
+ return next_idx
97
+
98
+ def train(self, texts, min_frequency=2, print_interval=500):
99
+ """Optimized BPE training with vectorized operations"""
100
+ # Combine all texts and initialize vocab
101
+ print("Initializing vocabulary...")
102
+ full_text = " ".join(texts)
103
+ self.initialize_vocab(full_text)
104
+
105
+ # Convert data to numpy array for faster operations
106
+ data = np.array(self.data, dtype=np.int32)
107
+
108
+ # Pre-compute character frequencies using numpy
109
+ print("Computing initial frequencies...")
110
+ unique, counts = np.unique(data, return_counts=True)
111
+ char_freqs = dict(zip(unique, counts))
112
+
113
+ # Initialize progress bar
114
+ pbar = tqdm(total=self.vocab_size - len(self.stoi),
115
+ desc="Training BPE",
116
+ position=0)
117
+
118
+ # Batch processing parameters
119
+ batch_size = min(1000, self.vocab_size - len(self.stoi))
120
+ stats_buffer = []
121
+
122
+ while len(self.stoi) < self.vocab_size:
123
+ # Get pair frequencies using vectorized operations
124
+ # Create a view of consecutive pairs
125
+ pair_view = np.lib.stride_tricks.sliding_window_view(data, 2)
126
+
127
+ # Convert to tuples for counting
128
+ pairs = [tuple(pair) for pair in pair_view]
129
+ pair_counts = Counter(pairs)
130
+
131
+ if not pair_counts:
132
+ break
133
+
134
+ # Get top pairs for batch processing
135
+ top_pairs = sorted(pair_counts.items(), key=lambda x: x[1], reverse=True)[:batch_size]
136
+
137
+ # Process batch of pairs
138
+ for (token1, token2), freq in top_pairs:
139
+ if len(self.stoi) >= self.vocab_size:
140
+ break
141
+
142
+ # Create new token
143
+ new_idx = self.encode_pair((token1, token2))
144
+
145
+ # Vectorized pair replacement
146
+ # Create a boolean mask for matching pairs
147
+ pair_mask = (data[:-1] == token1) & (data[1:] == token2)
148
+ if not np.any(pair_mask):
149
+ continue
150
+
151
+ # Create new data array efficiently
152
+ indices = np.where(pair_mask)[0]
153
+ new_data = np.empty(len(data) - len(indices), dtype=np.int32)
154
+
155
+ # Fill new data array using vectorized operations
156
+ pos = 0
157
+ prev_idx = 0
158
+ for idx in indices:
159
+ # Copy unchanged elements
160
+ new_data[pos:pos + (idx - prev_idx)] = data[prev_idx:idx]
161
+ pos += idx - prev_idx
162
+ # Add merged token
163
+ new_data[pos] = new_idx
164
+ pos += 1
165
+ prev_idx = idx + 2
166
+
167
+ # Copy remaining elements
168
+ if prev_idx < len(data):
169
+ new_data[pos:] = data[prev_idx:]
170
+
171
+ data = new_data
172
+
173
+ # Update statistics
174
+ stats_buffer.append({
175
+ 'vocab_size': len(self.stoi),
176
+ 'data_size': len(data),
177
+ 'merge_count': freq,
178
+ 'new_token': self.itos[new_idx]
179
+ })
180
+
181
+ pbar.update(1)
182
+
183
+ # Batch update statistics
184
+ if len(stats_buffer) >= print_interval:
185
+ self._update_stats_batch(stats_buffer)
186
+ if print_interval:
187
+ self.print_progress(
188
+ len(self.stoi),
189
+ stats_buffer[-1]['new_token'],
190
+ stats_buffer[-1]['merge_count']
191
+ )
192
+ stats_buffer = []
193
+
194
+ # Final statistics update
195
+ if stats_buffer:
196
+ self._update_stats_batch(stats_buffer)
197
+
198
+ pbar.close()
199
+ self.data = data.tolist()
200
+
201
+ # Calculate final compression ratio
202
+ final_ratio = self.original_length / len(self.data)
203
+ print(f"\nTraining completed. Final vocabulary size: {len(self.stoi)}")
204
+ print(f"Final compression ratio: {final_ratio:.2f}")
205
+
206
+ def _update_stats_batch(self, stats_buffer):
207
+ """Update statistics in batch for better performance"""
208
+ if not stats_buffer:
209
+ return
210
+
211
+ # Update all statistics at once
212
+ self.stats["vocab_sizes"].extend(s['vocab_size'] for s in stats_buffer)
213
+ self.stats["data_sizes"].extend(s['data_size'] for s in stats_buffer)
214
+ self.stats["merge_counts"].extend(s['merge_count'] for s in stats_buffer)
215
+ self.stats["tokens_created"].extend(s['new_token'] for s in stats_buffer)
216
+
217
+ # Update compression ratios
218
+ new_ratios = [self.original_length / s['data_size'] for s in stats_buffer]
219
+ self.stats["compression_ratios"].extend(new_ratios)
220
+
221
+ # Update max token lengths
222
+ self.stats["max_token_lengths"].extend([self.max_token_length] * len(stats_buffer))
223
+
224
+ def print_progress(self, iteration, new_token, merge_count):
225
+ """Print training progress"""
226
+ print(f"\nIteration {iteration:,}")
227
+ print(f"Created token: '{new_token}' (merged {merge_count:,} times)")
228
+ print(f"Current vocabulary size: {len(self.stoi):,}")
229
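+ # NOTE: self.data is only reassigned after training finishes, so this value stays constant in the training log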
+ print(f"Current data size: {len(self.data):,}")
230
+ print(f"Current compression ratio: {self.stats['compression_ratios'][-1]:.2f}")
231
+ print("-" * 80)
232
+
233
+ def plot_statistics(self):
234
+ """Plot training statistics"""
235
+ fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
236
+
237
+ # Plot 1: Vocabulary Size vs Data Size
238
+ ax1.plot(self.stats["vocab_sizes"], self.stats["data_sizes"])
239
+ ax1.set_xlabel("Vocabulary Size")
240
+ ax1.set_ylabel("Dataset Size")
241
+ ax1.set_title("Vocabulary Size vs Dataset Size")
242
+
243
+ # Plot 2: Compression Ratio vs Vocabulary Size
244
+ ax2.plot(self.stats["vocab_sizes"], self.stats["compression_ratios"])
245
+ ax2.set_xlabel("Vocabulary Size")
246
+ ax2.set_ylabel("Compression Ratio")
247
+ ax2.set_title("Compression Ratio vs Vocabulary Size")
248
+
249
+ # Plot 3: Merge Counts Distribution
250
+ if self.stats["merge_counts"]:
251
+ ax3.hist(self.stats["merge_counts"], bins=30)
252
+ ax3.set_xlabel("Number of Merges")
253
+ ax3.set_ylabel("Frequency")
254
+ ax3.set_title("Distribution of Merge Counts")
255
+
256
+ # Plot 4: Token Lengths Over Time
257
+ if self.stats["tokens_created"]:
258
+ token_lengths = [len(token) for token in self.stats["tokens_created"]]
259
+ ax4.plot(range(len(token_lengths)), token_lengths)
260
+ ax4.set_xlabel("Merge Operation")
261
+ ax4.set_ylabel("New Token Length")
262
+ ax4.set_title("Token Length Evolution")
263
+
264
+ plt.tight_layout()
265
+ plt.show()
266
+
267
+ def save(self, filepath: str) -> None:
268
+ """Save tokenizer state to a JSON file"""
269
+ state = {
270
+ "stoi": self.stoi,
271
+ "itos": self.itos,
272
+ "max_token_length": self.max_token_length,
273
+ "stats": self.stats,
274
+ "special_tokens": self.special_tokens
275
+ }
276
+
277
+ with open(filepath, "w", encoding="utf-8") as f:
278
+ json.dump(state, f, ensure_ascii=False, indent=2)
279
+
280
+ @classmethod
281
+ def load(cls, filepath: str) -> "BPETokenizer":
282
+ """Load tokenizer state from a JSON file"""
283
+ with open(filepath, "r", encoding="utf-8") as f:
284
+ state = json.load(f)
285
+
286
+ # Create new instance
287
+ instance = cls()
288
+
289
+ # Convert string keys to integers in itos
290
+ instance.itos = {int(k): v for k, v in state["itos"].items()}
291
+ # Convert string values to integers in stoi
292
+ instance.stoi = {k: int(v) for k, v in state["stoi"].items()}
293
+ instance.max_token_length = state["max_token_length"]
294
+ instance.stats = state["stats"]
295
+ instance.special_tokens = state["special_tokens"]
296
+
297
+ # Debug info
298
+ print(f"Loaded vocabulary size: {len(instance.itos)}")
299
+ print(f"Max token ID: {max(instance.itos.keys())}")
300
+ print(f"Sample tokens: {list(instance.itos.items())[:5]}")
301
+
302
+ return instance
303
+
304
+ def encode(self, text: str):
305
+ """Convert text to token indices"""
306
+ # Preprocess input text
307
+ text = preprocess_hindi_text(text)
308
+
309
+ tokens = []
310
+ token_ids = []
311
+
312
+ # Split text into words
313
+ words = text.split()
314
+
315
+ for word in words:
316
+ # Try to find longest matching token
317
+ while word:
318
+ longest_match = None
319
+ for token, idx in sorted(self.stoi.items(), key=lambda x: len(x[0]), reverse=True):
320
+ if word.startswith(token):
321
+ longest_match = (token, idx)
322
+ break
323
+
324
+ if longest_match:
325
+ token, idx = longest_match
326
+ tokens.append(token)
327
+ token_ids.append(idx)
328
+ word = word[len(token):]
329
+ else:
330
+ # Skip unknown character and continue
331
+ word = word[1:]
332
+
333
+ return token_ids, tokens
334
+
335
+ def decode(self, token_ids: list) -> str:
336
+ """Convert token indices back to text with better error handling"""
337
+ decoded_tokens = []
338
+ max_id = max(self.itos.keys())
339
+
340
+ for idx in token_ids:
341
+ try:
342
+ # Convert to int and check range
343
+ idx = int(idx) if isinstance(idx, str) else idx
344
+ if idx < 0 or idx > max_id:
345
+ continue
346
+
347
+ # Get token from vocabulary
348
+ if idx in self.itos:
349
+ token = self.itos[idx]
350
+ if token not in self.special_tokens:
351
+ # Add token with space
352
+ decoded_tokens.append(token)
353
+
354
+ except (ValueError, KeyError):
355
+ continue
356
+
357
+ # Join all tokens with spaces and clean up extra spaces
358
+ result = " ".join(token for token in decoded_tokens if token.strip())
359
+ # Remove duplicate spaces and strip
360
+ result = " ".join(result.split())
361
+ return result
362
+
363
+ def download_dataset(url, filepath, max_size_gb=2):
364
+ """
365
+ Downloads a portion of the dataset with size limit and resume capability.
366
+
367
+ Args:
368
+ url (str): URL of the dataset
369
+ filepath (Path): Path where the file should be saved
370
+ max_size_gb (float): Maximum size to download in gigabytes
371
+ """
372
+ max_size_bytes = max_size_gb * 1024 * 1024 * 1024 # Convert GB to bytes
373
+
374
+ # Check if we already have enough data
375
+ if filepath.exists() and filepath.stat().st_size >= max_size_bytes:
376
+ print(f"Already have {max_size_gb}GB of data, skipping download.")
377
+ return
378
+
379
+ print(f"Downloading first {max_size_gb}GB from {url}")
380
+
381
+ # Get the current size if file exists (for resume)
382
+ current_size = filepath.stat().st_size if filepath.exists() else 0
383
+
384
+ # Set up headers for resume
385
+ headers = {'Range': f'bytes={current_size}-'} if current_size > 0 else {}
386
+
387
+ try:
388
+ response = requests.get(url, stream=True, headers=headers)
389
+ response.raise_for_status()
390
+
391
+ # Get file size for progress bar
392
+ total_size = min(
393
+ int(response.headers.get('content-length', 0)) + current_size,
394
+ max_size_bytes
395
+ )
396
+
397
+ mode = 'ab' if current_size > 0 else 'wb'
398
+ with open(filepath, mode) as file, tqdm(
399
+ desc="Downloading",
400
+ initial=current_size,
401
+ total=total_size,
402
+ unit='iB',
403
+ unit_scale=True,
404
+ unit_divisor=1024,
405
+ ) as progress_bar:
406
+ for data in response.iter_content(chunk_size=8192):
407
+ if not data:
408
+ break
409
+
410
+ size = file.write(data)
411
+ progress_bar.update(size)
412
+
413
+ # Check if we've reached the size limit
414
+ if file.tell() >= max_size_bytes:
415
+ print(f"\nReached {max_size_gb}GB limit, stopping download.")
416
+ break
417
+
418
+ except requests.exceptions.RequestException as e:
419
+ print(f"Error during download: {e}")
420
+ if filepath.exists():
421
+ print("Partial download remains available for resume.")
422
+ raise
423
+
424
+ def prepare_dataset(input_path, sample_size=None, max_lines=None):
425
+ """
426
+ Prepares the dataset by optionally sampling and basic cleaning.
427
+
428
+ Args:
429
+ input_path (Path): Path to the raw dataset
430
+ sample_size (int, optional): Number of lines to sample. If None, use entire dataset
431
+ max_lines (int, optional): Maximum number of lines to read from file
432
+
433
+ Returns:
434
+ list: Processed lines from the dataset
435
+ """
436
+ print("Reading and preparing dataset...")
437
+ lines = []
438
+
439
+ with open(input_path, 'r', encoding='utf-8') as file:
440
+ for i, line in enumerate(tqdm(file, desc="Reading lines")):
441
+ if max_lines and i >= max_lines:
442
+ break
443
+
444
+ if line.strip():
445
+ lines.append(line)
446
+ if sample_size and len(lines) >= sample_size:
447
+ break
448
+
449
+ return lines
450
+
451
+ def preprocess_hindi_text(text):
452
+ """
453
+ Preprocesses Hindi text by removing unwanted characters and normalizing punctuation.
454
+
455
+ Args:
456
+ text (str): Raw Hindi text input
457
+
458
+ Returns:
459
+ str: Cleaned and normalized text
460
+ """
461
+ # Remove <unk> tokens first
462
+ text = text.replace("<unk>", "")
463
+
464
+ # Retain Hindi characters and punctuation
465
+ text = re.sub(r"[^\u0900-\u097F\s।,.!?\-]", "", text)
466
+ # Remove digits (both English and Hindi)
467
+ text = re.sub(r"[0-9०-९]", "", text)
468
+ # Normalize full stops and whitespace
469
+ text = re.sub(r"।", ".", text)
470
+ text = re.sub(r"\s+", " ", text).strip()
471
+ return text
472
+
473
+ def calculate_compression_ratio(tokenizer, corpus_path):
474
+ """
475
+ Calculates the compression ratio for the tokenizer on the given corpus.
476
+
477
+ Args:
478
+ tokenizer (BPETokenizer): Trained BPE tokenizer
479
+ corpus_path (str): Path to the preprocessed corpus
480
+
481
+ Returns:
482
+ float: Compression ratio (characters/tokens)
483
+ """
484
+ with open(corpus_path, "r", encoding="utf-8") as file:
485
+ corpus = file.readlines()
486
+
487
+ total_chars = sum(len(line) for line in corpus)
488
+ total_tokens = sum(len(tokenizer.encode(line)[1]) for line in corpus)  # encode() returns (token_ids, tokens)
489
+ return total_chars / total_tokens
490
+
491
+ def encode_text(tokenizer, text):
492
+ cleaned_text = preprocess_hindi_text(text)
493
+ return tokenizer.encode(cleaned_text)
494
+
495
+ def decode_text(tokenizer, token_ids):
496
+ return tokenizer.decode(token_ids)
497
+
498
+ def test_tokenizer(tokenizer, test_text):
499
+ """
500
+ Tests the tokenizer by encoding and decoding sample text.
501
+
502
+ Args:
503
+ tokenizer (BPETokenizer): Trained BPE tokenizer
504
+ test_text (str): Sample text for testing
505
+ """
506
+ print("\nTokenizer Test:")
507
+ print("-" * 50)
508
+ print(f"Original Text: {test_text}")
509
+
510
+ # Encode
511
+ token_ids, tokens = encode_text(tokenizer, test_text)
512
+ print(f"\nTokens: {tokens}")
513
+ print(f"Token IDs: {token_ids}")
514
+
515
+ # Decode
516
+ decoded_text = decode_text(tokenizer, token_ids)
517
+ print(f"\nDecoded Text: {decoded_text}")
518
+
519
+ def main():
520
+ # Create output directory if it doesn't exist
521
+ output_dir = Path("output")
522
+ output_dir.mkdir(exist_ok=True)
523
+
524
+ # Dataset URL and paths
525
+ dataset_url = "https://objectstore.e2enetworks.net/ai4b-public-nlu-nlg/v1-indiccorp/hi.txt"
526
+ raw_dataset_path = Path("raw_hindi_dataset.txt")
527
+ preprocessed_path = output_dir / "preprocessed_hindi.txt"
528
+
529
+ # Step 1: Download dataset if it doesn't exist or is too small
530
+ if not raw_dataset_path.exists() or raw_dataset_path.stat().st_size < (10 * 1024 * 1024 * 1024):
531
+ print("Step 1: Downloading dataset (10GB limit)...")
532
+ try:
533
+ download_dataset(dataset_url, raw_dataset_path, max_size_gb=10)
534
+ except requests.exceptions.RequestException as e:
535
+ print(f"Error downloading dataset: {e}")
536
+ if not raw_dataset_path.exists():
537
+ return
538
+ print("Continuing with existing partial download...")
539
+ else:
540
+ print("Sufficient dataset already exists, skipping download.")
541
+
542
+ # Step 2: Prepare and preprocess the dataset
543
+ print("Step 2: Preprocessing dataset...")
544
+ try:
545
+ # Sample 2 Million lines from the first 3 Million lines
546
+ raw_data = prepare_dataset(
547
+ raw_dataset_path,
548
+ sample_size=2_000_000,
549
+ max_lines=3_000_000
550
+ )
551
+ except FileNotFoundError:
552
+ print(f"Error: Input file '{raw_dataset_path}' not found!")
553
+ return
554
+ except Exception as e:
555
+ print(f"Error preparing dataset: {e}")
556
+ return
557
+
558
+ # Preprocess the text
559
+ print("Cleaning and normalizing text...")
560
+ preprocessed_data = [preprocess_hindi_text(line) for line in tqdm(raw_data)]
561
+
562
+ # Save the preprocessed dataset
563
+ with open(preprocessed_path, "w", encoding="utf-8") as file:
564
+ file.write("\n".join(preprocessed_data))
565
+
566
+ # Initialize and train our custom BPE tokenizer
567
+ tokenizer = BPETokenizer(vocab_size=5000)
568
+ tokenizer.train(preprocessed_data, min_frequency=2)
569
+
570
+ # Save the tokenizer
571
+ config_path = output_dir / "hindi_encoder.json"
572
+ tokenizer.save(str(config_path))
573
+
574
+ # Test the tokenizer
575
+ #test_text = "नमस्ते भारत! यह एक परीक्षण वाक्य है।"
576
+ test_text = "फिर पानी भी कम मात्रा में"
577
+ test_tokenizer(tokenizer, test_text)
578
+
579
+ return tokenizer
580
+
581
+ def load_tokenizer(config_path):
582
+ """
583
+ Loads a previously trained tokenizer from a configuration file.
584
+
585
+ Args:
586
+ config_path (str): Path to the tokenizer configuration file
587
+
588
+ Returns:
589
+ BPETokenizer: Loaded tokenizer
590
+ """
591
+ return BPETokenizer.load(config_path)
592
+
593
+ if __name__ == "__main__":
594
+ main()
output/hindi_encoder.json ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ numpy
2
+ requests
3
+ tqdm
4
+ matplotlib
5
+
use_tokenizer.py ADDED
@@ -0,0 +1,40 @@
1
+ from pathlib import Path
2
+ from hindi_tokenizer import load_tokenizer, encode_text, decode_text
3
+
4
+ def main():
5
+ # Load the trained tokenizer
6
+ output_dir = Path("output")
7
+ config_path = output_dir / "hindi_encoder.json"
8
+
9
+ if not config_path.exists():
10
+ print("Error: Tokenizer configuration not found! Please train the tokenizer first.")
11
+ return
12
+
13
+ tokenizer = load_tokenizer(str(config_path))
14
+
15
+ # Interactive loop
16
+ print("Hindi Text Encoder/Decoder (type 'quit' to exit)")
17
+ print("-" * 50)
18
+
19
+ while True:
20
+ text = input("\nEnter Hindi text to encode/decode: ")
21
+
22
+ if text.lower() == 'quit':
23
+ break
24
+
25
+ if not text.strip():
26
+ continue
27
+
28
+ # Encode the text
29
+ token_ids, tokens = encode_text(tokenizer, text)
30
+ print("\nEncoding:")
31
+ print(f"Tokens: {tokens}")
32
+ print(f"Token IDs: {token_ids}")
33
+
34
+ # Decode back
35
+ decoded_text = decode_text(tokenizer, token_ids)
36
+ print("\nDecoding:")
37
+ print(f"Text: {decoded_text}")
38
+
39
+ if __name__ == "__main__":
40
+ main()