Commit da971a5
Parent(s): 02c1a7d

Adding hindi BPE tokenizer

Files changed:
- README.md +239 -2
- app.py +177 -0
- hindi_tokenizer.py +594 -0
- output/hindi_encoder.json +0 -0
- requirements.txt +5 -0
- use_tokenizer.py +40 -0
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title:
+title: Hindi BPE Tokenizer
 emoji: 🌍
 colorFrom: yellow
 colorTo: indigo
@@ -10,4 +10,241 @@ pinned: false
 short_description: Hindi BPE tokenizer
 ---

The remainder of the diff adds the new README body:
# Hindi BPE Tokenizer

This Python script preprocesses Hindi text and trains a Byte Pair Encoding (BPE) tokenizer tailored to the Hindi language. It automatically fetches and processes a segment of the IndicCorp Hindi dataset.

## Key Features

- **Intelligent Dataset Management**:
  - Downloads the first 10GB of the IndicCorp Hindi dataset
  - Can resume interrupted downloads
  - Samples 2 million lines from the first 3 million available
  - Shows progress indicators for both downloading and processing

- **Text Preprocessing**:
  - Keeps only Hindi characters (Unicode range \u0900-\u097F), whitespace, and basic punctuation
  - Removes digits (both English and Devanagari)
  - Normalizes punctuation (converts the Hindi full stop '।' to '.')
  - Cleans up whitespace

- **BPE Tokenizer Training**:
  - Fast training using numpy's vectorized operations
  - Processes merge candidates in batches for efficiency
  - Configurable vocabulary size: 5000 tokens
  - Special tokens included: `<pad>`, `<unk>`, `<s>`, `</s>`
  - Minimum token frequency: 2
  - Tracks progress with compression ratios

## Prerequisites

To install the necessary packages, run:
```
pip install numpy requests tqdm matplotlib
```

## Getting Started

1. Run the tokenizer training script:
```
python hindi_tokenizer.py
```

2. Run the interactive encoder/decoder:
```
python use_tokenizer.py
```

## Directory Layout
```
.
├── hindi_tokenizer.py        # Primary training script
├── use_tokenizer.py          # Interactive encoding/decoding tool
├── raw_hindi_dataset.txt     # Downloaded dataset (first 10GB)
└── output/
    ├── preprocessed_hindi.txt   # Cleaned text output
    └── hindi_encoder.json       # Saved tokenizer configuration
```

## Dataset Information

- **Source**: IndicCorp Hindi Collection
- **URL**: https://objectstore.e2enetworks.net/ai4b-public-nlu-nlg/v1-indiccorp/hi.txt
- **Download Size**: First 10GB of a ~20GB file
- **Training Sample**: 2,000,000 lines taken from the first 3,000,000 lines
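
For reference, `main()` in `hindi_tokenizer.py` fetches and samples the corpus with calls along these lines (condensed from the source; the paths and limits shown are the ones the pipeline uses):

```
from pathlib import Path
from hindi_tokenizer import download_dataset, prepare_dataset

url = "https://objectstore.e2enetworks.net/ai4b-public-nlu-nlg/v1-indiccorp/hi.txt"
raw_path = Path("raw_hindi_dataset.txt")

# Download at most the first 10GB; an existing partial file is resumed.
download_dataset(url, raw_path, max_size_gb=10)

# Keep up to 2,000,000 non-empty lines from the first 3,000,000 lines read.
raw_lines = prepare_dataset(raw_path, sample_size=2_000_000, max_lines=3_000_000)
```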

## Example Usage

### Training the Tokenizer
```
from hindi_tokenizer import main

# Train and retrieve the tokenizer
tokenizer = main()
```

### Using the Trained Tokenizer
```
from hindi_tokenizer import load_tokenizer, encode_text, decode_text

# Load the pre-existing tokenizer
tokenizer = load_tokenizer("output/hindi_encoder.json")

# Encode a sample text
text = "नमस्ते भारत!"
token_ids, tokens = encode_text(tokenizer, text)
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Decode back to text
decoded_text = decode_text(tokenizer, token_ids)
print(f"Decoded: {decoded_text}")
```

## Technical Insights

### Preprocessing Steps
1. Character filtering: anything matching `[^\u0900-\u097F\s।,.!?\-]` (i.e., outside Devanagari, whitespace, and basic punctuation) is removed
2. Digit removal: `[0-9०-९]`
3. Punctuation normalization: `।` → `.`
4. Whitespace normalization
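
These steps are implemented by `preprocess_hindi_text()` in `hindi_tokenizer.py`; a condensed, self-contained equivalent (the `clean_hindi` wrapper name is illustrative) looks like this:

```
import re

def clean_hindi(text: str) -> str:
    text = text.replace("<unk>", "")                        # drop stray <unk> markers
    text = re.sub(r"[^\u0900-\u097F\s।,.!?\-]", "", text)   # keep Devanagari, whitespace, punctuation
    text = re.sub(r"[0-9०-९]", "", text)                    # drop English and Devanagari digits
    text = re.sub(r"।", ".", text)                          # normalize the Hindi full stop
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace

print(clean_hindi("नमस्ते भारत! 2024 ।"))  # -> नमस्ते भारत! .
```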

### Tokenizer Settings
- Model Type: Byte Pair Encoding (BPE)
- Vocabulary Size: 5000
- Number of Special Tokens: 4
- Batch Size for Training: 1,000
- Interval for Statistics Tracking: 500
- Utilizes numpy for vectorized operations
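
In code, these settings map onto the `BPETokenizer` API defined in `hindi_tokenizer.py`. A minimal sketch, assuming the preprocessing step has already written `output/preprocessed_hindi.txt` (the batch size of 1,000 is fixed inside `train()`):

```
from hindi_tokenizer import BPETokenizer

with open("output/preprocessed_hindi.txt", encoding="utf-8") as f:
    texts = f.read().splitlines()

# vocab_size counts the 4 special tokens, the initial characters, and all merges.
tokenizer = BPETokenizer(vocab_size=5000)
tokenizer.train(texts, min_frequency=2, print_interval=500)
tokenizer.save("output/hindi_encoder.json")
```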

### Performance Enhancements
- NumPy-based vectorized operations
- Batch processing for merge operations
- Optimized memory usage
- Sliding window technique for pair counting
- Pre-allocated arrays for enhanced speed
- Statistics updated in batches
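
The heart of the speed-up is counting every adjacent token pair in one pass rather than looping in Python; `BPETokenizer.train()` does this with a sliding window over the encoded corpus. A minimal sketch of the idea on a toy ID sequence:

```
import numpy as np
from collections import Counter

# Encoded corpus as token IDs (toy example).
data = np.array([5, 6, 5, 6, 7, 5, 6], dtype=np.int32)

# View every adjacent (left, right) pair without copying the array.
pairs = np.lib.stride_tricks.sliding_window_view(data, 2)
counts = Counter((int(a), int(b)) for a, b in pairs)
print(counts.most_common(1))  # [((5, 6), 3)] -> merge (5, 6) into a new token ID

# A boolean mask marks where the winning pair starts, so every merge
# position is found in a single vectorized comparison.
left, right = counts.most_common(1)[0][0]
mask = (data[:-1] == left) & (data[1:] == right)
print(np.where(mask)[0])  # [0 2 5]
```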

## Error Management

The script incorporates thorough error handling for:
- Network-related issues during downloads
- Resuming partial downloads
- File input/output operations
- Processing of the dataset
- Verification of compression ratios
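
As an illustration of the resume behaviour: `download_dataset()` sends an HTTP `Range` header when a partial file already exists on disk, so an interrupted download continues where it stopped instead of restarting. A trimmed-down version of that pattern (the size cap and progress bar of the real function are omitted, and the function name here is illustrative):

```
import requests
from pathlib import Path

def resume_download(url: str, path: Path, chunk_size: int = 8192) -> None:
    # Start where the previous attempt stopped, if a partial file exists.
    current = path.stat().st_size if path.exists() else 0
    headers = {"Range": f"bytes={current}-"} if current else {}

    response = requests.get(url, stream=True, headers=headers)
    response.raise_for_status()

    # Append to the partial file instead of overwriting it.
    with open(path, "ab" if current else "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                f.write(chunk)
```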

## BPE Tokenizer Training Logs
```
(temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗ python hindi_tokenizer.py
Sufficient dataset already exists, skipping download.
Step 2: Preprocessing dataset...
Reading and preparing dataset...
Reading lines: 2000005it [00:01, 1093427.18it/s]
Cleaning and normalizing text...
100%|████████████████████| 2000000/2000000 [00:17<00:00, 114213.87it/s]
Initializing vocabulary...
Computing initial frequencies...
Training BPE:  10%|██                  | 500/4887 [05:05<14:23, 5.08it/s]
Iteration 613
Created token: 'रं' (merged 77,383 times)
Current vocabulary size: 613
Current data size: 266,508,022
Current compression ratio: 1.68
--------------------------------------------------------------------------------
Training BPE:  20%|████                | 1000/4887 [06:42<12:09, 5.33it/s]
Iteration 1,113
Created token: 'ह,' (merged 14,825 times)
Current vocabulary size: 1,113
Current data size: 266,508,022
Current compression ratio: 1.74
--------------------------------------------------------------------------------
Training BPE:  31%|██████              | 1500/4887 [09:55<06:43, 8.40it/s]
Iteration 1,613
Created token: 'ो ह' (merged 45,509 times)
Current vocabulary size: 1,613
Current data size: 266,508,022
Current compression ratio: 2.24
--------------------------------------------------------------------------------
Training BPE:  41%|████████            | 2000/4887 [10:51<05:14, 9.18it/s]
Iteration 2,113
Created token: 'पर्' (merged 26,421 times)
Current vocabulary size: 2,113
Current data size: 266,508,022
Current compression ratio: 2.39
--------------------------------------------------------------------------------
Training BPE:  51%|██████████          | 2499/4887 [13:17<03:45, 10.61it/s]
Iteration 2,613
Created token: 'हार ' (merged 15,505 times)
Current vocabulary size: 2,613
Current data size: 266,508,022
Current compression ratio: 2.66
--------------------------------------------------------------------------------
Training BPE:  61%|████████████        | 2999/4887 [14:02<02:48, 11.22it/s]
Iteration 3,113
Created token: 'िले ' (merged 11,115 times)
Current vocabulary size: 3,113
Current data size: 266,508,022
Current compression ratio: 2.79
--------------------------------------------------------------------------------
Training BPE:  72%|██████████████      | 3500/4887 [16:13<01:57, 11.83it/s]
Iteration 3,613
Created token: 'ठाक' (merged 7,706 times)
Current vocabulary size: 3,613
Current data size: 266,508,022
Current compression ratio: 2.93
--------------------------------------------------------------------------------
Training BPE:  82%|████████████████    | 4000/4887 [16:54<01:11, 12.48it/s]
Iteration 4,113
Created token: 'ंग��' (merged 6,185 times)
Current vocabulary size: 4,113
Current data size: 266,508,022
Current compression ratio: 3.03
--------------------------------------------------------------------------------
Training BPE:  92%|██████████████████  | 4499/4887 [18:52<00:30, 12.78it/s]
Iteration 4,613
Created token: 'बेहद' (merged 4,949 times)
Current vocabulary size: 4,613
Current data size: 266,508,022
Current compression ratio: 3.13
--------------------------------------------------------------------------------
Training BPE: 100%|████████████████████| 4887/4887 [19:21<00:00, 4.21it/s]

Training completed. Final vocabulary size: 5000
Final compression ratio: 3.22

Tokenizer Test:
--------------------------------------------------
Original Text: फिर पानी भी कम मात्रा में

Tokens: ['फिर', 'पा', 'नी', 'भी', 'कम', 'मा', 'त्र', 'ा', 'में']
Token IDs: [4947, 215, 225, 210, 450, 172, 1314, 70, 1163]

Decoded Text: फिर पा नी भी कम मा त्र ा में
(temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗
```

## BPE Tokenizer Sample Usage Logs
```
(temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗ python use_tokenizer.py
Loaded vocabulary size: 5000
Max token ID: 4999
Sample tokens: [(0, '<pad>'), (1, '<unk>'), (2, '<s>'), (3, '</s>'), (4, ' ')]
Hindi Text Encoder/Decoder (type 'quit' to exit)
--------------------------------------------------

Enter Hindi text to encode/decode: शब्दकोश एक बड़ी सूची या किताब होती है

Encoding:
Tokens: ['शब्द', 'को', 'श', 'एक', 'बड़', 'ी', 'सूच', 'ी', 'या', 'कि', 'ताब', 'होत', 'ी', 'है']
Token IDs: [3645, 150, 63, 259, 1767, 72, 3922, 72, 134, 151, 2092, 1484, 72, 132]

Decoding:
Text: शब्द को श एक बड़ ी सूच ी या कि ताब होत ी है

Enter Hindi text to encode/decode: quit
(temporary) ➜ erav3-s11-hindi-tokenizer git:(master) ✗
```

## Contributions

We welcome you to report issues or submit pull requests for enhancements.

## License

MIT License

app.py ADDED

@@ -0,0 +1,177 @@
```
import streamlit as st
from pathlib import Path
from hindi_tokenizer import load_tokenizer, encode_text, decode_text

def load_hindi_tokenizer():
    """Load the trained Hindi BPE tokenizer"""
    output_dir = Path("output")
    config_path = output_dir / "hindi_encoder.json"

    if not config_path.exists():
        st.error("Error: Tokenizer configuration not found! Please train the tokenizer first.")
        st.stop()

    return load_tokenizer(str(config_path))

def main():
    st.set_page_config(
        page_title="Hindi BPE Tokenizer",
        page_icon="🇮🇳",
        layout="wide",
        initial_sidebar_state="expanded"
    )

    # Set custom CSS for styling
    st.markdown(
        """
        <style>
        .stApp {
            background-color: #2E2E2E; /* Dark background */
            color: #FFFFFF; /* White text for better contrast */
        }
        .stButton {
            background-color: #4CAF50; /* Green button */
            color: white;
            border-radius: 8px; /* Rounded corners */
            padding: 10px 20px; /* Padding for buttons */
            font-size: 16px; /* Larger font size */
        }
        .stButton:hover {
            background-color: #45a049; /* Darker green on hover */
        }
        .stTextInput, .stTextArea {
            background-color: #ffffff; /* White input fields */
            border: 2px solid #4CAF50; /* Green border */
            border-radius: 8px; /* Rounded corners */
            padding: 10px; /* Padding for input fields */
            font-size: 16px; /* Larger font size */
        }
        .stHeader {
            color: #FFD700; /* Gold color for header text */
            font-size: 28px; /* Larger header font size */
            font-weight: bold; /* Bold header */
        }
        .stMarkdown {
            color: #FFFFFF; /* White markdown text */
            font-size: 16px; /* Larger markdown font size */
        }
        .stTextInput:focus, .stTextArea:focus {
            border-color: #45a049; /* Change border color on focus */
            box-shadow: 0 0 5px rgba(76, 175, 80, 0.5); /* Add shadow on focus */
        }
        </style>
        """,
        unsafe_allow_html=True
    )

    st.title("Hindi BPE Tokenizer")
    st.markdown("A web interface for encoding and decoding Hindi text using BPE tokenization")

    # Load tokenizer
    try:
        tokenizer = load_hindi_tokenizer()
    except Exception as e:
        st.error(f"Error loading tokenizer: {e}")
        st.stop()

    # Create two columns
    encode_col, decode_col = st.columns(2)

    # Encoding Section
    with encode_col:
        st.header("Encode Hindi Text")
        st.markdown("Convert Hindi text into token IDs")

        input_text = st.text_area(
            "Enter Hindi Text",
            placeholder="यहाँ हिंदी टेक्स्ट लिखें...",
            height=150,
            key="encode_input"
        )

        if st.button("Encode", key="encode_button"):
            if input_text.strip():
                try:
                    token_ids, tokens = encode_text(tokenizer, input_text)

                    st.subheader("Results:")
                    st.markdown("**Tokens:**")
                    st.write(tokens)

                    st.markdown("**Token IDs:**")
                    st.write(token_ids)

                    # Display as comma-separated string for easy copying
                    st.markdown("**Token IDs (comma-separated):**")
                    st.code(", ".join(map(str, token_ids)))

                except Exception as e:
                    st.error(f"Error during encoding: {e}")
            else:
                st.warning("Please enter some text to encode")

    # Decoding Section
    with decode_col:
        st.header("Decode Token IDs")
        st.markdown("Convert token IDs back to Hindi text")

        input_ids = st.text_area(
            "Enter Token IDs (comma-separated)",
            placeholder="2197, 1024, 402, 7, 924...",
            height=150,
            key="decode_input"
        )

        if st.button("Decode", key="decode_button"):
            if input_ids.strip():
                try:
                    # Convert string of IDs to list of integers
                    token_ids = [int(id.strip()) for id in input_ids.split(",")]

                    decoded_text = decode_text(tokenizer, token_ids)

                    st.subheader("Results:")
                    st.markdown("**Decoded Text:**")
                    st.write(decoded_text)

                    # Display in a box for better visibility
                    st.text_area(
                        "Decoded Text (copyable)",
                        value=decoded_text,
                        height=100,
                        key="decoded_output"
                    )

                except ValueError:
                    st.error("Invalid input format. Please enter comma-separated numbers.")
                except Exception as e:
                    st.error(f"Error during decoding: {e}")
            else:
                st.warning("Please enter token IDs to decode")

    # Add information section at the bottom
    st.markdown("---")
    st.markdown("### About the Tokenizer")

    info_col1, info_col2 = st.columns(2)

    with info_col1:
        st.markdown("""
**Tokenizer Details:**
- Type: Byte Pair Encoding (BPE)
- Vocabulary Size: 5000 tokens
- Special Tokens: `<pad>`, `<unk>`, `<s>`, `</s>`
- Minimum Token Frequency: 2
        """)

    with info_col2:
        st.markdown("""
**Preprocessing:**
- Retains Hindi Unicode (\\u0900-\\u097F)
- Removes digits and special characters
- Normalizes punctuation
- Cleans whitespace
        """)

if __name__ == "__main__":
    main()
```

hindi_tokenizer.py ADDED

@@ -0,0 +1,594 @@
```
import re
import requests
from pathlib import Path
from collections import defaultdict, Counter
from tqdm import tqdm
import matplotlib.pyplot as plt
import json
import numpy as np

class TrieNode:
    """Node in the prefix tree (trie) for fast token matching"""
    def __init__(self):
        self.children = {}
        self.is_token = False
        self.token = None

class BPETokenizer:
    def __init__(self, vocab_size=5000):
        self.vocab_size = vocab_size
        self.chars = []  # List of unique characters
        self.stoi = {}   # String to index mapping
        self.itos = {}   # Index to string mapping
        self.data = []   # Encoded text data
        self.special_tokens = ["<pad>", "<unk>", "<s>", "</s>"]

        # Statistics tracking
        self.stats = {
            "vocab_sizes": [],
            "data_sizes": [],
            "compression_ratios": [],
            "merge_counts": [],
            "tokens_created": [],
            "max_token_lengths": [1],
        }

        self.original_length = 0
        self.max_token_length = 1

    def initialize_vocab(self, text):
        """Initialize vocabulary from characters in text"""
        # Preprocess text first
        text = preprocess_hindi_text(text)

        # Get unique characters and add special tokens
        chars = sorted(list(set(text)))
        all_tokens = self.special_tokens + chars

        # Create mappings
        self.stoi = {ch: i for i, ch in enumerate(all_tokens)}
        self.itos = {i: ch for i, ch in enumerate(all_tokens)}

        # Initial encoding
        self.data = [self.stoi[c] for c in text]
        self.original_length = len(self.data)

        # Initialize stats
        self.stats["vocab_sizes"].append(len(self.stoi))
        self.stats["data_sizes"].append(len(self.data))
        self.stats["compression_ratios"].append(1.0)

    def get_digram_stats(self):
        """Optimized digram counting using Counter"""
        # Pre-compute pairs for all data at once
        pairs = zip(self.data, self.data[1:])
        return Counter((int(pair[0]), int(pair[1])) for pair in pairs)

    def replace_byte_pair_in_data(self, pair, new_token):
        """Optimized pair replacement using numpy"""
        data = np.array(self.data)
        i = 0
        result = []

        # Use numpy's vectorized operations
        while i < len(data) - 1:
            if data[i] == pair[0] and data[i + 1] == pair[1]:
                result.append(new_token)
                i += 2
            else:
                result.append(data[i])
                i += 1

        if i == len(data) - 1:
            result.append(data[-1])

        return result

    def encode_pair(self, pair):
        """Add a new token to vocabulary from pair"""
        pair_str = self.itos[pair[0]] + self.itos[pair[1]]
        next_idx = len(self.itos)
        self.stoi[pair_str] = next_idx
        self.itos[next_idx] = pair_str

        # Update max token length
        self.max_token_length = max(self.max_token_length, len(pair_str))
        return next_idx

    def train(self, texts, min_frequency=2, print_interval=500):
        """Optimized BPE training with vectorized operations"""
        # Combine all texts and initialize vocab
        print("Initializing vocabulary...")
        full_text = " ".join(texts)
        self.initialize_vocab(full_text)

        # Convert data to numpy array for faster operations
        data = np.array(self.data, dtype=np.int32)

        # Pre-compute character frequencies using numpy
        print("Computing initial frequencies...")
        unique, counts = np.unique(data, return_counts=True)
        char_freqs = dict(zip(unique, counts))

        # Initialize progress bar
        pbar = tqdm(total=self.vocab_size - len(self.stoi),
                    desc="Training BPE",
                    position=0)

        # Batch processing parameters
        batch_size = min(1000, self.vocab_size - len(self.stoi))
        stats_buffer = []

        while len(self.stoi) < self.vocab_size:
            # Get pair frequencies using vectorized operations
            # Create a view of consecutive pairs
            pair_view = np.lib.stride_tricks.sliding_window_view(data, 2)

            # Convert to tuples for counting
            pairs = [tuple(pair) for pair in pair_view]
            pair_counts = Counter(pairs)

            if not pair_counts:
                break

            # Get top pairs for batch processing
            top_pairs = sorted(pair_counts.items(), key=lambda x: x[1], reverse=True)[:batch_size]

            # Process batch of pairs
            for (token1, token2), freq in top_pairs:
                if len(self.stoi) >= self.vocab_size:
                    break

                # Create new token
                new_idx = self.encode_pair((token1, token2))

                # Vectorized pair replacement
                # Create a boolean mask for matching pairs
                pair_mask = (data[:-1] == token1) & (data[1:] == token2)
                if not np.any(pair_mask):
                    continue

                # Create new data array efficiently
                indices = np.where(pair_mask)[0]
                new_data = np.empty(len(data) - len(indices), dtype=np.int32)

                # Fill new data array using vectorized operations
                pos = 0
                prev_idx = 0
                for idx in indices:
                    # Copy unchanged elements
                    new_data[pos:pos + (idx - prev_idx)] = data[prev_idx:idx]
                    pos += idx - prev_idx
                    # Add merged token
                    new_data[pos] = new_idx
                    pos += 1
                    prev_idx = idx + 2

                # Copy remaining elements
                if prev_idx < len(data):
                    new_data[pos:] = data[prev_idx:]

                data = new_data

                # Update statistics
                stats_buffer.append({
                    'vocab_size': len(self.stoi),
                    'data_size': len(data),
                    'merge_count': freq,
                    'new_token': self.itos[new_idx]
                })

                pbar.update(1)

                # Batch update statistics
                if len(stats_buffer) >= print_interval:
                    self._update_stats_batch(stats_buffer)
                    if print_interval:
                        self.print_progress(
                            len(self.stoi),
                            stats_buffer[-1]['new_token'],
                            stats_buffer[-1]['merge_count']
                        )
                    stats_buffer = []

        # Final statistics update
        if stats_buffer:
            self._update_stats_batch(stats_buffer)

        pbar.close()
        self.data = data.tolist()

        # Calculate final compression ratio
        final_ratio = self.original_length / len(self.data)
        print(f"\nTraining completed. Final vocabulary size: {len(self.stoi)}")
        print(f"Final compression ratio: {final_ratio:.2f}")

    def _update_stats_batch(self, stats_buffer):
        """Update statistics in batch for better performance"""
        if not stats_buffer:
            return

        # Update all statistics at once
        self.stats["vocab_sizes"].extend(s['vocab_size'] for s in stats_buffer)
        self.stats["data_sizes"].extend(s['data_size'] for s in stats_buffer)
        self.stats["merge_counts"].extend(s['merge_count'] for s in stats_buffer)
        self.stats["tokens_created"].extend(s['new_token'] for s in stats_buffer)

        # Update compression ratios
        new_ratios = [self.original_length / s['data_size'] for s in stats_buffer]
        self.stats["compression_ratios"].extend(new_ratios)

        # Update max token lengths
        self.stats["max_token_lengths"].extend([self.max_token_length] * len(stats_buffer))

    def print_progress(self, iteration, new_token, merge_count):
        """Print training progress"""
        print(f"\nIteration {iteration:,}")
        print(f"Created token: '{new_token}' (merged {merge_count:,} times)")
        print(f"Current vocabulary size: {len(self.stoi):,}")
        print(f"Current data size: {len(self.data):,}")
        print(f"Current compression ratio: {self.stats['compression_ratios'][-1]:.2f}")
        print("-" * 80)

    def plot_statistics(self):
        """Plot training statistics"""
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

        # Plot 1: Vocabulary Size vs Data Size
        ax1.plot(self.stats["vocab_sizes"], self.stats["data_sizes"])
        ax1.set_xlabel("Vocabulary Size")
        ax1.set_ylabel("Dataset Size")
        ax1.set_title("Vocabulary Size vs Dataset Size")

        # Plot 2: Compression Ratio vs Vocabulary Size
        ax2.plot(self.stats["vocab_sizes"], self.stats["compression_ratios"])
        ax2.set_xlabel("Vocabulary Size")
        ax2.set_ylabel("Compression Ratio")
        ax2.set_title("Compression Ratio vs Vocabulary Size")

        # Plot 3: Merge Counts Distribution
        if self.stats["merge_counts"]:
            ax3.hist(self.stats["merge_counts"], bins=30)
            ax3.set_xlabel("Number of Merges")
            ax3.set_ylabel("Frequency")
            ax3.set_title("Distribution of Merge Counts")

        # Plot 4: Token Lengths Over Time
        if self.stats["tokens_created"]:
            token_lengths = [len(token) for token in self.stats["tokens_created"]]
            ax4.plot(range(len(token_lengths)), token_lengths)
            ax4.set_xlabel("Merge Operation")
            ax4.set_ylabel("New Token Length")
            ax4.set_title("Token Length Evolution")

        plt.tight_layout()
        plt.show()

    def save(self, filepath: str) -> None:
        """Save tokenizer state to a JSON file"""
        state = {
            "stoi": self.stoi,
            "itos": self.itos,
            "max_token_length": self.max_token_length,
            "stats": self.stats,
            "special_tokens": self.special_tokens
        }

        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(state, f, ensure_ascii=False, indent=2)

    @classmethod
    def load(cls, filepath: str) -> "BPETokenizer":
        """Load tokenizer state from a JSON file"""
        with open(filepath, "r", encoding="utf-8") as f:
            state = json.load(f)

        # Create new instance
        instance = cls()

        # Convert string keys to integers in itos
        instance.itos = {int(k): v for k, v in state["itos"].items()}
        # Convert string values to integers in stoi
        instance.stoi = {k: int(v) for k, v in state["stoi"].items()}
        instance.max_token_length = state["max_token_length"]
        instance.stats = state["stats"]
        instance.special_tokens = state["special_tokens"]

        # Debug info
        print(f"Loaded vocabulary size: {len(instance.itos)}")
        print(f"Max token ID: {max(instance.itos.keys())}")
        print(f"Sample tokens: {list(instance.itos.items())[:5]}")

        return instance

    def encode(self, text: str):
        """Convert text to token indices"""
        # Preprocess input text
        text = preprocess_hindi_text(text)

        tokens = []
        token_ids = []

        # Split text into words
        words = text.split()

        for word in words:
            # Try to find longest matching token
            while word:
                longest_match = None
                for token, idx in sorted(self.stoi.items(), key=lambda x: len(x[0]), reverse=True):
                    if word.startswith(token):
                        longest_match = (token, idx)
                        break

                if longest_match:
                    token, idx = longest_match
                    tokens.append(token)
                    token_ids.append(idx)
                    word = word[len(token):]
                else:
                    # Skip unknown character and continue
                    word = word[1:]

        return token_ids, tokens

    def decode(self, token_ids: list) -> str:
        """Convert token indices back to text with better error handling"""
        decoded_tokens = []
        max_id = max(self.itos.keys())

        for idx in token_ids:
            try:
                # Convert to int and check range
                idx = int(idx) if isinstance(idx, str) else idx
                if idx < 0 or idx > max_id:
                    continue

                # Get token from vocabulary
                if idx in self.itos:
                    token = self.itos[idx]
                    if token not in self.special_tokens:
                        # Add token with space
                        decoded_tokens.append(token)

            except (ValueError, KeyError):
                continue

        # Join all tokens with spaces and clean up extra spaces
        result = " ".join(token for token in decoded_tokens if token.strip())
        # Remove duplicate spaces and strip
        result = " ".join(result.split())
        return result

def download_dataset(url, filepath, max_size_gb=2):
    """
    Downloads a portion of the dataset with size limit and resume capability.

    Args:
        url (str): URL of the dataset
        filepath (Path): Path where the file should be saved
        max_size_gb (float): Maximum size to download in gigabytes
    """
    max_size_bytes = max_size_gb * 1024 * 1024 * 1024  # Convert GB to bytes

    # Check if we already have enough data
    if filepath.exists() and filepath.stat().st_size >= max_size_bytes:
        print(f"Already have {max_size_gb}GB of data, skipping download.")
        return

    print(f"Downloading first {max_size_gb}GB from {url}")

    # Get the current size if file exists (for resume)
    current_size = filepath.stat().st_size if filepath.exists() else 0

    # Set up headers for resume
    headers = {'Range': f'bytes={current_size}-'} if current_size > 0 else {}

    try:
        response = requests.get(url, stream=True, headers=headers)
        response.raise_for_status()

        # Get file size for progress bar
        total_size = min(
            int(response.headers.get('content-length', 0)) + current_size,
            max_size_bytes
        )

        mode = 'ab' if current_size > 0 else 'wb'
        with open(filepath, mode) as file, tqdm(
            desc="Downloading",
            initial=current_size,
            total=total_size,
            unit='iB',
            unit_scale=True,
            unit_divisor=1024,
        ) as progress_bar:
            for data in response.iter_content(chunk_size=8192):
                if not data:
                    break

                size = file.write(data)
                progress_bar.update(size)

                # Check if we've reached the size limit
                if file.tell() >= max_size_bytes:
                    print(f"\nReached {max_size_gb}GB limit, stopping download.")
                    break

    except requests.exceptions.RequestException as e:
        print(f"Error during download: {e}")
        if filepath.exists():
            print("Partial download remains available for resume.")
        raise

def prepare_dataset(input_path, sample_size=None, max_lines=None):
    """
    Prepares the dataset by optionally sampling and basic cleaning.

    Args:
        input_path (Path): Path to the raw dataset
        sample_size (int, optional): Number of lines to sample. If None, use entire dataset
        max_lines (int, optional): Maximum number of lines to read from file

    Returns:
        list: Processed lines from the dataset
    """
    print("Reading and preparing dataset...")
    lines = []

    with open(input_path, 'r', encoding='utf-8') as file:
        for i, line in enumerate(tqdm(file, desc="Reading lines")):
            if max_lines and i >= max_lines:
                break

            if line.strip():
                lines.append(line)
                if sample_size and len(lines) >= sample_size:
                    break

    return lines

def preprocess_hindi_text(text):
    """
    Preprocesses Hindi text by removing unwanted characters and normalizing punctuation.

    Args:
        text (str): Raw Hindi text input

    Returns:
        str: Cleaned and normalized text
    """
    # Remove <unk> tokens first
    text = text.replace("<unk>", "")

    # Retain Hindi characters and punctuation
    text = re.sub(r"[^\u0900-\u097F\s।,.!?\-]", "", text)
    # Remove digits (both English and Hindi)
    text = re.sub(r"[0-9०-९]", "", text)
    # Normalize full stops and whitespace
    text = re.sub(r"।", ".", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def calculate_compression_ratio(tokenizer, corpus_path):
    """
    Calculates the compression ratio for the tokenizer on the given corpus.

    Args:
        tokenizer (Tokenizer): Trained BPE tokenizer
        corpus_path (str): Path to the preprocessed corpus

    Returns:
        float: Compression ratio (characters/tokens)
    """
    with open(corpus_path, "r", encoding="utf-8") as file:
        corpus = file.readlines()

    total_chars = sum(len(line) for line in corpus)
    # BPETokenizer.encode returns (token_ids, tokens); count the token list
    total_tokens = sum(len(tokenizer.encode(line)[1]) for line in corpus)
    return total_chars / total_tokens

def encode_text(tokenizer, text):
    cleaned_text = preprocess_hindi_text(text)
    return tokenizer.encode(cleaned_text)

def decode_text(tokenizer, token_ids):
    return tokenizer.decode(token_ids)

def test_tokenizer(tokenizer, test_text):
    """
    Tests the tokenizer by encoding and decoding sample text.

    Args:
        tokenizer (Tokenizer): Trained BPE tokenizer
        test_text (str): Sample text for testing
    """
    print("\nTokenizer Test:")
    print("-" * 50)
    print(f"Original Text: {test_text}")

    # Encode
    token_ids, tokens = encode_text(tokenizer, test_text)
    print(f"\nTokens: {tokens}")
    print(f"Token IDs: {token_ids}")

    # Decode
    decoded_text = decode_text(tokenizer, token_ids)
    print(f"\nDecoded Text: {decoded_text}")

def main():
    # Create output directory if it doesn't exist
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)

    # Dataset URL and paths
    dataset_url = "https://objectstore.e2enetworks.net/ai4b-public-nlu-nlg/v1-indiccorp/hi.txt"
    raw_dataset_path = Path("raw_hindi_dataset.txt")
    preprocessed_path = output_dir / "preprocessed_hindi.txt"

    # Step 1: Download dataset if it doesn't exist or is too small
    if not raw_dataset_path.exists() or raw_dataset_path.stat().st_size < (10 * 1024 * 1024 * 1024):
        print("Step 1: Downloading dataset (10GB limit)...")
        try:
            download_dataset(dataset_url, raw_dataset_path, max_size_gb=10)
        except requests.exceptions.RequestException as e:
            print(f"Error downloading dataset: {e}")
            if not raw_dataset_path.exists():
                return
            print("Continuing with existing partial download...")
    else:
        print("Sufficient dataset already exists, skipping download.")

    # Step 2: Prepare and preprocess the dataset
    print("Step 2: Preprocessing dataset...")
    try:
        # Sample 2 million lines from the first 3 million lines
        raw_data = prepare_dataset(
            raw_dataset_path,
            sample_size=2_000_000,
            max_lines=3_000_000
        )
    except FileNotFoundError:
        print(f"Error: Input file '{raw_dataset_path}' not found!")
        return
    except Exception as e:
        print(f"Error preparing dataset: {e}")
        return

    # Preprocess the text
    print("Cleaning and normalizing text...")
    preprocessed_data = [preprocess_hindi_text(line) for line in tqdm(raw_data)]

    # Save the preprocessed dataset
    with open(preprocessed_path, "w", encoding="utf-8") as file:
        file.write("\n".join(preprocessed_data))

    # Initialize and train our custom BPE tokenizer
    tokenizer = BPETokenizer(vocab_size=5000)
    tokenizer.train(preprocessed_data, min_frequency=2)

    # Save the tokenizer
    config_path = output_dir / "hindi_encoder.json"
    tokenizer.save(str(config_path))

    # Test the tokenizer
    # test_text = "नमस्ते भारत! यह एक परीक्षण वाक्य है।"
    test_text = "फिर पानी भी कम मात्रा में"
    test_tokenizer(tokenizer, test_text)

    return tokenizer

def load_tokenizer(config_path):
    """
    Loads a previously trained tokenizer from a configuration file.

    Args:
        config_path (str): Path to the tokenizer configuration file

    Returns:
        Tokenizer: Loaded tokenizer
    """
    return BPETokenizer.load(config_path)

if __name__ == "__main__":
    main()
```

output/hindi_encoder.json ADDED

The diff for this file is too large to render. See the raw diff.

requirements.txt ADDED

@@ -0,0 +1,5 @@
```
numpy
requests
tqdm
matplotlib
```

use_tokenizer.py ADDED

@@ -0,0 +1,40 @@
```
from pathlib import Path
from hindi_tokenizer import load_tokenizer, encode_text, decode_text

def main():
    # Load the trained tokenizer
    output_dir = Path("output")
    config_path = output_dir / "hindi_encoder.json"

    if not config_path.exists():
        print("Error: Tokenizer configuration not found! Please train the tokenizer first.")
        return

    tokenizer = load_tokenizer(str(config_path))

    # Interactive loop
    print("Hindi Text Encoder/Decoder (type 'quit' to exit)")
    print("-" * 50)

    while True:
        text = input("\nEnter Hindi text to encode/decode: ")

        if text.lower() == 'quit':
            break

        if not text.strip():
            continue

        # Encode the text
        token_ids, tokens = encode_text(tokenizer, text)
        print("\nEncoding:")
        print(f"Tokens: {tokens}")
        print(f"Token IDs: {token_ids}")

        # Decode back
        decoded_text = decode_text(tokenizer, token_ids)
        print("\nDecoding:")
        print(f"Text: {decoded_text}")

if __name__ == "__main__":
    main()
```