peteparker456 committed on
Commit 1cfe633 · verified · 1 Parent(s): 65f25ef

Upload tokenizer

Files changed (6)
  1. README.md +199 -0
  2. merges.txt +144 -0
  3. special_tokens_map.json +5 -0
  4. tokenizer.json +1021 -0
  5. tokenizer_config.json +20 -0
  6. vocab.json +1 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
merges.txt ADDED
@@ -0,0 +1,144 @@
+ #version: 0.2
+ a a
+ e n
+ a n
+ Ġ p
+ g a
+ t h
+ l a
+ d h
+ k u
+ r a
+ n a
+ d i
+ i n
+ i r
+ Ġ m
+ Ġ s
+ Ġ e
+ dh u
+ e r
+ Ġp an
+ k ku
+ u m
+ e en
+ a m
+ d a
+ k a
+ n aa
+ dh a
+ c h
+ o n
+ een ga
+ ir u
+ e s
+ o r
+ Ġ en
+ th u
+ p a
+ Ġ iru
+ n ga
+ a l
+ d u
+ t i
+ a h
+ i l
+ o o
+ a r
+ t y
+ k i
+ Ġ n
+ Ġm u
+ Ġ b
+ Ġ th
+ Ġ v
+ o m
+ Ġ a
+ e e
+ m a
+ t t
+ Ġ c
+ Ġ k
+ y um
+ l i
+ Ġ f
+ Ġen na
+ yum aa
+ p pa
+ r e
+ v a
+ Ġ h
+ Ġ aa
+ k k
+ n u
+ Ġpan na
+ o da
+ ra m
+ e l
+ Ġ naa
+ aa n
+ y a
+ Ġp a
+ Ġp o
+ Ġmu di
+ ppa di
+ l an
+ à ®
+ l o
+ am il
+ Ġiru kk
+ Ġmudi yumaa
+ e a
+ k ka
+ l u
+ Ġ ka
+ en d
+ ra dhu
+ na i
+ i di
+ u n
+ ti m
+ r u
+ u nga
+ v an
+ in g
+ Ġe ppadi
+ s t
+ u r
+ v e
+ y aa
+ Ġ t
+ Ġs aa
+ er i
+ Ġpan n
+ Ġn eenga
+ l la
+ s h
+ Ġ w
+ Ġ re
+ en t
+ th a
+ Ġe v
+ es t
+ tim e
+ g u
+ i g
+ r i
+ Ġ in
+ aa nga
+ Ġp idi
+ s i
+ Ġs e
+ Ġiru kku
+ Ġev lo
+ c e
+ c om
+ h t
+ l e
+ s e
+ an i
+ Ġp aa
+ da y
+ lu kku
+ T amil
+ l aa
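The merge list in `merges.txt` is ordered by priority: a pair that appears earlier is merged before a pair that appears later. The sketch below illustrates that rule, assuming the standard greedy BPE merge loop. `MERGES` is a hand-picked subset of the pairs above, kept in their original relative order, and `apply_merges` is an illustrative helper that is not part of this commit.

```python
# Subset of the pairs from merges.txt, in the same relative order (assumed
# sufficient for the two example words below).
MERGES = [
    ("a", "n"), ("Ġ", "p"), ("n", "a"), ("Ġp", "an"), ("a", "m"),
    ("i", "l"), ("Ġpan", "na"), ("am", "il"), ("T", "amil"),
]


def apply_merges(symbols, merges):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known pair is left in this word
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# "Ġ" stands for a leading space, so this is the word " panna".
panna = apply_merges(list("Ġpanna"), MERGES)
tamil = apply_merges(list("Tamil"), MERGES)
unknown = apply_merges(list("xyz"), MERGES)
```

With this subset, `panna` and `tamil` each collapse to a single symbol (`Ġpanna` and `Tamil`, both of which appear in the vocabulary), while `xyz` stays as three separate characters because no listed pair matches.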
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "bos_token": "<|endoftext|>",
+ "eos_token": "<|endoftext|>",
+ "unk_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,1021 @@
+ {
+ "version": "1.0",
+ "truncation": null,
+ "padding": null,
+ "added_tokens": [
+ {
+ "id": 0,
+ "content": "<|endoftext|>",
+ "single_word": false,
+ "lstrip": false,
+ "rstrip": false,
+ "normalized": true,
+ "special": true
+ }
+ ],
+ "normalizer": null,
+ "pre_tokenizer": {
+ "type": "ByteLevel",
+ "add_prefix_space": false,
+ "trim_offsets": true,
+ "use_regex": true
+ },
+ "post_processor": {
+ "type": "ByteLevel",
+ "add_prefix_space": true,
+ "trim_offsets": false,
+ "use_regex": true
+ },
+ "decoder": {
+ "type": "ByteLevel",
+ "add_prefix_space": true,
+ "trim_offsets": true,
+ "use_regex": true
+ },
+ "model": {
+ "type": "BPE",
+ "dropout": null,
+ "unk_token": null,
+ "continuing_subword_prefix": "",
+ "end_of_word_suffix": "",
+ "fuse_unk": false,
+ "byte_fallback": false,
+ "ignore_merges": false,
+ "vocab": {
+ "<|endoftext|>": 0,
+ "!": 1,
+ "\"": 2,
+ "#": 3,
+ "$": 4,
+ "%": 5,
+ "&": 6,
+ "'": 7,
+ "(": 8,
+ ")": 9,
+ "*": 10,
+ "+": 11,
+ ",": 12,
+ "-": 13,
+ ".": 14,
+ "/": 15,
+ "0": 16,
+ "1": 17,
+ "2": 18,
+ "3": 19,
+ "4": 20,
+ "5": 21,
+ "6": 22,
+ "7": 23,
+ "8": 24,
+ "9": 25,
+ ":": 26,
+ ";": 27,
+ "<": 28,
+ "=": 29,
+ ">": 30,
+ "?": 31,
+ "@": 32,
+ "A": 33,
+ "B": 34,
+ "C": 35,
+ "D": 36,
+ "E": 37,
+ "F": 38,
+ "G": 39,
+ "H": 40,
+ "I": 41,
+ "J": 42,
+ "K": 43,
+ "L": 44,
+ "M": 45,
+ "N": 46,
+ "O": 47,
+ "P": 48,
+ "Q": 49,
+ "R": 50,
+ "S": 51,
+ "T": 52,
+ "U": 53,
+ "V": 54,
+ "W": 55,
+ "X": 56,
+ "Y": 57,
+ "Z": 58,
+ "[": 59,
+ "\\": 60,
+ "]": 61,
+ "^": 62,
+ "_": 63,
+ "`": 64,
+ "a": 65,
+ "b": 66,
+ "c": 67,
+ "d": 68,
+ "e": 69,
+ "f": 70,
+ "g": 71,
+ "h": 72,
+ "i": 73,
+ "j": 74,
+ "k": 75,
+ "l": 76,
+ "m": 77,
+ "n": 78,
+ "o": 79,
+ "p": 80,
+ "q": 81,
+ "r": 82,
+ "s": 83,
+ "t": 84,
+ "u": 85,
+ "v": 86,
+ "w": 87,
+ "x": 88,
+ "y": 89,
+ "z": 90,
+ "{": 91,
+ "|": 92,
+ "}": 93,
+ "~": 94,
+ "¡": 95,
+ "¢": 96,
+ "£": 97,
+ "¤": 98,
+ "¥": 99,
+ "¦": 100,
+ "§": 101,
+ "¨": 102,
+ "©": 103,
+ "ª": 104,
+ "«": 105,
+ "¬": 106,
+ "®": 107,
+ "¯": 108,
+ "°": 109,
+ "±": 110,
+ "²": 111,
+ "³": 112,
+ "´": 113,
+ "µ": 114,
+ "¶": 115,
+ "·": 116,
+ "¸": 117,
+ "¹": 118,
+ "º": 119,
+ "»": 120,
+ "¼": 121,
+ "½": 122,
+ "¾": 123,
+ "¿": 124,
+ "À": 125,
+ "Á": 126,
+ "Â": 127,
+ "Ã": 128,
+ "Ä": 129,
+ "Å": 130,
+ "Æ": 131,
+ "Ç": 132,
+ "È": 133,
+ "É": 134,
+ "Ê": 135,
+ "Ë": 136,
+ "Ì": 137,
+ "Í": 138,
+ "Î": 139,
+ "Ï": 140,
+ "Ð": 141,
+ "Ñ": 142,
+ "Ò": 143,
+ "Ó": 144,
+ "Ô": 145,
+ "Õ": 146,
+ "Ö": 147,
+ "×": 148,
+ "Ø": 149,
+ "Ù": 150,
+ "Ú": 151,
+ "Û": 152,
+ "Ü": 153,
+ "Ý": 154,
+ "Þ": 155,
+ "ß": 156,
+ "à": 157,
+ "á": 158,
+ "â": 159,
+ "ã": 160,
+ "ä": 161,
+ "å": 162,
+ "æ": 163,
+ "ç": 164,
+ "è": 165,
+ "é": 166,
+ "ê": 167,
+ "ë": 168,
+ "ì": 169,
+ "í": 170,
+ "î": 171,
+ "ï": 172,
+ "ð": 173,
+ "ñ": 174,
+ "ò": 175,
+ "ó": 176,
+ "ô": 177,
+ "õ": 178,
+ "ö": 179,
+ "÷": 180,
+ "ø": 181,
+ "ù": 182,
+ "ú": 183,
+ "û": 184,
+ "ü": 185,
+ "ý": 186,
+ "þ": 187,
+ "ÿ": 188,
+ "Ā": 189,
+ "ā": 190,
+ "Ă": 191,
+ "ă": 192,
+ "Ą": 193,
+ "ą": 194,
+ "Ć": 195,
+ "ć": 196,
+ "Ĉ": 197,
+ "ĉ": 198,
+ "Ċ": 199,
+ "ċ": 200,
+ "Č": 201,
+ "č": 202,
+ "Ď": 203,
+ "ď": 204,
+ "Đ": 205,
+ "đ": 206,
+ "Ē": 207,
+ "ē": 208,
+ "Ĕ": 209,
+ "ĕ": 210,
+ "Ė": 211,
+ "ė": 212,
+ "Ę": 213,
+ "ę": 214,
+ "Ě": 215,
+ "ě": 216,
+ "Ĝ": 217,
+ "ĝ": 218,
+ "Ğ": 219,
+ "ğ": 220,
+ "Ġ": 221,
+ "ġ": 222,
+ "Ģ": 223,
+ "ģ": 224,
+ "Ĥ": 225,
+ "ĥ": 226,
+ "Ħ": 227,
+ "ħ": 228,
+ "Ĩ": 229,
+ "ĩ": 230,
+ "Ī": 231,
+ "ī": 232,
+ "Ĭ": 233,
+ "ĭ": 234,
+ "Į": 235,
+ "į": 236,
+ "İ": 237,
+ "ı": 238,
+ "Ĳ": 239,
+ "ĳ": 240,
+ "Ĵ": 241,
+ "ĵ": 242,
+ "Ķ": 243,
+ "ķ": 244,
+ "ĸ": 245,
+ "Ĺ": 246,
+ "ĺ": 247,
+ "Ļ": 248,
+ "ļ": 249,
+ "Ľ": 250,
+ "ľ": 251,
+ "Ŀ": 252,
+ "ŀ": 253,
+ "Ł": 254,
+ "ł": 255,
+ "Ń": 256,
+ "aa": 257,
+ "en": 258,
+ "an": 259,
+ "Ġp": 260,
+ "ga": 261,
+ "th": 262,
+ "la": 263,
+ "dh": 264,
+ "ku": 265,
+ "ra": 266,
+ "na": 267,
+ "di": 268,
+ "in": 269,
+ "ir": 270,
+ "Ġm": 271,
+ "Ġs": 272,
+ "Ġe": 273,
+ "dhu": 274,
+ "er": 275,
+ "Ġpan": 276,
+ "kku": 277,
+ "um": 278,
+ "een": 279,
+ "am": 280,
+ "da": 281,
+ "ka": 282,
+ "naa": 283,
+ "dha": 284,
+ "ch": 285,
+ "on": 286,
+ "eenga": 287,
+ "iru": 288,
+ "es": 289,
+ "or": 290,
+ "Ġen": 291,
+ "thu": 292,
+ "pa": 293,
+ "Ġiru": 294,
+ "nga": 295,
+ "al": 296,
+ "du": 297,
+ "ti": 298,
+ "ah": 299,
+ "il": 300,
+ "oo": 301,
+ "ar": 302,
+ "ty": 303,
+ "ki": 304,
+ "Ġn": 305,
+ "Ġmu": 306,
+ "Ġb": 307,
+ "Ġth": 308,
+ "Ġv": 309,
+ "om": 310,
+ "Ġa": 311,
+ "ee": 312,
+ "ma": 313,
+ "tt": 314,
+ "Ġc": 315,
+ "Ġk": 316,
+ "yum": 317,
+ "li": 318,
+ "Ġf": 319,
+ "Ġenna": 320,
+ "yumaa": 321,
+ "ppa": 322,
+ "re": 323,
+ "va": 324,
+ "Ġh": 325,
+ "Ġaa": 326,
+ "kk": 327,
+ "nu": 328,
+ "Ġpanna": 329,
+ "oda": 330,
+ "ram": 331,
+ "el": 332,
+ "Ġnaa": 333,
+ "aan": 334,
+ "ya": 335,
+ "Ġpa": 336,
+ "Ġpo": 337,
+ "Ġmudi": 338,
+ "ppadi": 339,
+ "lan": 340,
+ "à®": 341,
+ "lo": 342,
+ "amil": 343,
+ "Ġirukk": 344,
+ "Ġmudiyumaa": 345,
+ "ea": 346,
+ "kka": 347,
+ "lu": 348,
+ "Ġka": 349,
+ "end": 350,
+ "radhu": 351,
+ "nai": 352,
+ "idi": 353,
+ "un": 354,
+ "tim": 355,
+ "ru": 356,
+ "unga": 357,
+ "van": 358,
+ "ing": 359,
+ "Ġeppadi": 360,
+ "st": 361,
+ "ur": 362,
+ "ve": 363,
+ "yaa": 364,
+ "Ġt": 365,
+ "Ġsaa": 366,
+ "eri": 367,
+ "Ġpann": 368,
+ "Ġneenga": 369,
+ "lla": 370,
+ "sh": 371,
+ "Ġw": 372,
+ "Ġre": 373,
+ "ent": 374,
+ "tha": 375,
+ "Ġev": 376,
+ "est": 377,
+ "time": 378,
+ "gu": 379,
+ "ig": 380,
+ "ri": 381,
+ "Ġin": 382,
+ "aanga": 383,
+ "Ġpidi": 384,
+ "si": 385,
+ "Ġse": 386,
+ "Ġirukku": 387,
+ "Ġevlo": 388,
+ "ce": 389,
+ "com": 390,
+ "ht": 391,
+ "le": 392,
+ "se": 393,
+ "ani": 394,
+ "Ġpaa": 395,
+ "day": 396,
+ "lukku": 397,
+ "Tamil": 398,
+ "laa": 399
+ },
+ "merges": [
+ [
+ "a",
+ "a"
+ ],
+ [
+ "e",
+ "n"
+ ],
+ [
+ "a",
+ "n"
+ ],
+ [
+ "Ġ",
+ "p"
+ ],
+ [
+ "g",
+ "a"
+ ],
+ [
+ "t",
+ "h"
+ ],
+ [
+ "l",
+ "a"
+ ],
+ [
+ "d",
+ "h"
+ ],
+ [
+ "k",
+ "u"
+ ],
+ [
+ "r",
+ "a"
+ ],
+ [
+ "n",
+ "a"
+ ],
+ [
+ "d",
+ "i"
+ ],
+ [
+ "i",
+ "n"
+ ],
+ [
+ "i",
+ "r"
+ ],
+ [
+ "Ġ",
+ "m"
+ ],
+ [
+ "Ġ",
+ "s"
+ ],
+ [
+ "Ġ",
+ "e"
+ ],
+ [
+ "dh",
+ "u"
+ ],
+ [
+ "e",
+ "r"
+ ],
+ [
+ "Ġp",
+ "an"
+ ],
+ [
+ "k",
+ "ku"
+ ],
+ [
+ "u",
+ "m"
+ ],
+ [
+ "e",
+ "en"
+ ],
+ [
+ "a",
+ "m"
+ ],
+ [
+ "d",
+ "a"
+ ],
+ [
+ "k",
+ "a"
+ ],
+ [
+ "n",
+ "aa"
+ ],
+ [
+ "dh",
+ "a"
+ ],
+ [
+ "c",
+ "h"
+ ],
+ [
+ "o",
+ "n"
+ ],
+ [
+ "een",
+ "ga"
+ ],
+ [
+ "ir",
+ "u"
+ ],
+ [
+ "e",
+ "s"
+ ],
+ [
+ "o",
+ "r"
+ ],
+ [
+ "Ġ",
+ "en"
+ ],
+ [
+ "th",
+ "u"
+ ],
+ [
+ "p",
+ "a"
+ ],
+ [
+ "Ġ",
+ "iru"
+ ],
+ [
+ "n",
+ "ga"
+ ],
+ [
+ "a",
+ "l"
+ ],
+ [
+ "d",
+ "u"
+ ],
+ [
+ "t",
+ "i"
+ ],
+ [
+ "a",
+ "h"
+ ],
+ [
+ "i",
+ "l"
+ ],
+ [
+ "o",
+ "o"
+ ],
+ [
+ "a",
+ "r"
+ ],
+ [
+ "t",
+ "y"
+ ],
+ [
+ "k",
+ "i"
+ ],
+ [
+ "Ġ",
+ "n"
+ ],
+ [
+ "Ġm",
+ "u"
+ ],
+ [
+ "Ġ",
+ "b"
+ ],
+ [
+ "Ġ",
+ "th"
+ ],
+ [
+ "Ġ",
+ "v"
+ ],
+ [
+ "o",
+ "m"
+ ],
+ [
+ "Ġ",
+ "a"
+ ],
+ [
+ "e",
+ "e"
+ ],
+ [
+ "m",
+ "a"
+ ],
+ [
+ "t",
+ "t"
+ ],
+ [
+ "Ġ",
+ "c"
+ ],
+ [
+ "Ġ",
+ "k"
+ ],
+ [
+ "y",
+ "um"
+ ],
+ [
+ "l",
+ "i"
+ ],
+ [
+ "Ġ",
+ "f"
+ ],
+ [
+ "Ġen",
+ "na"
+ ],
+ [
+ "yum",
+ "aa"
+ ],
+ [
+ "p",
+ "pa"
+ ],
+ [
+ "r",
+ "e"
+ ],
+ [
+ "v",
+ "a"
+ ],
+ [
+ "Ġ",
+ "h"
+ ],
+ [
+ "Ġ",
+ "aa"
+ ],
+ [
+ "k",
+ "k"
+ ],
+ [
+ "n",
+ "u"
+ ],
+ [
+ "Ġpan",
+ "na"
+ ],
+ [
+ "o",
+ "da"
+ ],
+ [
+ "ra",
+ "m"
+ ],
+ [
+ "e",
+ "l"
+ ],
+ [
+ "Ġ",
+ "naa"
+ ],
+ [
+ "aa",
+ "n"
+ ],
+ [
+ "y",
+ "a"
+ ],
+ [
+ "Ġp",
+ "a"
+ ],
+ [
+ "Ġp",
+ "o"
+ ],
+ [
+ "Ġmu",
+ "di"
+ ],
+ [
+ "ppa",
+ "di"
+ ],
+ [
+ "l",
+ "an"
+ ],
+ [
+ "à",
+ "®"
+ ],
+ [
+ "l",
+ "o"
+ ],
+ [
+ "am",
+ "il"
+ ],
+ [
+ "Ġiru",
+ "kk"
+ ],
+ [
+ "Ġmudi",
+ "yumaa"
+ ],
+ [
+ "e",
+ "a"
+ ],
+ [
+ "k",
+ "ka"
+ ],
+ [
+ "l",
+ "u"
+ ],
+ [
+ "Ġ",
+ "ka"
+ ],
+ [
+ "en",
+ "d"
+ ],
+ [
+ "ra",
+ "dhu"
+ ],
+ [
+ "na",
+ "i"
+ ],
+ [
+ "i",
+ "di"
+ ],
+ [
+ "u",
+ "n"
+ ],
+ [
+ "ti",
+ "m"
+ ],
+ [
+ "r",
+ "u"
+ ],
+ [
+ "u",
+ "nga"
+ ],
+ [
+ "v",
+ "an"
+ ],
+ [
+ "in",
+ "g"
+ ],
+ [
+ "Ġe",
+ "ppadi"
+ ],
+ [
+ "s",
+ "t"
+ ],
+ [
+ "u",
+ "r"
+ ],
+ [
+ "v",
+ "e"
+ ],
+ [
+ "y",
+ "aa"
+ ],
+ [
+ "Ġ",
+ "t"
+ ],
+ [
+ "Ġs",
+ "aa"
+ ],
+ [
+ "er",
+ "i"
+ ],
+ [
+ "Ġpan",
+ "n"
+ ],
+ [
+ "Ġn",
+ "eenga"
+ ],
+ [
+ "l",
+ "la"
+ ],
+ [
+ "s",
+ "h"
+ ],
+ [
+ "Ġ",
+ "w"
+ ],
+ [
+ "Ġ",
+ "re"
+ ],
+ [
+ "en",
+ "t"
+ ],
+ [
+ "th",
+ "a"
+ ],
+ [
+ "Ġe",
+ "v"
+ ],
+ [
+ "es",
+ "t"
+ ],
+ [
+ "tim",
+ "e"
+ ],
+ [
+ "g",
+ "u"
+ ],
+ [
+ "i",
+ "g"
+ ],
+ [
+ "r",
+ "i"
+ ],
+ [
+ "Ġ",
+ "in"
+ ],
+ [
+ "aa",
+ "nga"
+ ],
+ [
+ "Ġp",
+ "idi"
+ ],
+ [
+ "s",
+ "i"
+ ],
+ [
+ "Ġs",
+ "e"
+ ],
+ [
+ "Ġiru",
+ "kku"
+ ],
+ [
+ "Ġev",
+ "lo"
+ ],
+ [
+ "c",
+ "e"
+ ],
+ [
+ "c",
+ "om"
+ ],
+ [
+ "h",
+ "t"
+ ],
+ [
+ "l",
+ "e"
+ ],
+ [
+ "s",
+ "e"
+ ],
+ [
+ "an",
+ "i"
+ ],
+ [
+ "Ġp",
+ "aa"
+ ],
+ [
+ "da",
+ "y"
+ ],
+ [
+ "lu",
+ "kku"
+ ],
+ [
+ "T",
+ "amil"
+ ],
+ [
+ "l",
+ "aa"
+ ]
+ ]
+ }
+ }
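The single-character entries at ids 1 to 256 in `tokenizer.json` are consistent with the byte-to-unicode table used by GPT-2 style byte-level BPE, where each of the 256 byte values gets a printable stand-in character. The sketch below is a reimplementation for illustration only, assuming this tokenizer uses that same table. The function and variable names are not taken from this commit.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to one printable stand-in character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (control characters, space, 0x7F-0xA0, 0xAD)
    # is shifted to code points 256, 257, ... in increasing byte order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


table = bytes_to_unicode()

# The space byte becomes the marker seen at the start of many merged tokens.
space_marker = table[ord(" ")]

# A Tamil letter (here U+0BA4) is three UTF-8 bytes, so it appears as
# three stand-in characters before any merges are applied.
tamil_ta = "".join(table[b] for b in "\u0ba4".encode("utf-8"))

# Sorting the 256 stand-ins and counting from 1 (id 0 is the special token)
# reproduces the ids of the single-character vocabulary entries.
alphabet = sorted(table.values())
space_id = alphabet.index(space_marker) + 1
```

Under these assumptions `space_marker` is `Ġ` and `space_id` is 221, matching the vocabulary entry. `tamil_ta` begins with `à®`, the pair merged into token 341, which is the only merge in this file built from Tamil-script bytes. The remaining merges cover Latin-script text.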
tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<|endoftext|>",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|endoftext|>",
+ "extra_special_tokens": {},
+ "model_max_length": 1024,
+ "tokenizer_class": "GPT2Tokenizer",
+ "unk_token": "<|endoftext|>"
+ }
vocab.json ADDED
@@ -0,0 +1 @@
+ {"<|endoftext|>":0,"!":1,"\"":2,"#":3,"$":4,"%":5,"&":6,"'":7,"(":8,")":9,"*":10,"+":11,",":12,"-":13,".":14,"/":15,"0":16,"1":17,"2":18,"3":19,"4":20,"5":21,"6":22,"7":23,"8":24,"9":25,":":26,";":27,"<":28,"=":29,">":30,"?":31,"@":32,"A":33,"B":34,"C":35,"D":36,"E":37,"F":38,"G":39,"H":40,"I":41,"J":42,"K":43,"L":44,"M":45,"N":46,"O":47,"P":48,"Q":49,"R":50,"S":51,"T":52,"U":53,"V":54,"W":55,"X":56,"Y":57,"Z":58,"[":59,"\\":60,"]":61,"^":62,"_":63,"`":64,"a":65,"b":66,"c":67,"d":68,"e":69,"f":70,"g":71,"h":72,"i":73,"j":74,"k":75,"l":76,"m":77,"n":78,"o":79,"p":80,"q":81,"r":82,"s":83,"t":84,"u":85,"v":86,"w":87,"x":88,"y":89,"z":90,"{":91,"|":92,"}":93,"~":94,"¡":95,"¢":96,"£":97,"¤":98,"¥":99,"¦":100,"§":101,"¨":102,"©":103,"ª":104,"«":105,"¬":106,"®":107,"¯":108,"°":109,"±":110,"²":111,"³":112,"´":113,"µ":114,"¶":115,"·":116,"¸":117,"¹":118,"º":119,"»":120,"¼":121,"½":122,"¾":123,"¿":124,"À":125,"Á":126,"Â":127,"Ã":128,"Ä":129,"Å":130,"Æ":131,"Ç":132,"È":133,"É":134,"Ê":135,"Ë":136,"Ì":137,"Í":138,"Î":139,"Ï":140,"Ð":141,"Ñ":142,"Ò":143,"Ó":144,"Ô":145,"Õ":146,"Ö":147,"×":148,"Ø":149,"Ù":150,"Ú":151,"Û":152,"Ü":153,"Ý":154,"Þ":155,"ß":156,"à":157,"á":158,"â":159,"ã":160,"ä":161,"å":162,"æ":163,"ç":164,"è":165,"é":166,"ê":167,"ë":168,"ì":169,"í":170,"î":171,"ï":172,"ð":173,"ñ":174,"ò":175,"ó":176,"ô":177,"õ":178,"ö":179,"÷":180,"ø":181,"ù":182,"ú":183,"û":184,"ü":185,"ý":186,"þ":187,"ÿ":188,"Ā":189,"ā":190,"Ă":191,"ă":192,"Ą":193,"ą":194,"Ć":195,"ć":196,"Ĉ":197,"ĉ":198,"Ċ":199,"ċ":200,"Č":201,"č":202,"Ď":203,"ď":204,"Đ":205,"đ":206,"Ē":207,"ē":208,"Ĕ":209,"ĕ":210,"Ė":211,"ė":212,"Ę":213,"ę":214,"Ě":215,"ě":216,"Ĝ":217,"ĝ":218,"Ğ":219,"ğ":220,"Ġ":221,"ġ":222,"Ģ":223,"ģ":224,"Ĥ":225,"ĥ":226,"Ħ":227,"ħ":228,"Ĩ":229,"ĩ":230,"Ī":231,"ī":232,"Ĭ":233,"ĭ":234,"Į":235,"į":236,"İ":237,"ı":238,"Ĳ":239,"ĳ":240,"Ĵ":241,"ĵ":242,"Ķ":243,"ķ":244,"ĸ":245,"Ĺ":246,"ĺ":247,"Ļ":248,"ļ":249,"Ľ":250,"ľ":251,"Ŀ":252,"ŀ":253,"Ł":254,"ł":255,"Ń":256,"aa":257,"en":258,"an":259,"Ġp":260,"ga":261,"th":262,"la":263,"dh":264,"ku":265,"ra":266,"na":267,"di":268,"in":269,"ir":270,"Ġm":271,"Ġs":272,"Ġe":273,"dhu":274,"er":275,"Ġpan":276,"kku":277,"um":278,"een":279,"am":280,"da":281,"ka":282,"naa":283,"dha":284,"ch":285,"on":286,"eenga":287,"iru":288,"es":289,"or":290,"Ġen":291,"thu":292,"pa":293,"Ġiru":294,"nga":295,"al":296,"du":297,"ti":298,"ah":299,"il":300,"oo":301,"ar":302,"ty":303,"ki":304,"Ġn":305,"Ġmu":306,"Ġb":307,"Ġth":308,"Ġv":309,"om":310,"Ġa":311,"ee":312,"ma":313,"tt":314,"Ġc":315,"Ġk":316,"yum":317,"li":318,"Ġf":319,"Ġenna":320,"yumaa":321,"ppa":322,"re":323,"va":324,"Ġh":325,"Ġaa":326,"kk":327,"nu":328,"Ġpanna":329,"oda":330,"ram":331,"el":332,"Ġnaa":333,"aan":334,"ya":335,"Ġpa":336,"Ġpo":337,"Ġmudi":338,"ppadi":339,"lan":340,"à®":341,"lo":342,"amil":343,"Ġirukk":344,"Ġmudiyumaa":345,"ea":346,"kka":347,"lu":348,"Ġka":349,"end":350,"radhu":351,"nai":352,"idi":353,"un":354,"tim":355,"ru":356,"unga":357,"van":358,"ing":359,"Ġeppadi":360,"st":361,"ur":362,"ve":363,"yaa":364,"Ġt":365,"Ġsaa":366,"eri":367,"Ġpann":368,"Ġneenga":369,"lla":370,"sh":371,"Ġw":372,"Ġre":373,"ent":374,"tha":375,"Ġev":376,"est":377,"time":378,"gu":379,"ig":380,"ri":381,"Ġin":382,"aanga":383,"Ġpidi":384,"si":385,"Ġse":386,"Ġirukku":387,"Ġevlo":388,"ce":389,"com":390,"ht":391,"le":392,"se":393,"ani":394,"Ġpaa":395,"day":396,"lukku":397,"Tamil":398,"laa":399}
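Decoding runs the vocabulary lookup in reverse: ids map back to token strings, the strings are concatenated, and each stand-in character is converted back to its byte before UTF-8 decoding. The sketch below assumes the GPT-2 style byte table and uses three entries copied from `vocab.json`. `ID_TO_TOKEN`, `byte_decoder` and `decode` are illustrative names, not part of this commit, and special tokens are left out for brevity.

```python
# Three entries copied from vocab.json, inverted to id -> token.
ID_TO_TOKEN = {329: "Ġpanna", 369: "Ġneenga", 398: "Tamil"}


def byte_decoder():
    """Map each stand-in character back to its byte value (assumed GPT-2 style table)."""
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    table, shift = {}, 0
    for b in range(256):
        if b in printable:
            table[chr(b)] = b
        else:
            # Non-printable bytes were shifted to 256, 257, ... when encoding.
            table[chr(256 + shift)] = b
            shift += 1
    return table


def decode(ids):
    """Turn token ids back into text via the byte table."""
    table = byte_decoder()
    text = "".join(ID_TO_TOKEN[i] for i in ids)
    return bytes(table[ch] for ch in text).decode("utf-8")


restored = decode([398, 329])
```

Here `restored` is `Tamil panna`: the `Ġ` at the start of token 329 turns back into the space that separated the two words.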