|
# BPE-32k SlimPajama-3M |
|
A BPE tokeniser with a vocabulary size of 32768, trained on the first 3 million examples of [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
|
|
|
## Tokeniser details |
|
BPE trainer implementation (a minimal invocation sketch follows this list):
|
- Back-end: [SentencePiece](https://github.com/google/sentencepiece)'s `SentencePieceTrainer`. |
|
- Front-end: [TkTkT](https://github.com/bauwenst/TkTkT)'s [`BPEVocabulariser`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/models/bpe/vocabularisation.py#L210).
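
For orientation, the snippet below is a rough sketch of what the underlying `SentencePieceTrainer` call looks like for a 32768-token BPE vocabulary. In this repository the call is configured and issued by TkTkT's `BPEVocabulariser` rather than written by hand, and the corpus file name below is a hypothetical placeholder.

```python
import sentencepiece as spm

# Hedged sketch: a direct SentencePieceTrainer invocation for a 32k BPE vocabulary.
# The actual run goes through TkTkT's BPEVocabulariser front-end; the input file
# name here is a placeholder for the preprocessed corpus.
spm.SentencePieceTrainer.train(
    input="slimpajama_3m_preprocessed.txt",  # one preprocessed document per line
    model_prefix="bpe32k_slimpajama3m",
    model_type="bpe",
    vocab_size=32768,
)
```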
|
|
|
Preprocessor: |
|
- During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181) |
|
- During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105) |
|
   1. NFKC normalisation

   2. Punctuation splitter, whitespace splitter, English contraction splitter

   3. GPT-2's pseudo-byte mapping (illustrated in the sketch after this list)

   4. Start-of-word marker `Ġ`

   5. Digit and hyphen isolation
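
To make steps 3 and 4 concrete, the snippet below reproduces GPT-2's published byte-to-pseudo-byte table and applies it to a single word, prepending the `Ġ` start-of-word marker. This is an illustrative sketch only; the `to_pseudo_bytes` helper is hypothetical and is not how TkTkT names or wires these steps internally.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2's pseudo-byte table: map every byte 0..255 to a printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift non-printable bytes into a printable Unicode range
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_MAP = bytes_to_unicode()

def to_pseudo_bytes(word: str, start_of_word: bool = True) -> str:
    """Hypothetical helper: UTF-8-encode a word, map each byte, prepend the Ġ marker."""
    mapped = "".join(BYTE_MAP[b] for b in word.encode("utf-8"))
    return ("Ġ" + mapped) if start_of_word else mapped

print(to_pseudo_bytes("café"))  # ĠcafÃ©
```

Mapping raw bytes to printable pseudo-characters keeps the vocabulary free of control bytes while still allowing arbitrary UTF-8 input to be encoded.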
|
|
|
## Training details |
|
**Time:** 3h10m |
|
- Preprocessing and counting the 3M corpus: 2h45m |
|
- BPE merges: 25m |
|
|
|
**Memory:** 33.42 GiB peak usage. |
|
|
|
**Data sizes:** |
|
- Examples considered: 3 000 000 |
|
- Examples used: 2 609 893 (390 107 examples dropped for being longer than 8192 characters; see the sketch after this list).
|
- Characters counted: 6 685 212 190 |
|
- Unique words after whitespace splitting: 9 254 839 |
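
A hedged sketch of that selection step, assuming the examples are streamed from the Hub and that the document text lives in a `"text"` field (both the field name and the use of `datasets` streaming are assumptions, not taken from the training code):

```python
from itertools import islice
from datasets import load_dataset

# Stream the first 3 million SlimPajama examples and drop those longer than 8192 characters.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

kept, dropped = 0, 0
for example in islice(stream, 3_000_000):
    text = example["text"]          # assumed field name
    if len(text) > 8192:
        dropped += 1
        continue
    kept += 1
    # ... hand `text` to the preprocessor and word counter ...

print(kept, dropped)  # per the statistics above: 2 609 893 kept, 390 107 dropped
```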
|
|