# ULM-32k SlimPajama-3M

ULM tokeniser with vocabulary size 32768, trained on the first 3 million examples in [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

## Tokeniser details

ULM trainer implementation:
- Back-end: [SentencePiece](https://github.com/google/sentencepiece)'s `SentencePieceTrainer`.
- Front-end: [TkTkT](https://github.com/bauwenst/TkTkT)'s [`KudoPieceTrainer`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/models/kudopiece/vocabularisation.py#L40)

Preprocessor:
- During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181)
- During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105)
    1. NFKC normalisation
    2. Punctuation splitter, whitespace splitter, English contraction splitter
    3. GPT-2's pseudo-byte mapping
    4. Start-of-word marker `Ġ`
    5. Digit and hyphen isolation

## Training details

**Time:** 3h40m
- Preprocessing and counting the 3M corpus: 2h45m
- ULM algorithm: 55m

**Memory:** 257 GiB peak usage (i.e. about 80 GiB RAM per million sentences).

**Data sizes:**
- Examples considered: 3 000 000
- Examples used: 2 609 893 (390 107 examples dropped for being > 8192 characters).
- Characters counted: 6 685 212 190
- Unique words after whitespace splitting: 9 254 839
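
## Preprocessing sketch

To give a feel for the inference-time preprocessing steps listed under "Tokeniser details", here is a minimal standard-library sketch. It is not TkTkT's `ModernEnglishPreprocessor`: the names `gpt2_byte_map` and `preprocess` are hypothetical, the punctuation and contraction splitters are left out, and the exact order and interaction of the steps in TkTkT may differ. The byte mapping follows the well-known GPT-2 scheme, in which the space byte maps to `Ġ`.

```python
import re
import unicodedata


def gpt2_byte_map() -> dict[int, str]:
    """GPT-2-style pseudo-byte mapping: each of the 256 byte values gets a printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, map(chr, code_points)))


BYTE_MAP = gpt2_byte_map()
MARKER = "Ġ"  # start-of-word marker; also the image of the space byte in the mapping above


def preprocess(text: str) -> list[str]:
    """Simplified illustration only: assumes whitespace-delimited words, no punctuation/contraction splitting."""
    text = unicodedata.normalize("NFKC", text)  # step 1: NFKC normalisation
    pretokens: list[str] = []
    for word in text.split():  # step 2 (whitespace splitter only)
        mapped = "".join(BYTE_MAP[b] for b in word.encode("utf-8"))  # step 3: pseudo-bytes
        marked = MARKER + mapped  # step 4: start-of-word marker
        # Step 5: every digit and every hyphen becomes its own pretoken.
        pretokens.extend(re.findall(r"[0-9]|-|[^0-9-]+", marked))
    return pretokens


if __name__ == "__main__":
    print(preprocess("fine-tuned 32k"))
```

Under these assumptions, digits and hyphens never merge with neighbouring characters, so the marker of a word that starts with a digit ends up as a pretoken of its own.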