|
# BPE-32k SlimPajama-3M |
|
A BPE tokeniser with a vocabulary size of 32768, trained on the first 3 million examples of [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
|
|
|
## Tokeniser details |
|
BPE trainer implementation (a minimal invocation sketch follows this list):
|
- Back-end: [SentencePiece](https://github.com/google/sentencepiece)'s `SentencePieceTrainer`. |
|
- Front-end: [TkTkT](https://github.com/bauwenst/TkTkT)'s [`BPEVocabulariser`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/models/bpe/vocabularisation.py#L210).
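
For orientation, the snippet below is a rough sketch of what the underlying `SentencePieceTrainer` call looks like for a 32768-token BPE vocabulary. In this repository the call is configured and issued by TkTkT's `BPEVocabulariser` rather than written by hand, and the corpus file name below is a hypothetical placeholder.

```python
import sentencepiece as spm

# Hedged sketch: a direct SentencePieceTrainer invocation for a 32k BPE vocabulary.
# The actual run goes through TkTkT's BPEVocabulariser front-end; the input file
# name here is a placeholder for the preprocessed corpus.
spm.SentencePieceTrainer.train(
    input="slimpajama_3m_preprocessed.txt",  # one preprocessed document per line
    model_prefix="bpe32k_slimpajama3m",
    model_type="bpe",
    vocab_size=32768,
)
```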
|
|
|
Preprocessor: |
|
- During training: TkTkT's [`SentencePiecePreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L181) |
|
- During inference: TkTkT's [`ModernEnglishPreprocessor`](https://github.com/bauwenst/TkTkT/blob/341ae85980a5a9a2d60dbdc88645f8828b5c3a06/src/tktkt/preparation/instances.py#L105) |
|
   1. NFKC normalisation

   2. Punctuation splitter, whitespace splitter, English contraction splitter

   3. GPT-2's pseudo-byte mapping (illustrated in the sketch after this list)

   4. Start-of-word marker `Ġ`

   5. Digit and hyphen isolation
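
To make steps 3 and 4 concrete, the snippet below reproduces GPT-2's published byte-to-pseudo-byte table and applies it to a single word, prepending the `Ġ` start-of-word marker. This is an illustrative sketch only; the `to_pseudo_bytes` helper is hypothetical and is not how TkTkT names or wires these steps internally.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2's pseudo-byte table: map every byte 0..255 to a printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift non-printable bytes into a printable Unicode range
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_MAP = bytes_to_unicode()

def to_pseudo_bytes(word: str, start_of_word: bool = True) -> str:
    """Hypothetical helper: UTF-8-encode a word, map each byte, prepend the Ġ marker."""
    mapped = "".join(BYTE_MAP[b] for b in word.encode("utf-8"))
    return ("Ġ" + mapped) if start_of_word else mapped

print(to_pseudo_bytes("café"))  # ĠcafÃ©
```

Mapping raw bytes to printable pseudo-characters keeps the vocabulary free of control bytes while still allowing arbitrary UTF-8 input to be encoded.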
|
|
|
## Training details |
|
**Time:** 3h10m |
|
- Preprocessing and counting the 3M corpus: 2h45m |
|
- BPE merges: 25m |
|
|
|
**Memory:** 33.42 GiB peak usage. |
|
|
|
**Data sizes:** |
|
- Examples considered: 3 000 000 |
|
- Examples used: 2 609 893 (390 107 examples dropped for being longer than 8192 characters; see the sketch after this list).
|
- Characters counted: 6 685 212 190 |
|
- Unique words after whitespace splitting: 9 254 839 |
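
A hedged sketch of that selection step, assuming the examples are streamed from the Hub and that the document text lives in a `"text"` field (both the field name and the use of `datasets` streaming are assumptions, not taken from the training code):

```python
from itertools import islice
from datasets import load_dataset

# Stream the first 3 million SlimPajama examples and drop those longer than 8192 characters.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

kept, dropped = 0, 0
for example in islice(stream, 3_000_000):
    text = example["text"]          # assumed field name
    if len(text) > 8192:
        dropped += 1
        continue
    kept += 1
    # ... hand `text` to the preprocessor and word counter ...

print(kept, dropped)  # per the statistics above: 2 609 893 kept, 390 107 dropped
```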
|
|