Pretraining Using HF Tokenizers and Transformers

#36
by akhooli - opened

I am looking for an end-to-end example of either pretraining a fresh ModernBERT model, including the tokenizer (e.g. for a new language), or fine-tuning an existing checkpoint (e.g. ModernBERT-Base) with a custom tokenizer (to account for the different vocabulary of another language family).
A Hugging Face implementation is preferred (I saw this, but the current code is not working).

Hello,

The pre-training codebase should do the trick: pre-training is its main purpose, and it is optimized for it. Although it uses Composer, you should be able to leverage HF models and tokenizers.
For continued pre-training, someone reported having issues loading the ModernBERT weights, so we will investigate and potentially release Composer checkpoints alongside the HF ones when we release all the pre-training checkpoints (which, as stated in the issue, should be better starting points than the post-decay ones).

Thanks. I had another look at the repo and noticed that FlexBert uses the old bert-base tokenizer. I guess I should wait a bit, as the HF way of doing it may require some additional tweaks (e.g. issue 163).
Update: I got inspiration from this discussion and trained a tiny model.
