Post 1821
We distill a more accurate and concise dataset from DeepSeek R1, and also provide a code repository for the distillation pipeline.
Dataset: SmallDoge/SmallThoughts
Code: https://github.com/SmallDoges/small-thoughts
Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
Paper • 2412.11834 • Published Dec 16, 2024
Post 2077
Pre-training on only a single RTX 4090 is really slow, even for small language models!!! (https://huggingface.co/collections/JingzeShi/doge-slm-677fd879f8c4fd0f43e05458)
Post 1714
Warmup -> stable -> decay learning rate scheduler: use the stable-phase checkpoints to continue training the model on any new dataset without training spikes!!!
SmallDoge/Doge-20M-checkpoint
SmallDoge/Doge-60M-checkpoint
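For readers who have not used a warmup -> stable -> decay schedule, here is a minimal sketch of the idea in plain Python. The function name `wsd_lr`, the linear warmup, the linear decay, and all step counts are illustrative assumptions, not the settings used for the Doge checkpoints. The learning rate stays flat through the stable phase, so a checkpoint saved there can be resumed at the same rate instead of being pushed back up from a decayed one.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup -> stable -> decay schedule.

    Hypothetical helper for illustration; the phase lengths and the linear
    warmup/decay shapes are assumptions, not the authors' configuration.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable phase: hold the learning rate constant at peak_lr.
        return peak_lr
    if decay_steps <= 0:
        # No decay phase configured: drop straight to min_lr.
        return min_lr
    # Decay phase: move linearly from peak_lr down to min_lr, then stay there.
    decay_step = step - warmup_steps - stable_steps
    progress = min(decay_step / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Print the schedule at a few steps of a toy 130-step run.
    for s in (0, 9, 50, 109, 120, 130):
        print(s, wsd_lr(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Continuing training from a stable-phase checkpoint then amounts to resuming with the learning rate still at `peak_lr` on the new dataset and running the decay phase only at the very end.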