floom's Collections
Model Training
Rethinking Optimization and Architecture for Tiny Language Models
Paper • 2402.02791 • Published • 12
More Agents Is All You Need
Paper • 2402.05120 • Published • 51
Scaling Laws for Forgetting When Fine-Tuning Large Language Models
Paper • 2401.05605 • Published
Aligning Large Language Models with Counterfactual DPO
Paper • 2401.09566 • Published • 2
BitDelta: Your Fine-Tune May Only Be Worth One Bit
Paper • 2402.10193 • Published • 19
Instruction-tuned Language Models are Better Knowledge Learners
Paper • 2402.12847 • Published • 25
V-STaR: Training Verifiers for Self-Taught Reasoners
Paper • 2402.06457 • Published • 9
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Paper • 2403.06504 • Published • 53
Language models scale reliably with over-training and on downstream tasks
Paper • 2403.08540 • Published • 14
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Paper • 2403.13372 • Published • 62
RAFT: Adapting Language Model to Domain Specific RAG
Paper • 2403.10131 • Published • 67
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 75
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Paper • 2403.08763 • Published • 49
Gemma: Open Models Based on Gemini Research and Technology
Paper • 2403.08295 • Published • 47
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Paper • 2403.05530 • Published • 61
Teaching Large Language Models to Reason with Reinforcement Learning
Paper • 2403.04642 • Published • 46
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Paper • 2403.03507 • Published • 183
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Paper • 2403.02884 • Published • 15
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 606
Nemotron-4 15B Technical Report
Paper • 2402.16819 • Published • 42
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Paper • 2402.15627 • Published • 34
Tele-FLM Technical Report
Paper • 2404.16645 • Published • 17
Paper • 2405.15682 • Published • 21
Xwin-LM: Strong and Scalable Alignment Practice for LLMs
Paper • 2405.20335 • Published • 18
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Paper • 2405.20541 • Published • 22
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Paper • 2407.08296 • Published • 31
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Paper • 2408.13359 • Published • 23