Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer Paper • 2503.02495 • Published 6 days ago • 7
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections Paper • 2502.12170 • Published 25 days ago • 12
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling Paper • 2501.16975 • Published Jan 28 • 26
Improving Transformers with Dynamically Composable Multi-Head Attention Paper • 2405.08553 • Published May 14, 2024 • 1