Papers
arxiv:2410.24050

Clustering Head: A Visual Case Study of the Training Dynamics in Transformers

Published on Oct 31, 2024
Authors:
,

Abstract

This paper introduces the sparse modular addition task and examines how transformers learn it. We focus on transformers with embeddings in R^2 and introduce a visual sandbox that provides comprehensive visualizations of each layer throughout the training process. We reveal a type of circuit, called "clustering heads," which learns the problem's invariants. We analyze the training dynamics of these circuits, highlighting two-stage learning, loss spikes due to high curvature or normalization layers, and the effects of initialization and curriculum learning.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.24050 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.24050 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.24050 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.