Bigger AI models may work better because meanings collide less inside them
Overlapping token vectors show how superposition can pack more meaning into wider language models. 📷 AI-generated / Tech&Space
- The work links scaling laws to strong superposition, not only to rare-token distributions
- The analysis covers output layers of models such as OPT, GPT-2, Qwen2.5, and Pythia
- Wider models reduce interference noise, but overlap makes interpretability harder
The Decoder reports that an MIT study offers a mechanistic explanation for one of modern AI’s most stubborn facts: larger language models often improve in a clean, predictable way. On this reading, the scaling law is no longer only an empirical curve; it becomes a clue about how models organize meaning.
The key term is superposition. A language model has to represent far more tokens and concepts than it has clean, independent dimensions. Instead of giving every concept its own drawer, many concepts share the same internal space.
A weaker form of superposition would have the model cleanly store only the most common concepts while rare ones fall away. The MIT work, as summarized by The Decoder, points to the strong form: models represent all tokens, at the cost of controlled noise caused by the packed representations.
Larger models do not win only by memorizing more; wider representations reduce the noise between overlapping meanings.
Scaling curves and compressed concept vectors connect model width with lower interference noise. 📷 AI-generated / Tech&Space
Why does a bigger model help? Because a wider internal space reduces interference. In strong superposition, the error does not mainly come from missing concepts. It comes from too many concepts overlapping.
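The link between width and interference can be illustrated with a toy experiment. The sketch below is not the study's method: it assumes token vectors are random unit directions, and the names `random_unit_vector` and `mean_squared_overlap` are hypothetical helpers made up for this illustration. Under that assumption, the average squared overlap between two token vectors is expected to be close to 1 divided by the width, so a wider space means less noise per pair of tokens.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(num_tokens, dim, seed=0):
    """Average squared dot product over all distinct pairs of token vectors.

    Assumption: tokens are modeled as random unit vectors, a stand-in for
    the packed representations described in the article.
    """
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_tokens)]
    total, pairs = 0.0, 0
    for i in range(num_tokens):
        for j in range(i + 1, num_tokens):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Same number of tokens, increasing width: overlap should shrink.
    for dim in (16, 64, 256):
        overlap = mean_squared_overlap(num_tokens=120, dim=dim)
        print(f"width={dim:4d}  mean squared overlap={overlap:.4f}  1/width={1 / dim:.4f}")
```

The number of tokens stays fixed while only the width changes, which mirrors the article's point that the error comes from crowding rather than from missing concepts.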
The authors reportedly examined output layers in models including OPT, GPT-2, Qwen2.5, and Pythia. The result matters because it connects the abstract scaling curve to the model’s internal geometry.
The boundary is just as interesting. If a model becomes wide enough to represent every token without overlap, the power law should weaken because the source of the noise has disappeared.
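The boundary case can be shown with an equally small toy. This is a deliberately crude sketch under stated assumptions, not the paper's analysis: `one_hot_vectors` and `max_abs_overlap` are hypothetical helpers, and tokens are simply assigned to axes in order, reusing axes once the tokens outnumber the dimensions.

```python
def one_hot_vectors(num_tokens, dim):
    """Give token i the axis i % dim; axes are reused once tokens outnumber dimensions."""
    return [
        [1.0 if k == i % dim else 0.0 for k in range(dim)]
        for i in range(num_tokens)
    ]


def max_abs_overlap(vectors):
    """Largest absolute dot product between any two distinct vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


if __name__ == "__main__":
    # Enough room: every token gets its own axis, so nothing overlaps.
    print(max_abs_overlap(one_hot_vectors(num_tokens=8, dim=8)))  # prints 0.0
    # Too narrow: tokens are forced to share axes and collide.
    print(max_abs_overlap(one_hot_vectors(num_tokens=8, dim=4)))  # prints 1.0
```

Once the width matches the number of tokens, the overlap term vanishes entirely, which is why further widening would no longer buy the same improvement.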
The less comfortable implication is interpretability. The denser the model packs meaning, the harder it becomes to trace what is happening inside. Superposition may explain why scaling works, but it also explains why the model becomes less readable.

