Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?
Abstract
Research on large language models reveals an asymmetric spectral scaling law in feed-forward networks, indicating that increasing width primarily adds low-energy directions while dominant modes saturate early, leading to underutilized latent space.
As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) -- we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
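For readers who want to try these diagnostics, here is a minimal NumPy sketch using common definitions (participation ratio for hard rank, entropy-based effective rank for soft rank, top-k energy share for spectral concentration); the exact normalizations and the SUI composite used in the paper may differ.

```python
import numpy as np

def spectral_diagnostics(acts: np.ndarray, k: int = 10):
    """Sketch of spectral-utilization diagnostics for an FFN activation matrix.

    acts: (num_tokens, ffn_width) activations collected from one layer.
    Returns a hard rank (participation ratio), a soft rank (entropy-based
    effective rank), and a spectral concentration (energy share of the top-k
    modes). These follow common definitions; the paper's exact formulas and
    its SUI composite may differ.
    """
    # Singular values of the centered activation matrix, sorted descending.
    s = np.linalg.svd(acts - acts.mean(axis=0), compute_uv=False)
    energy = s ** 2
    p = energy / energy.sum()                            # normalized spectrum

    hard_rank = energy.sum() ** 2 / (energy ** 2).sum()  # participation ratio
    soft_rank = np.exp(-np.sum(p * np.log(p + 1e-12)))   # Shannon (effective) rank
    concentration = p[:k].sum()                          # energy in top-k modes
    return hard_rank, soft_rank, concentration

# Hypothetical usage on random data as a stand-in for real FFN activations
# (3072 = GPT-2 small's FFN width):
acts = np.random.randn(1024, 3072)
print(spectral_diagnostics(acts))
```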
Community
In this (EMNLP 2025 Main Conference) paper, we study how FFNs in transformers actually use their hidden dimensions through the lens of (eigen)spectral analysis. Three key insights:
Asymmetry in the spectrum
We find that soft rank (tail spectrum) follows an almost perfect power law with width, while hard rank (dominant modes) grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mainly adds low-energy tail directions, while the “useful” subspace saturates much earlier.
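As a concrete illustration, one common way to test such a power-law relationship is a linear fit in log-log space over ranks measured at several FFN widths. The sketch below uses purely synthetic numbers, not measurements from the paper:

```python
import numpy as np

def fit_power_law(widths, ranks):
    """Fit rank ≈ c * width**alpha via linear regression in log-log space.

    A power law is a straight line in log-log coordinates, so the slope of
    that line is the scaling exponent alpha.
    """
    slope, intercept = np.polyfit(np.log(widths), np.log(ranks), deg=1)
    return slope, np.exp(intercept)   # (alpha, c)

# Purely synthetic sanity check (not measured values from the paper):
# generate ranks from a known power law and verify the fit recovers it.
widths = np.array([2048.0, 4096.0, 8192.0, 16384.0])
fake_soft_ranks = 3.0 * widths ** 0.8
print(fit_power_law(widths, fake_soft_ranks))   # ~ (0.8, 3.0)
```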
Why does GPT-2 use 4× and LLaMA ~2.67× FFN width?
This spectral asymmetry offers a simple rationale for why GPT-2 chose a 4× FFN width multiplier and LLaMA chose ~2.67×. Hard rank saturates early while soft rank keeps expanding, so the marginal gain in expressivity drops sharply beyond ~2.67–4×. Families emphasizing tail expressivity (GPT-2) go to 4×, while parameter-efficient ones (LLaMA) can stop closer to 2.67× without losing dominant-mode capacity.
Hyperspherical learning flattens the asymmetry
When we constrain activations and weights to the unit sphere (e.g., in nGPT, Loshchilov et al., ICLR'25), dominant modes continue to grow instead of stalling, tails expand without overwhelming the spectrum, and spectral utilization flattens into near-linear scaling. This indicates that width is actually being used, not just pooled in the tail. Thus, hyperspherical learning emerges as a promising representational technique for more balanced FFN utilization.
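For intuition, here is a minimal PyTorch sketch of the hyperspherical constraint: weight rows and activations are projected onto the unit sphere. It illustrates only the basic idea, not the exact nGPT implementation.

```python
import torch

def to_unit_sphere(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Project vectors onto the unit hypersphere along `dim`."""
    return x / (x.norm(dim=dim, keepdim=True) + eps)

class SphericalLinear(torch.nn.Module):
    """Toy linear layer whose weight rows and outputs live on the unit sphere.

    Only a sketch of the hyperspherical-learning idea; nGPT's actual recipe
    (learned scaling factors, normalization after each residual update, etc.)
    is more involved.
    """
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = to_unit_sphere(self.weight)       # unit-norm weight rows
        return to_unit_sphere(x @ w.T)        # unit-norm activations

# Hypothetical usage:
layer = SphericalLinear(512, 2048)
h = to_unit_sphere(torch.randn(4, 512))
print(layer(h).norm(dim=-1))  # ≈ 1.0 for every row
```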
Note:
What’s exciting about hyperspherical learning [1] is that it isn’t just our own curiosity; it has been getting attention in the community lately [2], and our results add a new perspective: it helps FFNs make more balanced and efficient use of width.
[1] Liu et al., Learning with hyperspherical uniformity, AISTATS 2021
[2] Jeremy Bernstein, Modular Manifolds, https://thinkingmachines.ai/blog/modular-manifolds/, 2025
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Model Merging Scaling Laws in Large Language Models (2025)
- Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning (2025)
- Scaling with Collapse: Efficient and Predictable Training of LLM Families (2025)
- Scaling Laws are Redundancy Laws (2025)
- xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity (2025)
- Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models (2025)
- Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models (2025)