arxiv:2510.00537

Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Published on Oct 1
Submitted by Nandan Kumar Jha on Oct 3

AI-generated summary

Research on large language models reveals an asymmetric spectral scaling law in feed-forward networks, indicating that increasing width primarily adds low-energy directions while dominant modes saturate early, leading to underutilized latent space.

Abstract

As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) -- we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
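
For concreteness, here is a minimal sketch (not the authors' code) of the diagnostics named above, computed from the singular values of an FFN activation matrix. It assumes the standard participation-ratio and Shannon/effective-rank definitions; the exact normalizations, the choice of top-k for spectral concentration, and the composite SUI formula are defined in the paper, so the SUI line below is only a labeled stand-in.

```python
import numpy as np

def spectral_diagnostics(acts: np.ndarray, top_k: int = 10) -> dict:
    """acts: (num_tokens, d_ff) matrix of FFN hidden activations."""
    s = np.linalg.svd(acts, compute_uv=False)   # singular values, descending
    energy = s ** 2                             # spectral "energy" per direction
    p = energy / energy.sum()                   # normalized spectrum

    # Hard Rank: participation ratio, (sum of energies)^2 / sum of squared energies.
    hard_rank = float(energy.sum() ** 2 / (energy ** 2).sum())

    # Soft Rank: Shannon (effective) rank, exp of the spectral entropy.
    soft_rank = float(np.exp(-(p * np.log(p + 1e-12)).sum()))

    # Spectral Concentration: fraction of energy in the top-k directions
    # (the particular k used here is an assumption).
    concentration = float(p[:top_k].sum())

    # SUI: a composite index defined in the paper; as a stand-in we report
    # soft rank normalized by the FFN width.
    sui_proxy = soft_rank / acts.shape[1]

    return {"hard_rank": hard_rank, "soft_rank": soft_rank,
            "spectral_concentration": concentration, "sui_proxy": sui_proxy}

# Example: diagnostics for a random batch of 1024 tokens through a width-512 FFN.
print(spectral_diagnostics(np.random.randn(1024, 512)))
```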

Community

Comment from the paper author and submitter:

In this paper (EMNLP 2025 Main Conference), we study how FFNs in transformers actually use their hidden dimensions through the lens of (eigen)spectral analysis. Three key insights:

  1. Asymmetry in the spectrum
    We find that soft rank (tail spectrum) follows an almost perfect power law with width, while hard rank (dominant modes) grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mainly adds low-energy tail directions, while the “useful” subspace saturates much earlier. (A minimal fitting sketch follows this list.)

  2. Why does GPT-2 use 4× and LLaMA 2.67× FFN width?
    This spectral asymmetry offers a simple rationale for why GPT-2 chose a 4× FFN width multiplier and LLaMA chose ~2.67×. Hard rank saturates early, but soft rank keeps expanding, so the marginal gain in expressivity drops sharply beyond ~2.67–4×. Families emphasizing tail expressivity (GPT-2) go to 4×, while parameter-efficient ones (LLaMA) can stop closer to 2.67× without losing dominant-mode capacity.

  3. Hyperspherical learning flattens the asymmetry
    When we constrain activations and weights to the unit sphere (e.g., in nGPT, Loshchilov et al., ICLR'25), dominant modes continue to grow instead of stalling, tails expand without overwhelming them, and spectral utilization flattens into near-linear scaling. This indicates that width is actually being used, not just pooled in the tail. Thus, hyperspherical learning emerges as a promising representational technique for more balanced FFN utilization. (A minimal normalization sketch appears below, after the note.)
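
For insight 1, here is a minimal sketch of the kind of power-law fit involved, done in log-log space. The (width, soft rank) pairs below are made up for illustration; the actual measurements and fitted exponents are reported in the paper.

```python
import numpy as np

# Hypothetical (FFN width, soft rank) pairs; real values come from the paper's measurements.
widths    = np.array([1024, 2048, 4096, 8192, 16384])
soft_rank = np.array([ 310,  590, 1150, 2240,  4400])

# Fit soft_rank ≈ a * width^b via linear regression on log-log data:
# log(soft_rank) = b * log(width) + log(a).
b, log_a = np.polyfit(np.log(widths), np.log(soft_rank), deg=1)
print(f"soft_rank ≈ {np.exp(log_a):.3f} * width^{b:.2f}")

# Running the same fit on hard rank would give a sublinear exponent (b < 1) with much
# larger residuals, which is the asymmetry described above.
```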

Note:
What’s exciting about hyperspherical learning [1] is that it isn’t just our own curiosity; it has been getting attention in the community lately [2], and our results add a new perspective: it helps FFNs make more balanced and efficient use of width.

[1] Liu et al., Learning with Hyperspherical Uniformity, AISTATS 2021

[2] Jeremy Bernstein, Modular Manifolds, https://thinkingmachines.ai/blog/modular-manifolds/, 2025
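
For readers curious what the hyperspherical constraint looks like in code, here is a minimal sketch assuming a simple unit-norm projection of activations and weight rows (PyTorch). It is only illustrative; nGPT's actual parameterization (learned scaling factors, normalized residual updates, etc.) is richer, so see Loshchilov et al., ICLR'25 for the real formulation.

```python
import torch
import torch.nn.functional as F

class SphericalFFN(torch.nn.Module):
    """Toy FFN whose activations and weight rows live on the unit sphere."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w_out = torch.nn.Linear(d_ff, d_model, bias=False)

    @torch.no_grad()
    def renormalize_weights(self):
        # Project each weight row back onto the unit sphere (e.g., after every optimizer step).
        self.w_in.weight.div_(self.w_in.weight.norm(dim=1, keepdim=True))
        self.w_out.weight.div_(self.w_out.weight.norm(dim=1, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)                  # inputs on the unit sphere
        h = F.silu(self.w_in(x))                    # hidden activations
        return F.normalize(self.w_out(h), dim=-1)   # outputs back on the unit sphere

# Usage: ffn = SphericalFFN(768, 2048); call ffn.renormalize_weights() after each update.
```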

