Thinking While Listening: Simple Test Time Scaling For Audio Classification
Abstract
A framework incorporating reasoning into audio classification improves performance through test-time scaling and lightweight retraining of embedding matrices.
We propose a framework that enables neural models to "think while listening" to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach, retraining only the embedding matrix of a frozen, smaller model such as GPT-2, can surpass the performance of billion-parameter text-based reasoning models.
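To make the two ingredients in the abstract concrete, below is a minimal sketch of (i) freezing GPT-2 and retraining only its token-embedding matrix and (ii) test-time scaling by sampling several traces and majority-voting the predicted category. This is an illustration, not the authors' implementation: the Hugging Face `transformers` usage, the toy label set, the prompt format, and the vote-over-generated-text aggregation are all assumptions made for the example.

```python
# Hedged sketch, not the paper's code: illustrates embedding-only retraining of a
# frozen GPT-2 and majority voting over sampled reasoning traces (test-time scaling).
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze every parameter, then unfreeze only the token-embedding matrix
# (tied to the output head in GPT-2), so training updates embeddings alone.
for p in model.parameters():
    p.requires_grad = False
model.transformer.wte.weight.requires_grad = True

print([n for n, p in model.named_parameters() if p.requires_grad])
# -> ['transformer.wte.weight']

# Hypothetical label set for illustration only.
CLASSES = ["dog_bark", "siren", "rain"]

def classify(prompt: str, num_traces: int = 8) -> str:
    """Sample several traces and majority-vote the category named in them."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.8,
            max_new_tokens=64,
            num_return_sequences=num_traces,
            pad_token_id=tokenizer.eos_token_id,
        )
    votes = []
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        votes.extend(c for c in CLASSES if c in text)
    return Counter(votes).most_common(1)[0][0] if votes else CLASSES[0]
```

Increasing `num_traces` is the test-time-scaling knob: more sampled traces give the vote more evidence, mirroring the consistent gains the abstract reports as the number of sampled traces grows.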
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models (2025)
- Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation (2025)
- AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing? (2025)
- Deep Think with Confidence (2025)
- Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models (2025)
- Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models (2025)
- Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data (2025)