Efficient Test-Time Scaling for Small Vision-Language Models
Abstract
Two novel test-time scaling strategies, Test-Time Augmentation and Test-Time Adaptation, improve small vision-language models' performance without compromising computational efficiency.
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both across model scales within a family and across different VLMs, without additional tuning.
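As a rough illustration of the token-level aggregation behind TTAug (a minimal sketch, not the paper's implementation: `vlm_next_token_logits`, the toy projection, and the augmentation list are all placeholder assumptions), one can average the next-token distributions across augmented views at each decoding step:

```python
import torch

# Placeholder for a real VLM's per-step decoder: given the augmented image
# views and the tokens generated so far, return next-token logits.
def vlm_next_token_logits(views: torch.Tensor, prefix: torch.Tensor) -> torch.Tensor:
    torch.manual_seed(0)
    vocab_size = 32
    proj = torch.randn(views.shape[-1], vocab_size)  # toy projection, not a VLM
    return views.mean(dim=(1, 2)) @ proj             # (n_views, vocab_size)

def ttaug_generate(image: torch.Tensor, augment_fns, max_new_tokens: int = 5):
    """Greedy decoding with token-level aggregation across augmented views."""
    views = torch.stack([f(image) for f in augment_fns])  # (n_views, H, W, C)
    tokens: list[int] = []
    for _ in range(max_new_tokens):
        prefix = torch.tensor(tokens, dtype=torch.long)    # unused by the toy model
        logits = vlm_next_token_logits(views, prefix)      # one pass per view
        consensus = logits.softmax(dim=-1).mean(dim=0)     # average probabilities
        tokens.append(int(consensus.argmax()))             # greedy pick on consensus
        # A real decoder would also stop once the EOS token is selected.
    return tokens

image = torch.rand(64, 64, 3)
augments = [lambda x: x, lambda x: x.flip(1), lambda x: 0.9 * x]
print(ttaug_generate(image, augments))
```

The design point is that aggregation happens per token during decoding, rather than by picking one of several fully decoded answers after the fact.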
Community
We propose two efficient and effective methods for improving small vision-language models at test time: TTAug (input augmentation + token-level aggregation) and TTAdapt (parameter adaptation via pseudolabels).
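For TTAdapt, a similarly hedged sketch: the TTAug consensus prediction serves as a pseudolabel for a self-supervised parameter update at inference time. Here `ToyHead`, the naive pooling, and the choice to update all parameters are illustrative assumptions, not the paper's actual model or adaptation target:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a VLM head: pooled image features -> vocab logits.
# Which parameters a real system adapts (e.g., only normalization layers)
# is a design choice not specified in this sketch.
class ToyHead(nn.Module):
    def __init__(self, dim: int = 3, vocab: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, images: torch.Tensor) -> torch.Tensor:  # (B, H, W, C)
        feats = images.mean(dim=(1, 2))                        # naive pooling
        return self.proj(self.norm(feats))                     # (B, vocab)

def ttadapt_step(model: nn.Module, image: torch.Tensor, augment_fns, lr: float = 1e-3):
    """One adaptation step: TTAug consensus pseudolabel -> cross-entropy update."""
    views = torch.stack([f(image) for f in augment_fns])
    with torch.no_grad():
        pseudo = model(views).softmax(-1).mean(0).argmax()     # consensus pseudolabel
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = F.cross_entropy(model(views), pseudo.expand(len(views)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return pseudo.item(), loss.item()

model = ToyHead()
image = torch.rand(64, 64, 3)
augments = [lambda x: x, lambda x: x.flip(1), lambda x: 0.9 * x]
print(ttadapt_step(model, image, augments))
```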
Project Page: https://monurcan.github.io/efficient_test_time_scaling
Code: https://github.com/monurcan/efficient_test_time_scaling
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models (2025)
- Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models (2025)
- ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding (2025)
- SpecVLM: Fast Speculative Decoding in Vision-Language Models (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (2025)
- Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs (2025)