Efficient Test-Time Scaling for Small Vision-Language Models
Abstract
Two novel test-time scaling strategies, Test-Time Augmentation and Test-Time Adaptation, improve small vision-language models' performance without compromising computational efficiency.
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both across model scales within a family and across different VLMs, without additional tuning.
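As a rough illustration of the token-level aggregation behind TTAug (a minimal sketch, not the paper's implementation: `vlm_next_token_logits`, the toy projection, and the augmentation list are all placeholder assumptions), one can average the next-token distributions across augmented views at each decoding step:

```python
import torch

# Placeholder for a real VLM's per-step decoder: given the augmented image
# views and the tokens generated so far, return next-token logits.
def vlm_next_token_logits(views: torch.Tensor, prefix: torch.Tensor) -> torch.Tensor:
    torch.manual_seed(0)
    vocab_size = 32
    proj = torch.randn(views.shape[-1], vocab_size)  # toy projection, not a VLM
    return views.mean(dim=(1, 2)) @ proj             # (n_views, vocab_size)

def ttaug_generate(image: torch.Tensor, augment_fns, max_new_tokens: int = 5):
    """Greedy decoding with token-level aggregation across augmented views."""
    views = torch.stack([f(image) for f in augment_fns])  # (n_views, H, W, C)
    tokens: list[int] = []
    for _ in range(max_new_tokens):
        prefix = torch.tensor(tokens, dtype=torch.long)    # unused by the toy model
        logits = vlm_next_token_logits(views, prefix)      # one pass per view
        consensus = logits.softmax(dim=-1).mean(dim=0)     # average probabilities
        tokens.append(int(consensus.argmax()))             # greedy pick on consensus
        # A real decoder would also stop once the EOS token is selected.
    return tokens

image = torch.rand(64, 64, 3)
augments = [lambda x: x, lambda x: x.flip(1), lambda x: 0.9 * x]
print(ttaug_generate(image, augments))
```

The design point is that aggregation happens per token during decoding, rather than by picking one of several fully decoded answers after the fact.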
Community
We propose two efficient and effective methods for improving small vision-language models at test time: TTAug (input augmentation + token-level aggregation) and TTAdapt (parameter adaptation via pseudolabels).
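For TTAdapt, a similarly hedged sketch: the TTAug consensus prediction serves as a pseudolabel for a self-supervised parameter update at inference time. Here `ToyHead`, the naive pooling, and the choice to update all parameters are illustrative assumptions, not the paper's actual model or adaptation target:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a VLM head: pooled image features -> vocab logits.
# Which parameters a real system adapts (e.g., only normalization layers)
# is a design choice not specified in this sketch.
class ToyHead(nn.Module):
    def __init__(self, dim: int = 3, vocab: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, images: torch.Tensor) -> torch.Tensor:  # (B, H, W, C)
        feats = images.mean(dim=(1, 2))                        # naive pooling
        return self.proj(self.norm(feats))                     # (B, vocab)

def ttadapt_step(model: nn.Module, image: torch.Tensor, augment_fns, lr: float = 1e-3):
    """One adaptation step: TTAug consensus pseudolabel -> cross-entropy update."""
    views = torch.stack([f(image) for f in augment_fns])
    with torch.no_grad():
        pseudo = model(views).softmax(-1).mean(0).argmax()     # consensus pseudolabel
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = F.cross_entropy(model(views), pseudo.expand(len(views)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return pseudo.item(), loss.item()

model = ToyHead()
image = torch.rand(64, 64, 3)
augments = [lambda x: x, lambda x: x.flip(1), lambda x: 0.9 * x]
print(ttadapt_step(model, image, augments))
```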
Project Page: https://monurcan.github.io/efficient_test_time_scaling
Code: https://github.com/monurcan/efficient_test_time_scaling
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models (2025)
- Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models (2025)
- ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding (2025)
- SpecVLM: Fast Speculative Decoding in Vision-Language Models (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (2025)
- Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs (2025)