Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3
The pith
Test-time scaling over perception breaks the circular dependency in multimodal visual reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTSP treats perception itself as a scalable inference process: it generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty.
What carries the argument
The TTSP loop of trace generation, entropy filtering, knowledge distillation, and uncertainty-directed refinement.
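The four-stage loop can be sketched in a few lines of Python. This is a hypothetical skeleton, not the paper's implementation: `generate_trace`, `trace_entropy`, and `distill` are stand-in callables for whatever perception, confidence, and distillation machinery a concrete system supplies.

```python
def ttsp(image, question, generate_trace, trace_entropy, distill,
         rounds=3, traces_per_round=4, entropy_cutoff=1.0):
    """Sketch of the TTSP loop: generate -> filter -> distill -> refine."""
    knowledge = []  # distilled, validated observations accumulated so far
    focus = None    # trace flagged as most uncertain, steering the next round
    for _ in range(rounds):
        # 1. Generate several exploratory perception traces.
        traces = [generate_trace(image, question, knowledge, focus)
                  for _ in range(traces_per_round)]
        # 2. Keep only low-entropy (high-confidence) traces.
        kept = [t for t in traces if trace_entropy(t) < entropy_cutoff]
        # 3. Distill validated observations into structured knowledge.
        knowledge.extend(distill(t) for t in kept)
        # 4. Direct subsequent exploration toward unresolved uncertainty.
        uncertain = [t for t in traces if trace_entropy(t) >= entropy_cutoff]
        focus = max(uncertain, key=trace_entropy, default=None)
    return knowledge
```

Under these assumptions the scaling knob is `traces_per_round` times `rounds`: more traces buy more exploration without touching model weights.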
If this is right
- TTSP improves performance on high-resolution and general multimodal reasoning tasks for backbones of varying sizes.
- The framework exhibits favorable scaling behavior as more perception traces are generated.
- Token usage remains efficient while accuracy rises, suggesting perception scaling can be cheaper than model scaling.
- Robustness increases under perceptual uncertainty by focusing exploration on unresolved areas.
Where Pith is reading between the lines
- The same iterative filtering pattern could be applied to other modalities where evidence gathering and decision-making are interdependent.
- Token-efficient perception scaling may allow smaller backbones to match larger ones on visual tasks without retraining.
- Different confidence estimators or distillation formats could be substituted and compared directly on the same trace set.
Load-bearing premise
Entropy-based confidence on perception traces can separate useful observations from unreliable ones without discarding evidence the final answer needs or adding new systematic errors.
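To make the premise concrete (this is an illustrative estimator, not necessarily the paper's exact one), a per-trace confidence score can be taken as the mean Shannon entropy over the trace's per-token output distributions, with lower entropy read as higher confidence:

```python
import math

def mean_token_entropy(token_distributions):
    """Mean Shannon entropy (nats) over per-token probability distributions.

    token_distributions: one probability vector per generated token.
    Lower mean entropy is interpreted as higher trace confidence.
    """
    entropies = []
    for probs in token_distributions:
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)

# A near-deterministic trace scores far lower than a maximally uncertain one.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
```

The premise is exactly that thresholding this score separates good traces from bad ones; the failure mode is a trace that is uncertain in wording but correct in content.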
What would settle it
A controlled test on a benchmark where high-entropy traces contain the decisive visual detail; removing or inverting the entropy filter should then cause measurable accuracy drops relative to the full TTSP pipeline.
Original abstract
Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a 'Grounding Paradox' in multimodal LLMs that invoke visual tools (e.g., zooming, cropping) during inference: models must decide where to look before possessing the evidence to decide correctly. It proposes Test-time Scaling over Perception (TTSP), which generates multiple exploratory perception traces, applies entropy-based confidence estimation to filter unreliable traces, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Experiments on high-resolution and general multimodal reasoning benchmarks are reported to show consistent outperformance over strong baselines across backbone sizes, plus favorable scalability and token efficiency.
Significance. If the empirical claims hold, TTSP offers a concrete mechanism for test-time scaling of perception itself, which could improve robustness in fine-grained visual reasoning tasks where current MLLMs are brittle. The iterative generate-filter-distill loop is a structured way to handle perceptual uncertainty without requiring additional training, and the reported token efficiency suggests practical advantages over naive scaling of context or model size.
Major comments (1)
- [Method (perception trace filtering and distillation)] The entropy-based filtering step is load-bearing for the central claim that TTSP resolves the Grounding Paradox without introducing new biases. In fine-grained visual tasks, high entropy frequently signals legitimate perceptual ambiguity rather than outright error; discarding such traces risks eliminating evidence needed to resolve uncertainty in later iterations. The manuscript should add (a) a correlation analysis between per-trace entropy and ground-truth accuracy on held-out examples and (b) an ablation on the filtering threshold (or confidence cutoff) showing that performance does not degrade when uncertain-but-correct traces are retained. Without these, the filtering heuristic remains an unvalidated assumption.
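Both analyses the comment requests are cheap to run. A minimal sketch (hypothetical data; any correlation and sweep implementation would do): compute the correlation between per-trace entropy and a 0/1 correctness label, then sweep the filtering cutoff and record accuracy among retained traces.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between per-trace entropy and 0/1 correctness."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def threshold_ablation(entropies, correct, cutoffs):
    """For each cutoff, accuracy among traces kept (entropy < cutoff)."""
    out = {}
    for c in cutoffs:
        kept = [ok for e, ok in zip(entropies, correct) if e < c]
        out[c] = sum(kept) / len(kept) if kept else None
    return out
```

A strongly negative correlation and an accuracy curve that is flat across cutoffs would support the filtering heuristic; accuracy that drops when the cutoff tightens would indicate that uncertain-but-correct traces are being discarded.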
Minor comments (2)
- [Abstract] The abstract asserts 'consistent outperformance' and 'favorable scalability' but contains no numerical results, baseline names, or dataset sizes. Adding one or two key quantitative highlights (e.g., accuracy deltas and token counts on the primary benchmark) would strengthen the summary.
- [Method] Notation for entropy estimation and the distillation step should be formalized with an equation or pseudocode; current description leaves the precise confidence threshold and knowledge representation ambiguous.
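One plausible formalization of the kind this comment requests (the notation is ours, not taken from the paper): for a perception trace $\tau$ emitting token distributions $p_1, \dots, p_T$ over vocabulary $\mathcal{V}$,

```latex
H(\tau) \;=\; \frac{1}{T}\sum_{t=1}^{T}
  \Big( -\sum_{v \in \mathcal{V}} p_t(v)\,\log p_t(v) \Big),
\qquad
\text{retain } \tau \iff H(\tau) < \epsilon,
```

where $\epsilon$ is the confidence cutoff; the distilled knowledge would then be built only from observations in retained traces. Whatever form the authors adopt, pinning down $H$ and $\epsilon$ explicitly would resolve the ambiguity the comment flags.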
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address the major comment point by point below.
Point-by-point responses
Referee: [Method (perception trace filtering and distillation)] The entropy-based filtering step is load-bearing for the central claim that TTSP resolves the Grounding Paradox without introducing new biases. In fine-grained visual tasks, high entropy frequently signals legitimate perceptual ambiguity rather than outright error; discarding such traces risks eliminating evidence needed to resolve uncertainty in later iterations. The manuscript should add (a) a correlation analysis between per-trace entropy and ground-truth accuracy on held-out examples and (b) an ablation on the filtering threshold (or confidence cutoff) showing that performance does not degrade when uncertain-but-correct traces are retained. Without these, the filtering heuristic remains an unvalidated assumption.
Authors: We appreciate this insightful observation on the entropy-based filtering mechanism. We agree that high entropy can reflect genuine perceptual ambiguity rather than error, and that additional validation is needed to confirm the heuristic does not discard useful evidence. In the revised manuscript, we will add (a) a correlation analysis between per-trace entropy and ground-truth accuracy on held-out examples, and (b) an ablation study on the filtering threshold (including cases where uncertain-but-correct traces are retained) to demonstrate that performance remains stable. These analyses will empirically support the filtering step and clarify its behavior under ambiguity.
Revision: yes
Circularity Check
No derivation chain present; empirical framework only
Full rationale
The paper describes TTSP as a procedural empirical framework: generate multiple perception traces, apply entropy-based filtering, distill observations, and iterate. No equations, first-principles derivations, predictions, or mathematical reductions appear in the abstract or method summary. No self-citations, ansatzes, or uniqueness theorems are invoked to support any claim. The reader's assessment correctly notes the absence of derivations or fitted-parameter predictions. Without a derivation chain to inspect, none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, etc.) can apply. The central proposal is an algorithmic recipe evaluated empirically, not a result forced by its own inputs.