Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3
The pith
Test-time scaling over perception breaks the circular dependency in multimodal visual reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTSP treats perception itself as a scalable inference process: it generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty.
What carries the argument
The TTSP loop of trace generation, entropy filtering, knowledge distillation, and uncertainty-directed refinement.
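The four-stage loop can be sketched in a few lines of Python. This is a hypothetical skeleton, not the paper's implementation: `generate_trace`, `trace_entropy`, and `distill` are stand-in callables for whatever perception, confidence, and distillation machinery a concrete system supplies.

```python
def ttsp(image, question, generate_trace, trace_entropy, distill,
         rounds=3, traces_per_round=4, entropy_cutoff=1.0):
    """Sketch of the TTSP loop: generate -> filter -> distill -> refine."""
    knowledge = []  # distilled, validated observations accumulated so far
    focus = None    # trace flagged as most uncertain, steering the next round
    for _ in range(rounds):
        # 1. Generate several exploratory perception traces.
        traces = [generate_trace(image, question, knowledge, focus)
                  for _ in range(traces_per_round)]
        # 2. Keep only low-entropy (high-confidence) traces.
        kept = [t for t in traces if trace_entropy(t) < entropy_cutoff]
        # 3. Distill validated observations into structured knowledge.
        knowledge.extend(distill(t) for t in kept)
        # 4. Direct subsequent exploration toward unresolved uncertainty.
        uncertain = [t for t in traces if trace_entropy(t) >= entropy_cutoff]
        focus = max(uncertain, key=trace_entropy, default=None)
    return knowledge
```

Under these assumptions the scaling knob is `traces_per_round` times `rounds`: more traces buy more exploration without touching model weights.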
If this is right
- TTSP improves performance on high-resolution and general multimodal reasoning tasks for backbones of varying sizes.
- The framework exhibits favorable scaling behavior as more perception traces are generated.
- Token usage remains efficient while accuracy rises, suggesting perception scaling can be cheaper than model scaling.
- Robustness increases under perceptual uncertainty by focusing exploration on unresolved areas.
Where Pith is reading between the lines
- The same iterative filtering pattern could be applied to other modalities where evidence gathering and decision-making are interdependent.
- Token-efficient perception scaling may allow smaller backbones to match larger ones on visual tasks without retraining.
- Different confidence estimators or distillation formats could be substituted and compared directly on the same trace set.
Load-bearing premise
Entropy-based confidence on perception traces can separate useful observations from unreliable ones without discarding evidence the final answer needs or adding new systematic errors.
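To make the premise concrete (this is an illustrative estimator, not necessarily the paper's exact one), a per-trace confidence score can be taken as the mean Shannon entropy over the trace's per-token output distributions, with lower entropy read as higher confidence:

```python
import math

def mean_token_entropy(token_distributions):
    """Mean Shannon entropy (nats) over per-token probability distributions.

    token_distributions: one probability vector per generated token.
    Lower mean entropy is interpreted as higher trace confidence.
    """
    entropies = []
    for probs in token_distributions:
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)

# A near-deterministic trace scores far lower than a maximally uncertain one.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
```

The premise is exactly that thresholding this score separates good traces from bad ones; the failure mode is a trace that is uncertain in wording but correct in content.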
What would settle it
A controlled test on a benchmark where high-entropy traces contain the decisive visual detail; removing or inverting the entropy filter should then cause measurable accuracy drops relative to the full TTSP pipeline.
Original abstract
Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a 'Grounding Paradox' in multimodal LLMs that invoke visual tools (e.g., zooming, cropping) during inference: models must decide where to look before possessing the evidence to decide correctly. It proposes Test-time Scaling over Perception (TTSP), which generates multiple exploratory perception traces, applies entropy-based confidence estimation to filter unreliable traces, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Experiments on high-resolution and general multimodal reasoning benchmarks are reported to show consistent outperformance over strong baselines across backbone sizes, plus favorable scalability and token efficiency.
Significance. If the empirical claims hold, TTSP offers a concrete mechanism for test-time scaling of perception itself, which could improve robustness in fine-grained visual reasoning tasks where current MLLMs are brittle. The iterative generate-filter-distill loop is a structured way to handle perceptual uncertainty without requiring additional training, and the reported token efficiency suggests practical advantages over naive scaling of context or model size.
Major comments (1)
- [Method (perception trace filtering and distillation)] The entropy-based filtering step is load-bearing for the central claim that TTSP resolves the Grounding Paradox without introducing new biases. In fine-grained visual tasks, high entropy frequently signals legitimate perceptual ambiguity rather than outright error; discarding such traces risks eliminating evidence needed to resolve uncertainty in later iterations. The manuscript should add (a) a correlation analysis between per-trace entropy and ground-truth accuracy on held-out examples and (b) an ablation on the filtering threshold (or confidence cutoff) showing that performance does not degrade when uncertain-but-correct traces are retained. Without these, the filtering heuristic remains an unvalidated assumption.
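Both analyses the comment requests are cheap to run. A minimal sketch (hypothetical data; any correlation and sweep implementation would do): compute the correlation between per-trace entropy and a 0/1 correctness label, then sweep the filtering cutoff and record accuracy among retained traces.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between per-trace entropy and 0/1 correctness."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def threshold_ablation(entropies, correct, cutoffs):
    """For each cutoff, accuracy among traces kept (entropy < cutoff)."""
    out = {}
    for c in cutoffs:
        kept = [ok for e, ok in zip(entropies, correct) if e < c]
        out[c] = sum(kept) / len(kept) if kept else None
    return out
```

A strongly negative correlation and an accuracy curve that is flat across cutoffs would support the filtering heuristic; accuracy that drops when the cutoff tightens would indicate that uncertain-but-correct traces are being discarded.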
Minor comments (2)
- [Abstract] The abstract asserts 'consistent outperformance' and 'favorable scalability' but contains no numerical results, baseline names, or dataset sizes. Adding one or two key quantitative highlights (e.g., accuracy deltas and token counts on the primary benchmark) would strengthen the summary.
- [Method] Notation for entropy estimation and the distillation step should be formalized with an equation or pseudocode; current description leaves the precise confidence threshold and knowledge representation ambiguous.
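One plausible formalization of the kind this comment requests (the notation is ours, not taken from the paper): for a perception trace $\tau$ emitting token distributions $p_1, \dots, p_T$ over vocabulary $\mathcal{V}$,

```latex
H(\tau) \;=\; \frac{1}{T}\sum_{t=1}^{T}
  \Big( -\sum_{v \in \mathcal{V}} p_t(v)\,\log p_t(v) \Big),
\qquad
\text{retain } \tau \iff H(\tau) < \epsilon,
```

where $\epsilon$ is the confidence cutoff; the distilled knowledge would then be built only from observations in retained traces. Whatever form the authors adopt, pinning down $H$ and $\epsilon$ explicitly would resolve the ambiguity the comment flags.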
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address the major comment point by point below.
Point-by-point responses
Referee: [Method (perception trace filtering and distillation)] The entropy-based filtering step is load-bearing for the central claim that TTSP resolves the Grounding Paradox without introducing new biases. In fine-grained visual tasks, high entropy frequently signals legitimate perceptual ambiguity rather than outright error; discarding such traces risks eliminating evidence needed to resolve uncertainty in later iterations. The manuscript should add (a) a correlation analysis between per-trace entropy and ground-truth accuracy on held-out examples and (b) an ablation on the filtering threshold (or confidence cutoff) showing that performance does not degrade when uncertain-but-correct traces are retained. Without these, the filtering heuristic remains an unvalidated assumption.
Authors: We appreciate this insightful observation on the entropy-based filtering mechanism. We agree that high entropy can reflect genuine perceptual ambiguity rather than error, and that additional validation is needed to confirm the heuristic does not discard useful evidence. In the revised manuscript, we will add (a) a correlation analysis between per-trace entropy and ground-truth accuracy on held-out examples, and (b) an ablation study on the filtering threshold (including cases where uncertain-but-correct traces are retained) to demonstrate that performance remains stable. These analyses will empirically support the filtering step and clarify its behavior under ambiguity.
Revision: yes
Circularity Check
No derivation chain present; empirical framework only
Full rationale
The paper describes TTSP as a procedural empirical framework: generate multiple perception traces, apply entropy-based filtering, distill observations, and iterate. No equations, first-principles derivations, predictions, or mathematical reductions appear in the abstract or method summary. No self-citations, ansatzes, or uniqueness theorems are invoked to support any claim. The reader's assessment correctly notes the absence of derivations or fitted-parameter predictions. Without a derivation chain to inspect, none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, etc.) can apply. The central proposal is an algorithmic recipe evaluated empirically, not a result forced by its own inputs.