Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning

Caifeng Shan; Chen Ma; Haokun Lin; Li Zhu; Teng Wang; Yichen Wu; Yinan Zhou; Ying Shan; Yuxin Chen; Zhenan Sun

arxiv: 2606.30217 · v1 · pith:5DT3YTZOnew · submitted 2026-06-29 · 💻 cs.CL

Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning

Yinan Zhou , Haokun Lin , Yichen Wu , Caifeng Shan , Zhenan Sun , Yuxin Chen , Teng Wang , Chen Ma

show 2 more authors

Li Zhu Ying Shan

This is my paper

Pith reviewed 2026-06-30 05:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords proactive routingmultimodal modelsvisual reasoningdraft-target pairinginference accelerationconfidence estimationmodel competence prediction

0 comments

The pith

Joint ratings of draft and target model competence enable proactive routing of visual queries before any output generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current routing methods for pairing draft and target multimodal models rely on post-hoc signals or supervised fine-tuning, which limits their use in visual reasoning. This paper proposes learning internal confidence estimates in the draft model and joint predictions for the target model to decide routing early. A sympathetic reader would care because this allows deciding which model handles a query before computation begins, promising better efficiency. The approach prioritizes routing to the model best suited for each instance. Experiments show it speeds up inference without accuracy drops on reasoning benchmarks.

Core claim

PRP introduces Draft Rating Learning to give the draft model an internal confidence estimator and Joint Rating Learning to predict target model performance on a query. These enable fine-grained proactive routing at the instance level, accelerating inference substantially while maintaining overall performance on multimodal reasoning tasks.

What carries the argument

The Proactive Routing Paradigm (PRP) using Draft Rating Learning (DRL) and Joint Rating Learning (JRL) to evaluate model competence before output.

If this is right

Fine-grained instance-level proactive routing decisions become feasible before any thinking occurs.
Substantial acceleration of inference is achieved without compromising overall performance.
The target model is allocated samples it excels at rather than the hardest queries.
Routing operates under multimodal settings without relying on post-hoc token probabilities or data-sensitive fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This routing could lower the average compute per query in production multimodal systems.
Similar rating mechanisms might improve efficiency in other model collaboration setups like language-only tasks.
Testing on additional benchmarks with varying query difficulties would further validate the ratings' predictive power.

Load-bearing premise

The internal confidence estimator and joint rating can accurately predict how well each model will handle unseen queries before producing any output.

What would settle it

If routing based on these ratings results in lower accuracy than using the target model alone on the same set of visual reasoning queries, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.30217 by Caifeng Shan, Chen Ma, Haokun Lin, Li Zhu, Teng Wang, Yichen Wu, Yinan Zhou, Ying Shan, Yuxin Chen, Zhenan Sun.

**Figure 1.** Figure 1: Overview comparing two prior paradigms with ours. We enable proactive routing at the onset of inference. Our RL-trained rater provides finer-grained and distinctive signals, and explicitly accounts for the capability of the target model. of reinforcement learning–based post-training methods [6,37,50], their reasoning capabilities are further strengthened, exhibiting emergent behaviors such as “aha moments… view at source ↗

**Figure 2.** Figure 2: Draft model’s percentile-range accuracy comparison of signal ranking across 3 paradigms versus random on MathVerse. Our PRP exhibits the performance retention and distinctive rating distribution, laying a solid foundation for fine-grained routing. ditionally predicts the target model’s suitability for each instance, allowing the system to route not only difficult problems but also those that the target mod… view at source ↗

**Figure 3.** Figure 3: Left: Illustration of Draft Rating Learning. We design reward Rdft and adopt Dynamic Score Substitution to avoid advantage vanishing, while Task-specific Optimization independently optimizes the Rdft-related tokens during training. Right: Illustration of Joint Rating Learning with the target model. We extend the Draft Rating Learning pipeline by introducing Rtgt, which is trained with pre-sampled Acctgt … view at source ↗

**Figure 4.** Figure 4: Left: The per-score instance count and accuracy of Mddf t on MathVerse. Right1: Mdjoint ’s draft score distribution on MathVerse and the draft’s corresponding accuracy. Right-2: The target score distribution and Mt’s corresponding accuracy. In all plots, bars denote the number of samples in each score bin, and lines indicate the accuracy within each bin. 𝑑1= 0.99 𝑑2= 0.00 𝑡1 = 0.08 𝑑1 = 0.99 𝑡2 = 0.86 𝑑2… view at source ↗

**Figure 5.** Figure 5: Left: Three-stage Score-based Proactive Routing. Right: Internal Probability Ranking exhibits a strong correlation with accuracy. scheme that delivers efficient acceleration while preserving performance. Moreover, we introduce a fine-grained partitioning method based on the token-level probabilities of predicted ratings, thereby enabling a more precise evaluation stage. We further analyze the superiority … view at source ↗

**Figure 6.** Figure 6: Comparison of decision latency between traditional routing-after-completion and our proactive routing schemes (Mddf t and Mdjoint ). Intuitively, when scores are high, a more concentrated digit distribution together with a larger pmax indicates greater certainty, which yields a larger p and correlates with higher accuracy; for low scores, the trend reverses. Finally, we sort samples within each score band… view at source ↗

**Figure 7.** Figure 7: Detailed comparison of our proposed routing with random and 2 previous paradigm on ChartQA, MathVista, and MathVerse. the draft model and the actual accuracy (ranked from correct to incorrect). A stronger negative correlation indicates that the ratings more accurately reflects its actual performance. Our DRL and JRL consistently achieve higher correlations, proving that our ratings are more closely aligne… view at source ↗

**Figure 8.** Figure 8: Ablation Study: (a)(b) Ablation study of Internal Probability Ranking (IPR) on ChartQA and MathVista.(c) Score Distribution of SFT Rating Learning. (d) Routing Comparison of SFT and DRL. (e). Mdjoint ’s ablation of different τ for the scaling in signal s = d+ d−t τ on MathVista. (f) Similar accuracy trends across evaluation and test sets allows threshold estimation. (g) Qualitative Result of samples routed… view at source ↗

**Figure 9.** Figure 9: Score distribution comparison on ChartQA and ours out-domain disribution on DocVQA. A More Experiments A.1 Detail Rating distributions ChartQA. As shown in [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Score distribution comparison on MathVerse [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Score distribution comparison on MathVista. Motivated by this observation, we conduct an in-depth analysis of the failure modes of DRL. The three panels on the right visualize how accuracy changes under Paradigm1, Paradigm2, and our routing with explicit per-split accuracies. The yellow curve denotes the overall accuracy of the draft–target mixture, the green curve is the accuracy of the (1 − pt%) portion… view at source ↗

**Figure 12.** Figure 12: Discussion of DRL-based Routing Failure on MathVerse. The fundamental reason for this DRL routing failure is that it ignores the target model’s capability during allocation. Instead of naively sending “easy” examples to the draft and “hard” examples to the target, routing should account for the capability gap between the two models and assign each example to the model for which it is most suitable. This o… view at source ↗

**Figure 13.** Figure 13: Generalization of Learned Rating Capability on M3CoT:(a)(b) are the rating distribution and the routing comparison of the rating model trained using DRL with math data. We generalize well on M3CoT. (c)(d) are the rating distribution and accuracy statistic of each rating of the rating model trained using DRL with M3CoT training data. Generalization of Rating Capability on M3CoT. Based on the data present… view at source ↗

**Figure 14.** Figure 14: We apply a weaker and smaller MLLM VL-Rethinker-7B [44] as new target model and obtain a similar effect [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗

**Figure 15.** Figure 15: The performance on MathVerse with different α in DRL and α&β in JRL. questions, the final answer can often be guessed without attending to the figure. As a result, models may produce the correct final answer while exhibiting incorrect or severely hallucinatory reasoning, introducing bias. We find such severe hallucinations on geometry tasks to be prevalent among GRPO-based models. The right panel of [PI… view at source ↗

**Figure 16.** Figure 16: The DRL-based routing samples on ChartQA [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 17.** Figure 17: The JRL-based routing samples (preserved by draft models) on MathVista [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: The JRL-based routing samples (routed to target models) on MathVista [PITH_FULL_IMAGE:figures/full_fig_p028_18.png] view at source ↗

**Figure 19.** Figure 19: The JRL-based routing samples on MathVerse [PITH_FULL_IMAGE:figures/full_fig_p029_19.png] view at source ↗

**Figure 20.** Figure 20: Failure cases with high ratings during routing. The left panel reveals the lack of training on simple math problems. The right panel shows the annotation error in MathVista discovered by our ratings [PITH_FULL_IMAGE:figures/full_fig_p030_20.png] view at source ↗

read the original abstract

Large multimodal models have achieved strong reasoning on complex visual tasks, but their inference efficiency is often restricted by long chains of thought. A promising solution is to pair a small draft model with a large target model, enabling cooperative inference employing a routing signal that adaptively routes queries to either the draft or target model based on their difficulties for optimal efficiency and accuracy. Yet, the remaining bottleneck is to establish a reliable query difficulty signal under multimodal settings. Existing approaches designed for language models either rely on post-hoc token probabilities, which fall short in multimodal scenarios, or depend on supervised fine-tuning, which is a data-sensitive strategy. Both paradigms perform routing only after a complete output, and ignore whether the target model can actually solve the routed instances. To address this, we propose PRP, a Proactive Routing Paradigm that enables early decision-making by jointly evaluating the competence of both the draft and target models. Our Draft Rating Learning (DRL) equips the draft model with an internal confidence estimator, while Joint Rating Learning (JRL) predicts how well the target model can handle a given query, thereby prioritizing the allocation of samples it excels at rather than the hardest ones. These ratings enable fine-grained, instance-level \textbf{Proactive Routing} and substantially accelerate inference without compromising overall performance. Extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The proactive routing idea targets a real efficiency issue but the abstract supplies zero evidence that the DRL or JRL ratings actually predict competence on unseen queries.

read the letter

The main thing your colleague should know is that this paper claims a proactive routing method can cut inference cost on visual reasoning tasks by deciding early whether to use the small draft model or the large target, using learned ratings instead of post-generation signals. The approach adds Draft Rating Learning to give the draft an internal confidence score and Joint Rating Learning to estimate how well the target would handle the query, then routes accordingly before any output is produced.

What is new is the upfront decision that avoids both post-hoc token probabilities (which the abstract says fail in multimodal cases) and supervised fine-tuning on difficulty labels. The framing correctly notes that prior methods ignore whether the target can actually solve the instance and only act after a full generation.

The paper does a reasonable job identifying the practical bottleneck in long chains of thought for large multimodal models and sketching why existing routing signals are insufficient.

The soft spot is substantial and central: the abstract asserts that extensive experiments validate the method across benchmarks and that the ratings enable acceleration without accuracy loss, yet it contains no metrics, baselines, error bars, training details, or correlation numbers between the ratings and actual solve rates. The stress-test concern lands directly here. If the ratings are learned with any supervision tied to outputs or difficulty, they risk the same data sensitivity the paper criticizes in SFT; if unsupervised, their ability to generalize to unseen multimodal queries is even less assured. Without evidence that the rating thresholds separate solvable from unsolvable instances, the instance-level routing rests on an untested assumption.

This is for people working on inference optimization for vision-language models. A reader already building draft-target systems might pick up the routing formulation, but anyone needing reproducible results will find little to go on from the abstract alone.

I would send it to peer review because the efficiency goal is worth referee attention, but only if the full paper contains the missing correlation statistics and ablations; otherwise it is too thin on evidence.

Referee Report

3 major / 2 minor

Summary. The paper proposes PRP, a Proactive Routing Paradigm for pairing a small draft model with a large target model in multimodal reasoning. It introduces Draft Rating Learning (DRL) to add an internal confidence estimator to the draft model and Joint Rating Learning (JRL) to predict target-model competence on a query. These ratings are intended to support instance-level proactive routing decisions before any output is generated, with the claim that this accelerates inference without accuracy loss. The abstract states that extensive experiments on multiple multimodal reasoning benchmarks validate the approach.

Significance. If the DRL and JRL ratings are shown to generalize and correlate with actual solve rates on unseen queries, the method could offer a practical way to improve inference efficiency in visual reasoning pipelines by avoiding full target-model computation on instances the target is unlikely to solve. The proactive (pre-output) nature distinguishes it from post-hoc routing methods.

major comments (3)

[Abstract] Abstract: the claim that 'extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency' is unsupported by any reported metrics, baselines, error bars, or correlation statistics between ratings and solve rates. This directly undermines evaluation of the central claim that the ratings enable acceleration without compromising performance.
[Method] Method section (DRL/JRL descriptions): no training objective, loss formulation, or supervision signal is specified for learning the internal confidence estimator or the joint rating. Without this, it is impossible to determine whether the ratings are unsupervised, supervised on query difficulty, or otherwise, and whether they avoid the data-sensitivity the paper attributes to prior SFT methods.
[Experiments / Results] Results section: no ablation, correlation analysis, or threshold-validation statistics are provided to show that the rating thresholds separate instances the target model solves from those it does not. This is load-bearing for the proactive-routing claim.

minor comments (2)

[Method] Notation for DRL and JRL is introduced without an explicit equation defining how the ratings are combined into the routing decision.
[Abstract / Experiments] The abstract refers to 'multimodal reasoning benchmarks' without naming them; the experiments section should list the specific datasets (e.g., MMMU, MathVista) and the exact metrics used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful for the referee's insightful comments, which highlight areas where the manuscript can be strengthened. We provide detailed responses to each major comment below and commit to making the necessary revisions.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency' is unsupported by any reported metrics, baselines, error bars, or correlation statistics between ratings and solve rates. This directly undermines evaluation of the central claim that the ratings enable acceleration without compromising performance.

Authors: We agree with the referee that the abstract's claim requires supporting details to be fully substantiated. We will revise the abstract to incorporate specific metrics, baselines, error bars, and references to correlation statistics between the ratings and solve rates, drawing from expanded analyses in the results section. revision: yes
Referee: [Method] Method section (DRL/JRL descriptions): no training objective, loss formulation, or supervision signal is specified for learning the internal confidence estimator or the joint rating. Without this, it is impossible to determine whether the ratings are unsupervised, supervised on query difficulty, or otherwise, and whether they avoid the data-sensitivity the paper attributes to prior SFT methods.

Authors: We agree that the method section would be improved by explicitly stating the training objectives. In the revision, we will provide the loss functions used for DRL and JRL, detailing the supervision signals employed and how they differ from standard SFT approaches. revision: yes
Referee: [Experiments / Results] Results section: no ablation, correlation analysis, or threshold-validation statistics are provided to show that the rating thresholds separate instances the target model solves from those it does not. This is load-bearing for the proactive-routing claim.

Authors: We acknowledge the importance of these analyses for validating the routing decisions. We will include additional ablations, correlation plots between ratings and actual solve rates, and threshold validation statistics in the revised results section to demonstrate the effectiveness of the proactive routing. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces DRL (internal confidence estimator on draft model) and JRL (joint rating for target model competence) as learned components that produce routing signals for proactive decisions before output generation. The abstract and skeptic summary describe these as trained estimators whose outputs are then used for instance-level routing decisions, with effectiveness asserted via benchmark experiments rather than by algebraic identity or self-citation. No equations, loss formulations, or prior-author uniqueness theorems are quoted that would make the claimed predictions equivalent to their training inputs by construction. The central claim therefore remains an empirical proposal whose validity rests on external validation rather than definitional reduction, consistent with the reader's low circularity assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the approach is described at the level of learned ratings without further specification.

pith-pipeline@v0.9.1-grok · 5795 in / 939 out tokens · 33577 ms · 2026-06-30T05:52:55.050256+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 41 canonical work pages · 22 internal anchors

[1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Barrak, A., Fourati, Y., Olchawa, M., Ksontini, E., Zoghlami, K.: Cargo: A framework for confidence-aware routing of large language models (2025),https: //arxiv.org/abs/2509.14899

work page arXiv 2025
[3]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J.D., Chen, D., Dao, T.: Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.B., Sifre, L., Jumper, J.: Acceler- ating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Chen, L., Gao, H., Liu, T., Huang, Z., Sung, F., Zhou, X., Wu, Y., Chang, B.: G1: Bootstrapping perception and reasoning abilities of vision-language model via reinforcement learning (2025),https://arxiv.org/abs/2505.13426

work page arXiv 2025
[6]

Chen, L., Li, L., Zhao, H., Song, Y., Vinci: R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep- Agent/R1-V(2025), accessed: 2025-02-02

2025
[7]

In: Proc

Chen, Q., Qin, L., Zhang, J., Chen, Z., Xu, X., Che, W.: M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In: Proc. of ACL (2024)

2024
[8]

Chen, Y., Ge, Y., Wang, R., Ge, Y., Cheng, J., Shan, Y., Liu, X.: Grpo-care: Consistency-awarereinforcementlearningformultimodalreasoning.arXivpreprint arXiv:2506.16141 (2025)

work page arXiv 2025
[9]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

arXiv preprint arXiv:2410.13284 (2024)

Chuang, Y.N., Sarma, P.K., Gopalan, P., Boccio, J., Bolouki, S., Hu, X., Zhou, H.: Learning to route llms with confidence tokens. arXiv preprint arXiv:2410.13284 (2024)

work page arXiv 2024
[11]

arXiv preprint arXiv:2502.04428 (2025)

Chuang, Y.N., Yu, L., Wang, G., Zhang, L., Liu, Z., Cai, X., Sui, Y., Braver- man, V., Hu, X.: Confident or seek stronger: Exploring uncertainty-based on-device llm routing from benchmarking to generalization. arXiv preprint arXiv:2502.04428 (2025)

work page arXiv 2025
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

arXiv preprint arXiv:2505.13427 (2025)

Du, L., Meng, F., Liu, Z., Zhou, Z., Luo, P., Zhang, Q., Shao, W.: Mm-prm: En- hancing multimodal mathematical reasoning with scalable step-level supervision. arXiv preprint arXiv:2505.13427 (2025)

work page arXiv 2025
[14]

arXiv preprint arXiv:2404.16710 (2024)

Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mah- moud, A., Acun, B., Agarwal, S., Roman, A., et al.: Layerskip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710 (2024)

work page arXiv 2024
[15]

arXiv preprint arXiv:2410.03834 (2024)

Feng, T., Shen, Y., You, J.: Graphrouter: A graph-based router for llm selections. arXiv preprint arXiv:2410.03834 (2024)

work page arXiv 2024
[16]

arXiv preprint arXiv:2505.21600 (2025) Proactive Routing for Efficient Visual Reasoning 17

Fu, T., Ge, Y., You, Y., Liu, E., Yuan, Z., Dai, G., Yan, S., Yang, H., Wang, Y.: R2r: Efficiently navigating divergent reasoning paths with small-large model token routing. arXiv preprint arXiv:2505.21600 (2025) Proactive Routing for Efficient Visual Reasoning 17

work page arXiv 2025
[17]

In: Duh, K., Gomez, H., Bethard, S

Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., Gurevych, I.: A survey of con- fidence estimation and calibration in large language models. In: Duh, K., Gomez, H., Bethard, S. (eds.) Proceedings of the 2024 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ...

2024
[18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Hugging Face: Open r1: A fully open reproduction of deepseek-r1 (January 2025), https://github.com/huggingface/open-r1

2025
[20]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

In: International Conference on Machine Learning

Leviathan, Y., Kalman, M., Matias, Y.: Fast inference from transformers via spec- ulative decoding. In: International Conference on Machine Learning. pp. 19274– 19286. PMLR (2023)

2023
[22]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Li, Y., Wei, F., Zhang, C., Zhang, H.: Eagle: Speculative sampling requires re- thinking feature uncertainty. arXiv preprint arXiv:2401.15077 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

arXiv preprint arXiv:2501.19324 (2025)

Liao, B., Xu, Y., Dong, H., Li, J., Monz, C., Savarese, S., Sahoo, D., Xiong, C.: Reward-guided speculative decoding for efficient llm reasoning. arXiv preprint arXiv:2501.19324 (2025)

work page arXiv 2025
[24]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023
[25]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., Lin, M.: Understand- ing r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Visual-RFT: Visual Reinforcement Fine-Tuning

Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Liu,Z.,Zang,Y.,Zou,Y.,Liang,Z.,Dong,X.,Cao,Y.,Duan,H.,Lin,D.,Wang,J.: Visual agentic reinforcement fine-tuning (2025),https://arxiv.org/abs/2505. 14246

2025
[28]

arXiv preprint arXiv:2505.12504 (2025)

Liu, Z., Meng, F., Du, L., Zhou, Z., Yu, C., Shao, W., Zhang, Q.: Cpgd: To- ward stable rule-based reinforcement learning for language models. arXiv preprint arXiv:2505.12504 (2025)

work page arXiv 2025
[29]

In: International Conference on Learning Representa- tions (ICLR) (2024)

Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: International Conference on Learning Representa- tions (ICLR) (2024)

2024
[30]

Ovis2.5 Technical Report

Lu, S., Li, Y., Xia, Y., Hu, Y., Zhao, S., Ma, Y., Wei, Z., Li, Y., Duan, L., Zhao, J., et al.: Ovis2. 5 technical report. arXiv preprint arXiv:2508.11737 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

arXiv preprint arXiv:2406.13415 (2024)

Mahaut, M., Aina, L., Czarnowska, P., Hardalov, M., Müller, T., Màrquez, L.: Factual confidence of llms: on reliability and robustness of current estimators. arXiv preprint arXiv:2406.13415 (2024)

work page arXiv 2024
[32]

In: Findings of the association for computational linguistics: ACL 2022

Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022)

2022
[33]

In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021) 18 Y. Zhou et al

2021
[34]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., Zhang, K., Luo, P., Qiao, Y., Zhang, Q., Shao, W.: Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

In: Proceedings of the29thACMInternationalConferenceonArchitecturalSupportforProgramming Languages and Operating Systems, Volume 3

Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Zhang, Z., Wong, R.Y.Y., Zhu, A., Yang, L., Shi, X., et al.: Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In: Proceedings of the29thACMInternationalConferenceonArchitecturalSupportforProgramming Languages and Operating Systems, Volume 3. p...

2024
[36]

RouteLLM: Learning to Route LLMs with Preference Data

Ong, I., Almahairi, A., Wu, V., Chiang, W.L., Wu, T., Gonzalez, J.E., Kadous, M.W., Stoica, I.: Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., Yang, X.: Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond.arXiv preprint arXiv:2503.21614, 2025

Qu, X., Li, Y., Su, Z., Sun, W., Yan, J., Liu, D., Cui, G., Liu, D., Liang, S., He, J., et al.: A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614 (2025)

work page arXiv 2025
[39]

Advances in Neural Information Processing Systems36, 53728–53741 (2023)

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems36, 53728–53741 (2023)

2023
[40]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

org/abs/2507.06167

Shen, W., Pei, J., Peng, Y., Song, X., Liu, Y., Peng, J., Sun, H., Hao, Y., Wang, P., Zhang, J., Zhou, Y.: Skywork-r1v3 technical report (2025),https://arxiv. org/abs/2507.06167

work page arXiv 2025
[44]

Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentiviz- ing self-reflection of vision-language models with reinforcement learning (2025), https://arxiv.org/abs/2504.08837

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

arXiv preprint arXiv:2505.02865 (2025)

Wang, Z., Wang, J., Pan, J., Xia, X., Zhen, H., Yuan, M., Hao, J., Wu, F.: Ac- celerating large language model reasoning via speculative search. arXiv preprint arXiv:2505.02865 (2025)

work page arXiv 2025
[46]

In: Proceedings of the 31st ACM International Conference on Multimedia

Yang, S., Zhou, Y., Zheng, Z., Wang, Y., Zhu, L., Wu, Y.: Towards unified text- based person retrieval: A large-scale multi-attribute and language search bench- mark. In: Proceedings of the 31st ACM International Conference on Multimedia. p. 4492–4501. MM ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/3581783.36...

work page doi:10.1145/3581783.3611709 2023
[47]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Yang, Y., Zhou, Y., Chen, Y., Zhang, Z., Ma, Z., Yuan, C., Li, B., Gao, J., Hu, W.: Beyond semantic search: Towards referential anchoring in composed image retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 31155–31165 (June 2026)

2026
[48]

Yang, Y., Zhou, Y., Chen, Y., Zhang, Z., Ma, Z., Yuan, C., Li, B., Song, L., Gao, J., Li, P., Hu, W.: Detailfusion: A dual-branch framework with detail enhancement for composed image retrieval (2025),https://arxiv.org/abs/2505.17796 Proactive Routing for Efficient Visual Reasoning 19

work page arXiv 2025
[49]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Zhang, J., Huang, J., Yao, H., Liu, S., Zhang, X., Lu, S., Tao, D.: R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K.W., Gao, P., et al.: Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV)

Zhou, Y., Chen, Y., Lin, H., Wu, Y., Yang, S., Qi, Z., Ma, C., Zhu, L.: Dogr: Towards versatile visual document grounding and referring. In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3596–3606 (2025). https://doi.org/10.1109/ICCV51701.2025.00343

work page doi:10.1109/iccv51701.2025.00343 2025
[53]

IEEE Transactions on Multimedia27, 7510–7521 (2025).https://doi.org/10.1109/TMM.2025.3599088

Zhou, Y., Wang, Y., Lin, H., Ma, C., Zhu, L., Zheng, Z.: Scale up composed image retrieval learning via modification text generation. IEEE Transactions on Multimedia27, 7510–7521 (2025).https://doi.org/10.1109/TMM.2025.3599088

work page doi:10.1109/tmm.2025.3599088 2025
[54]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) Proactive Routing for Efficient Visual Reasoning 1 Fig. 9:Score distribution comparison on ChartQA and ours out-domain dis...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Since I can't view the image, perhaps the problem is standard

Since the user's question here might be similar to their example, maybe the area is 4? Wait, I need to be careful here. Since I can't view the image, perhaps the problem is standard. Wait, maybe the figure is a 3x3 grid but with some squares missing? Alternatively, maybe it's a 2x2 square, so area 4. But the user's previous example had an answer of 2. May...
[56]

unit squares\

Alternatively, maybe it's a more complex figure. Since the hint says the answer is an integer, and given the example format, maybe the correct answer is 4. Wait, maybe the figure is two squares side by side, making the area 2. Hmm, this is tricky. Since the user's instruction mentions that the answer should be in the form like 1, 2, 3, and given that the ...

2048

[1] [1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Barrak, A., Fourati, Y., Olchawa, M., Ksontini, E., Zoghlami, K.: Cargo: A framework for confidence-aware routing of large language models (2025),https: //arxiv.org/abs/2509.14899

work page arXiv 2025

[3] [3]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J.D., Chen, D., Dao, T.: Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.B., Sifre, L., Jumper, J.: Acceler- ating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Chen, L., Gao, H., Liu, T., Huang, Z., Sung, F., Zhou, X., Wu, Y., Chang, B.: G1: Bootstrapping perception and reasoning abilities of vision-language model via reinforcement learning (2025),https://arxiv.org/abs/2505.13426

work page arXiv 2025

[6] [6]

Chen, L., Li, L., Zhao, H., Song, Y., Vinci: R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep- Agent/R1-V(2025), accessed: 2025-02-02

2025

[7] [7]

In: Proc

Chen, Q., Qin, L., Zhang, J., Chen, Z., Xu, X., Che, W.: M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In: Proc. of ACL (2024)

2024

[8] [8]

Chen, Y., Ge, Y., Wang, R., Ge, Y., Cheng, J., Shan, Y., Liu, X.: Grpo-care: Consistency-awarereinforcementlearningformultimodalreasoning.arXivpreprint arXiv:2506.16141 (2025)

work page arXiv 2025

[9] [9]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

arXiv preprint arXiv:2410.13284 (2024)

Chuang, Y.N., Sarma, P.K., Gopalan, P., Boccio, J., Bolouki, S., Hu, X., Zhou, H.: Learning to route llms with confidence tokens. arXiv preprint arXiv:2410.13284 (2024)

work page arXiv 2024

[11] [11]

arXiv preprint arXiv:2502.04428 (2025)

Chuang, Y.N., Yu, L., Wang, G., Zhang, L., Liu, Z., Cai, X., Sui, Y., Braver- man, V., Hu, X.: Confident or seek stronger: Exploring uncertainty-based on-device llm routing from benchmarking to generalization. arXiv preprint arXiv:2502.04428 (2025)

work page arXiv 2025

[12] [12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

arXiv preprint arXiv:2505.13427 (2025)

Du, L., Meng, F., Liu, Z., Zhou, Z., Luo, P., Zhang, Q., Shao, W.: Mm-prm: En- hancing multimodal mathematical reasoning with scalable step-level supervision. arXiv preprint arXiv:2505.13427 (2025)

work page arXiv 2025

[14] [14]

arXiv preprint arXiv:2404.16710 (2024)

Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mah- moud, A., Acun, B., Agarwal, S., Roman, A., et al.: Layerskip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710 (2024)

work page arXiv 2024

[15] [15]

arXiv preprint arXiv:2410.03834 (2024)

Feng, T., Shen, Y., You, J.: Graphrouter: A graph-based router for llm selections. arXiv preprint arXiv:2410.03834 (2024)

work page arXiv 2024

[16] [16]

arXiv preprint arXiv:2505.21600 (2025) Proactive Routing for Efficient Visual Reasoning 17

Fu, T., Ge, Y., You, Y., Liu, E., Yuan, Z., Dai, G., Yan, S., Yang, H., Wang, Y.: R2r: Efficiently navigating divergent reasoning paths with small-large model token routing. arXiv preprint arXiv:2505.21600 (2025) Proactive Routing for Efficient Visual Reasoning 17

work page arXiv 2025

[17] [17]

In: Duh, K., Gomez, H., Bethard, S

Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., Gurevych, I.: A survey of con- fidence estimation and calibration in large language models. In: Duh, K., Gomez, H., Bethard, S. (eds.) Proceedings of the 2024 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ...

2024

[18] [18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Hugging Face: Open r1: A fully open reproduction of deepseek-r1 (January 2025), https://github.com/huggingface/open-r1

2025

[20] [20]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

In: International Conference on Machine Learning

Leviathan, Y., Kalman, M., Matias, Y.: Fast inference from transformers via spec- ulative decoding. In: International Conference on Machine Learning. pp. 19274– 19286. PMLR (2023)

2023

[22] [22]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Li, Y., Wei, F., Zhang, C., Zhang, H.: Eagle: Speculative sampling requires re- thinking feature uncertainty. arXiv preprint arXiv:2401.15077 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

arXiv preprint arXiv:2501.19324 (2025)

Liao, B., Xu, Y., Dong, H., Li, J., Monz, C., Savarese, S., Sahoo, D., Xiong, C.: Reward-guided speculative decoding for efficient llm reasoning. arXiv preprint arXiv:2501.19324 (2025)

work page arXiv 2025

[24] [24]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

2023

[25] [25]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., Lin, M.: Understand- ing r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Visual-RFT: Visual Reinforcement Fine-Tuning

Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Liu,Z.,Zang,Y.,Zou,Y.,Liang,Z.,Dong,X.,Cao,Y.,Duan,H.,Lin,D.,Wang,J.: Visual agentic reinforcement fine-tuning (2025),https://arxiv.org/abs/2505. 14246

2025

[28] [28]

arXiv preprint arXiv:2505.12504 (2025)

Liu, Z., Meng, F., Du, L., Zhou, Z., Yu, C., Shao, W., Zhang, Q.: Cpgd: To- ward stable rule-based reinforcement learning for language models. arXiv preprint arXiv:2505.12504 (2025)

work page arXiv 2025

[29] [29]

In: International Conference on Learning Representa- tions (ICLR) (2024)

Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: International Conference on Learning Representa- tions (ICLR) (2024)

2024

[30] [30]

Ovis2.5 Technical Report

Lu, S., Li, Y., Xia, Y., Hu, Y., Zhao, S., Ma, Y., Wei, Z., Li, Y., Duan, L., Zhao, J., et al.: Ovis2. 5 technical report. arXiv preprint arXiv:2508.11737 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

arXiv preprint arXiv:2406.13415 (2024)

Mahaut, M., Aina, L., Czarnowska, P., Hardalov, M., Müller, T., Màrquez, L.: Factual confidence of llms: on reliability and robustness of current estimators. arXiv preprint arXiv:2406.13415 (2024)

work page arXiv 2024

[32] [32]

In: Findings of the association for computational linguistics: ACL 2022

Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022)

2022

[33] [33]

In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021) 18 Y. Zhou et al

2021

[34] [34]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., Zhang, K., Luo, P., Qiao, Y., Zhang, Q., Shao, W.: Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

In: Proceedings of the29thACMInternationalConferenceonArchitecturalSupportforProgramming Languages and Operating Systems, Volume 3

Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Zhang, Z., Wong, R.Y.Y., Zhu, A., Yang, L., Shi, X., et al.: Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In: Proceedings of the29thACMInternationalConferenceonArchitecturalSupportforProgramming Languages and Operating Systems, Volume 3. p...

2024

[36] [36]

RouteLLM: Learning to Route LLMs with Preference Data

Ong, I., Almahairi, A., Wu, V., Chiang, W.L., Wu, T., Gonzalez, J.E., Kadous, M.W., Stoica, I.: Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., Yang, X.: Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond.arXiv preprint arXiv:2503.21614, 2025

Qu, X., Li, Y., Su, Z., Sun, W., Yan, J., Liu, D., Cui, G., Liu, D., Liang, S., He, J., et al.: A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614 (2025)

work page arXiv 2025

[39] [39]

Advances in Neural Information Processing Systems36, 53728–53741 (2023)

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems36, 53728–53741 (2023)

2023

[40] [40]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

org/abs/2507.06167

Shen, W., Pei, J., Peng, Y., Song, X., Liu, Y., Peng, J., Sun, H., Hao, Y., Wang, P., Zhang, J., Zhou, Y.: Skywork-r1v3 technical report (2025),https://arxiv. org/abs/2507.06167

work page arXiv 2025

[44] [44]

Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentiviz- ing self-reflection of vision-language models with reinforcement learning (2025), https://arxiv.org/abs/2504.08837

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

arXiv preprint arXiv:2505.02865 (2025)

Wang, Z., Wang, J., Pan, J., Xia, X., Zhen, H., Yuan, M., Hao, J., Wu, F.: Ac- celerating large language model reasoning via speculative search. arXiv preprint arXiv:2505.02865 (2025)

work page arXiv 2025

[46] [46]

In: Proceedings of the 31st ACM International Conference on Multimedia

Yang, S., Zhou, Y., Zheng, Z., Wang, Y., Zhu, L., Wu, Y.: Towards unified text- based person retrieval: A large-scale multi-attribute and language search bench- mark. In: Proceedings of the 31st ACM International Conference on Multimedia. p. 4492–4501. MM ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/3581783.36...

work page doi:10.1145/3581783.3611709 2023

[47] [47]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Yang, Y., Zhou, Y., Chen, Y., Zhang, Z., Ma, Z., Yuan, C., Li, B., Gao, J., Hu, W.: Beyond semantic search: Towards referential anchoring in composed image retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 31155–31165 (June 2026)

2026

[48] [48]

Yang, Y., Zhou, Y., Chen, Y., Zhang, Z., Ma, Z., Yuan, C., Li, B., Song, L., Gao, J., Li, P., Hu, W.: Detailfusion: A dual-branch framework with detail enhancement for composed image retrieval (2025),https://arxiv.org/abs/2505.17796 Proactive Routing for Efficient Visual Reasoning 19

work page arXiv 2025

[49] [49]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Zhang, J., Huang, J., Yao, H., Liu, S., Zhang, X., Lu, S., Tao, D.: R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K.W., Gao, P., et al.: Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV)

Zhou, Y., Chen, Y., Lin, H., Wu, Y., Yang, S., Qi, Z., Ma, C., Zhu, L.: Dogr: Towards versatile visual document grounding and referring. In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3596–3606 (2025). https://doi.org/10.1109/ICCV51701.2025.00343

work page doi:10.1109/iccv51701.2025.00343 2025

[53] [53]

IEEE Transactions on Multimedia27, 7510–7521 (2025).https://doi.org/10.1109/TMM.2025.3599088

Zhou, Y., Wang, Y., Lin, H., Ma, C., Zhu, L., Zheng, Z.: Scale up composed image retrieval learning via modification text generation. IEEE Transactions on Multimedia27, 7510–7521 (2025).https://doi.org/10.1109/TMM.2025.3599088

work page doi:10.1109/tmm.2025.3599088 2025

[54] [54]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) Proactive Routing for Efficient Visual Reasoning 1 Fig. 9:Score distribution comparison on ChartQA and ours out-domain dis...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Since I can't view the image, perhaps the problem is standard

Since the user's question here might be similar to their example, maybe the area is 4? Wait, I need to be careful here. Since I can't view the image, perhaps the problem is standard. Wait, maybe the figure is a 3x3 grid but with some squares missing? Alternatively, maybe it's a 2x2 square, so area 4. But the user's previous example had an answer of 2. May...

[56] [56]

unit squares\

Alternatively, maybe it's a more complex figure. Since the hint says the answer is an integer, and given the example format, maybe the correct answer is 4. Wait, maybe the figure is two squares side by side, making the area 2. Hmm, this is tricky. Since the user's instruction mentions that the answer should be in the form like 1, 2, 3, and given that the ...

2048