Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning
Pith reviewed 2026-06-30 05:52 UTC · model grok-4.3
The pith
Joint ratings of draft and target model competence enable proactive routing of visual queries before any output generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRP introduces Draft Rating Learning to give the draft model an internal confidence estimator and Joint Rating Learning to predict target model performance on a query. These enable fine-grained proactive routing at the instance level, accelerating inference substantially while maintaining overall performance on multimodal reasoning tasks.
What carries the argument
The Proactive Routing Paradigm (PRP) using Draft Rating Learning (DRL) and Joint Rating Learning (JRL) to evaluate model competence before output.
If this is right
- Fine-grained instance-level proactive routing decisions become feasible before any thinking occurs.
- Substantial acceleration of inference is achieved without compromising overall performance.
- The target model is allocated samples it excels at rather than the hardest queries.
- Routing operates under multimodal settings without relying on post-hoc token probabilities or data-sensitive fine-tuning.
Where Pith is reading between the lines
- This routing could lower the average compute per query in production multimodal systems.
- Similar rating mechanisms might improve efficiency in other model collaboration setups like language-only tasks.
- Testing on additional benchmarks with varying query difficulties would further validate the ratings' predictive power.
Load-bearing premise
The internal confidence estimator and joint rating can accurately predict how well each model will handle unseen queries before producing any output.
What would settle it
If routing based on these ratings results in lower accuracy than using the target model alone on the same set of visual reasoning queries, the claim would be falsified.
Figures
read the original abstract
Large multimodal models have achieved strong reasoning on complex visual tasks, but their inference efficiency is often restricted by long chains of thought. A promising solution is to pair a small draft model with a large target model, enabling cooperative inference employing a routing signal that adaptively routes queries to either the draft or target model based on their difficulties for optimal efficiency and accuracy. Yet, the remaining bottleneck is to establish a reliable query difficulty signal under multimodal settings. Existing approaches designed for language models either rely on post-hoc token probabilities, which fall short in multimodal scenarios, or depend on supervised fine-tuning, which is a data-sensitive strategy. Both paradigms perform routing only after a complete output, and ignore whether the target model can actually solve the routed instances. To address this, we propose PRP, a Proactive Routing Paradigm that enables early decision-making by jointly evaluating the competence of both the draft and target models. Our Draft Rating Learning (DRL) equips the draft model with an internal confidence estimator, while Joint Rating Learning (JRL) predicts how well the target model can handle a given query, thereby prioritizing the allocation of samples it excels at rather than the hardest ones. These ratings enable fine-grained, instance-level \textbf{Proactive Routing} and substantially accelerate inference without compromising overall performance. Extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PRP, a Proactive Routing Paradigm for pairing a small draft model with a large target model in multimodal reasoning. It introduces Draft Rating Learning (DRL) to add an internal confidence estimator to the draft model and Joint Rating Learning (JRL) to predict target-model competence on a query. These ratings are intended to support instance-level proactive routing decisions before any output is generated, with the claim that this accelerates inference without accuracy loss. The abstract states that extensive experiments on multiple multimodal reasoning benchmarks validate the approach.
Significance. If the DRL and JRL ratings are shown to generalize and correlate with actual solve rates on unseen queries, the method could offer a practical way to improve inference efficiency in visual reasoning pipelines by avoiding full target-model computation on instances the target is unlikely to solve. The proactive (pre-output) nature distinguishes it from post-hoc routing methods.
major comments (3)
- [Abstract] Abstract: the claim that 'extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency' is unsupported by any reported metrics, baselines, error bars, or correlation statistics between ratings and solve rates. This directly undermines evaluation of the central claim that the ratings enable acceleration without compromising performance.
- [Method] Method section (DRL/JRL descriptions): no training objective, loss formulation, or supervision signal is specified for learning the internal confidence estimator or the joint rating. Without this, it is impossible to determine whether the ratings are unsupervised, supervised on query difficulty, or otherwise, and whether they avoid the data-sensitivity the paper attributes to prior SFT methods.
- [Experiments / Results] Results section: no ablation, correlation analysis, or threshold-validation statistics are provided to show that the rating thresholds separate instances the target model solves from those it does not. This is load-bearing for the proactive-routing claim.
minor comments (2)
- [Method] Notation for DRL and JRL is introduced without an explicit equation defining how the ratings are combined into the routing decision.
- [Abstract / Experiments] The abstract refers to 'multimodal reasoning benchmarks' without naming them; the experiments section should list the specific datasets (e.g., MMMU, MathVista) and the exact metrics used.
Simulated Author's Rebuttal
We are grateful for the referee's insightful comments, which highlight areas where the manuscript can be strengthened. We provide detailed responses to each major comment below and commit to making the necessary revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'extensive experiments across multiple multimodal reasoning benchmarks validate our effectiveness and efficiency' is unsupported by any reported metrics, baselines, error bars, or correlation statistics between ratings and solve rates. This directly undermines evaluation of the central claim that the ratings enable acceleration without compromising performance.
Authors: We agree with the referee that the abstract's claim requires supporting details to be fully substantiated. We will revise the abstract to incorporate specific metrics, baselines, error bars, and references to correlation statistics between the ratings and solve rates, drawing from expanded analyses in the results section. revision: yes
-
Referee: [Method] Method section (DRL/JRL descriptions): no training objective, loss formulation, or supervision signal is specified for learning the internal confidence estimator or the joint rating. Without this, it is impossible to determine whether the ratings are unsupervised, supervised on query difficulty, or otherwise, and whether they avoid the data-sensitivity the paper attributes to prior SFT methods.
Authors: We agree that the method section would be improved by explicitly stating the training objectives. In the revision, we will provide the loss functions used for DRL and JRL, detailing the supervision signals employed and how they differ from standard SFT approaches. revision: yes
-
Referee: [Experiments / Results] Results section: no ablation, correlation analysis, or threshold-validation statistics are provided to show that the rating thresholds separate instances the target model solves from those it does not. This is load-bearing for the proactive-routing claim.
Authors: We acknowledge the importance of these analyses for validating the routing decisions. We will include additional ablations, correlation plots between ratings and actual solve rates, and threshold validation statistics in the revised results section to demonstrate the effectiveness of the proactive routing. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper introduces DRL (internal confidence estimator on draft model) and JRL (joint rating for target model competence) as learned components that produce routing signals for proactive decisions before output generation. The abstract and skeptic summary describe these as trained estimators whose outputs are then used for instance-level routing decisions, with effectiveness asserted via benchmark experiments rather than by algebraic identity or self-citation. No equations, loss formulations, or prior-author uniqueness theorems are quoted that would make the claimed predictions equivalent to their training inputs by construction. The central claim therefore remains an empirical proposal whose validity rests on external validation rather than definitional reduction, consistent with the reader's low circularity assessment.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [2]
-
[3]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J.D., Chen, D., Dao, T.: Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Accelerating Large Language Model Decoding with Speculative Sampling
Chen, C., Borgeaud, S., Irving, G., Lespiau, J.B., Sifre, L., Jumper, J.: Acceler- ating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [5]
-
[6]
Chen, L., Li, L., Zhao, H., Song, Y., Vinci: R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep- Agent/R1-V(2025), accessed: 2025-02-02
2025
-
[7]
In: Proc
Chen, Q., Qin, L., Zhang, J., Chen, Z., Xu, X., Che, W.: M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In: Proc. of ACL (2024)
2024
- [8]
-
[9]
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
arXiv preprint arXiv:2410.13284 (2024)
Chuang, Y.N., Sarma, P.K., Gopalan, P., Boccio, J., Bolouki, S., Hu, X., Zhou, H.: Learning to route llms with confidence tokens. arXiv preprint arXiv:2410.13284 (2024)
-
[11]
arXiv preprint arXiv:2502.04428 (2025)
Chuang, Y.N., Yu, L., Wang, G., Zhang, L., Liu, Z., Cai, X., Sui, Y., Braver- man, V., Hu, X.: Confident or seek stronger: Exploring uncertainty-based on-device llm routing from benchmarking to generalization. arXiv preprint arXiv:2502.04428 (2025)
-
[12]
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
arXiv preprint arXiv:2505.13427 (2025)
Du, L., Meng, F., Liu, Z., Zhou, Z., Luo, P., Zhang, Q., Shao, W.: Mm-prm: En- hancing multimodal mathematical reasoning with scalable step-level supervision. arXiv preprint arXiv:2505.13427 (2025)
-
[14]
arXiv preprint arXiv:2404.16710 (2024)
Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mah- moud, A., Acun, B., Agarwal, S., Roman, A., et al.: Layerskip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710 (2024)
-
[15]
arXiv preprint arXiv:2410.03834 (2024)
Feng, T., Shen, Y., You, J.: Graphrouter: A graph-based router for llm selections. arXiv preprint arXiv:2410.03834 (2024)
-
[16]
arXiv preprint arXiv:2505.21600 (2025) Proactive Routing for Efficient Visual Reasoning 17
Fu, T., Ge, Y., You, Y., Liu, E., Yuan, Z., Dai, G., Yan, S., Yang, H., Wang, Y.: R2r: Efficiently navigating divergent reasoning paths with small-large model token routing. arXiv preprint arXiv:2505.21600 (2025) Proactive Routing for Efficient Visual Reasoning 17
-
[17]
In: Duh, K., Gomez, H., Bethard, S
Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., Gurevych, I.: A survey of con- fidence estimation and calibration in large language models. In: Duh, K., Gomez, H., Bethard, S. (eds.) Proceedings of the 2024 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ...
2024
-
[18]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Hugging Face: Open r1: A fully open reproduction of deepseek-r1 (January 2025), https://github.com/huggingface/open-r1
2025
-
[20]
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
In: International Conference on Machine Learning
Leviathan, Y., Kalman, M., Matias, Y.: Fast inference from transformers via spec- ulative decoding. In: International Conference on Machine Learning. pp. 19274– 19286. PMLR (2023)
2023
-
[22]
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Li, Y., Wei, F., Zhang, C., Zhang, H.: Eagle: Speculative sampling requires re- thinking feature uncertainty. arXiv preprint arXiv:2401.15077 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
arXiv preprint arXiv:2501.19324 (2025)
Liao, B., Xu, Y., Dong, H., Li, J., Monz, C., Savarese, S., Sahoo, D., Xiong, C.: Reward-guided speculative decoding for efficient llm reasoning. arXiv preprint arXiv:2501.19324 (2025)
-
[24]
Advances in neural information processing systems36, 34892–34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)
2023
-
[25]
Understanding R1-Zero-Like Training: A Critical Perspective
Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., Lin, M.: Understand- ing r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Visual-RFT: Visual Reinforcement Fine-Tuning
Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Liu,Z.,Zang,Y.,Zou,Y.,Liang,Z.,Dong,X.,Cao,Y.,Duan,H.,Lin,D.,Wang,J.: Visual agentic reinforcement fine-tuning (2025),https://arxiv.org/abs/2505. 14246
2025
-
[28]
arXiv preprint arXiv:2505.12504 (2025)
Liu, Z., Meng, F., Du, L., Zhou, Z., Yu, C., Shao, W., Zhang, Q.: Cpgd: To- ward stable rule-based reinforcement learning for language models. arXiv preprint arXiv:2505.12504 (2025)
-
[29]
In: International Conference on Learning Representa- tions (ICLR) (2024)
Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In: International Conference on Learning Representa- tions (ICLR) (2024)
2024
-
[30]
Lu, S., Li, Y., Xia, Y., Hu, Y., Zhao, S., Ma, Y., Wei, Z., Li, Y., Duan, L., Zhao, J., et al.: Ovis2. 5 technical report. arXiv preprint arXiv:2508.11737 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
arXiv preprint arXiv:2406.13415 (2024)
Mahaut, M., Aina, L., Czarnowska, P., Hardalov, M., Müller, T., Màrquez, L.: Factual confidence of llms: on reliability and robustness of current estimators. arXiv preprint arXiv:2406.13415 (2024)
-
[32]
In: Findings of the association for computational linguistics: ACL 2022
Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022)
2022
-
[33]
In: Proceedings of the IEEE/CVF winter conference on applications of computer vision
Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021) 18 Y. Zhou et al
2021
-
[34]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., Zhang, K., Luo, P., Qiao, Y., Zhang, Q., Shao, W.: Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
In: Proceedings of the29thACMInternationalConferenceonArchitecturalSupportforProgramming Languages and Operating Systems, Volume 3
Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Zhang, Z., Wong, R.Y.Y., Zhu, A., Yang, L., Shi, X., et al.: Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In: Proceedings of the29thACMInternationalConferenceonArchitecturalSupportforProgramming Languages and Operating Systems, Volume 3. p...
2024
-
[36]
RouteLLM: Learning to Route LLMs with Preference Data
Ong, I., Almahairi, A., Wu, V., Chiang, W.L., Wu, T., Gonzalez, J.E., Kadous, M.W., Stoica, I.: Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., Yang, X.: Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Qu, X., Li, Y., Su, Z., Sun, W., Yan, J., Liu, D., Cui, G., Liu, D., Liang, S., He, J., et al.: A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614 (2025)
-
[39]
Advances in Neural Information Processing Systems36, 53728–53741 (2023)
Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems36, 53728–53741 (2023)
2023
-
[40]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Shen, W., Pei, J., Peng, Y., Song, X., Liu, Y., Peng, J., Sun, H., Hao, Y., Wang, P., Zhang, J., Zhou, Y.: Skywork-r1v3 technical report (2025),https://arxiv. org/abs/2507.06167
-
[44]
Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentiviz- ing self-reflection of vision-language models with reinforcement learning (2025), https://arxiv.org/abs/2504.08837
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
arXiv preprint arXiv:2505.02865 (2025)
Wang, Z., Wang, J., Pan, J., Xia, X., Zhen, H., Yuan, M., Hao, J., Wu, F.: Ac- celerating large language model reasoning via speculative search. arXiv preprint arXiv:2505.02865 (2025)
-
[46]
In: Proceedings of the 31st ACM International Conference on Multimedia
Yang, S., Zhou, Y., Zheng, Z., Wang, Y., Zhu, L., Wu, Y.: Towards unified text- based person retrieval: A large-scale multi-attribute and language search bench- mark. In: Proceedings of the 31st ACM International Conference on Multimedia. p. 4492–4501. MM ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/3581783.36...
-
[47]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yang, Y., Zhou, Y., Chen, Y., Zhang, Z., Ma, Z., Yuan, C., Li, B., Gao, J., Hu, W.: Beyond semantic search: Towards referential anchoring in composed image retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 31155–31165 (June 2026)
2026
-
[48]
Yang, Y., Zhou, Y., Chen, Y., Zhang, Z., Ma, Z., Yuan, C., Li, B., Song, L., Gao, J., Li, P., Hu, W.: Detailfusion: A dual-branch framework with detail enhancement for composed image retrieval (2025),https://arxiv.org/abs/2505.17796 Proactive Routing for Efficient Visual Reasoning 19
-
[49]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Zhang, J., Huang, J., Yao, H., Liu, S., Zhang, X., Lu, S., Tao, D.: R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Zhang, R., Jiang, D., Zhang, Y., Lin, H., Guo, Z., Qiu, P., Zhou, A., Lu, P., Chang, K.W., Gao, P., et al.: Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV)
Zhou, Y., Chen, Y., Lin, H., Wu, Y., Yang, S., Qi, Z., Ma, C., Zhu, L.: Dogr: Towards versatile visual document grounding and referring. In: 2025 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3596–3606 (2025). https://doi.org/10.1109/ICCV51701.2025.00343
-
[53]
IEEE Transactions on Multimedia27, 7510–7521 (2025).https://doi.org/10.1109/TMM.2025.3599088
Zhou, Y., Wang, Y., Lin, H., Ma, C., Zhu, L., Zheng, Z.: Scale up composed image retrieval learning via modification text generation. IEEE Transactions on Multimedia27, 7510–7521 (2025).https://doi.org/10.1109/TMM.2025.3599088
-
[54]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) Proactive Routing for Efficient Visual Reasoning 1 Fig. 9:Score distribution comparison on ChartQA and ours out-domain dis...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Since I can't view the image, perhaps the problem is standard
Since the user's question here might be similar to their example, maybe the area is 4? Wait, I need to be careful here. Since I can't view the image, perhaps the problem is standard. Wait, maybe the figure is a 3x3 grid but with some squares missing? Alternatively, maybe it's a 2x2 square, so area 4. But the user's previous example had an answer of 2. May...
-
[56]
unit squares\
Alternatively, maybe it's a more complex figure. Since the hint says the answer is an integer, and given the example format, maybe the correct answer is 4. Wait, maybe the figure is two squares side by side, making the area 2. Hmm, this is tricky. Since the user's instruction mentions that the answer should be in the form like 1, 2, 3, and given that the ...
2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.