Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
Pith reviewed 2026-05-08 19:15 UTC · model grok-4.3
The pith
A framework turns fragile multimodal in-context learning into explicit inductive-deductive rule extraction for vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that multimodal ICL fails primarily because of an inductive gap: models do not extract consistent rules across visual demonstrations, a failure compounded by redundant visual tokens and imbalanced attention. The proposed remedy recasts ICL as an explicit inductive-deductive pipeline that compresses visual tokens by similarity, rebalances attention dynamically, and forces the model to analyze examples, induce a rule, and deductively apply it, reinforced by auxiliary reinforcement learning that rewards verifiable rule use.
What carries the argument
The inductive-deductive chain-of-thought process, which separates per-example analysis, general-rule derivation, and query application, supported by similarity-based visual token compression and dynamic attention rebalancing.
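The compression module's exact criterion is not given in this summary; a minimal sketch of similarity-based token pruning, assuming greedy cosine-similarity deduplication of patch embeddings with a hypothetical threshold `tau` (both the function name and the greedy strategy are illustrative, not the paper's actual algorithm), could look like:

```python
import math

def compress_tokens(tokens, tau=0.9):
    """Greedily keep patch tokens whose cosine similarity to every
    already-kept token stays below tau (hypothetical threshold).
    Near-duplicate patches collapse onto one representative."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    kept = []
    for t in tokens:
        # a token survives only if it is not redundant w.r.t. kept ones
        if all(cos(t, k) < tau for k in kept):
            kept.append(t)
    return kept
```

Under this sketch, two nearly parallel patch embeddings reduce to one, which is the kind of redundancy filtering the abstract attributes to the compression module.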
If this is right
- Vision-language models will produce correct answers grounded in extracted rules rather than flawed reasoning across visual perception, logical reasoning, STEM, and sarcasm tasks.
- Reducing redundant visual tokens and rebalancing attention will allow later demonstrations to influence reasoning as much as the first image.
- The auxiliary reinforcement-learning stage will increase the frequency of faithful citation of the induced rule and filtering of irrelevant visual noise.
- The same framework will deliver consistent gains when applied to multiple open-source vision-language models.
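The rebalancing mechanism is described only at a high level; a simplified sketch, assuming the goal is equal total attention mass per demonstration image (the uniform target and function name are assumptions, not the paper's dynamic scheme), might be:

```python
def rebalance_attention(weights, image_ids):
    """Rescale attention weights so each image's tokens receive equal
    total mass, then renormalize the whole distribution to sum to 1.
    weights: per-token attention; image_ids: which image owns each token."""
    images = sorted(set(image_ids))
    totals = {i: sum(w for w, g in zip(weights, image_ids) if g == i)
              for i in images}
    target = 1.0 / len(images)  # equitable share per image
    scaled = [w * target / totals[g] for w, g in zip(weights, image_ids)]
    s = sum(scaled)
    return [w / s for w in scaled]
```

With a first-image-skewed input such as `[0.6, 0.2, 0.1, 0.1]` over two images, the output gives each image half the total mass, which is the behavior the "later demonstrations influence reasoning as much as the first image" bullet predicts.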
Where Pith is reading between the lines
- The same inductive-deductive separation could be tested in pure text in-context learning to see whether explicit rule derivation improves consistency there as well.
- The token-compression step might reduce context length enough to allow longer demonstration sets without exceeding model limits.
- If the rebalancing mechanism proves general, similar attention adjustments could be applied to other multi-image or multi-turn vision tasks outside of in-context learning.
Load-bearing premise
That the performance gains come from genuine rule extraction enabled by the new modules rather than from incidental effects of the added training or prompting steps.
What would settle it
A controlled test that measures whether the model, when prompted to state the derived rule after the analysis step, produces rules that accurately capture the shared pattern in the demonstrations and that using those rules predicts final answer correctness better than standard ICL.
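Such a controlled test could be scored with a simple conditional-accuracy probe: does answer accuracy differ sharply between cases where the stated rule is faithful to the demonstrations and cases where it is not? The record format and function name below are hypothetical:

```python
def rule_predictiveness(records):
    """records: list of (rule_faithful: bool, answer_correct: bool) pairs,
    one per query, with rule faithfulness judged against the demonstrations.
    Returns accuracy conditioned on faithful vs. unfaithful rules; a large
    gap suggests answers actually flow through the induced rule."""
    faithful = [c for f, c in records if f]
    unfaithful = [c for f, c in records if not f]
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return acc(faithful), acc(unfaithful)
```

If faithful-rule accuracy substantially exceeds unfaithful-rule accuracy, and exceeds the standard-ICL baseline, the load-bearing premise above gains direct support; a near-zero gap would point to incidental training or prompting effects instead.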
read the original abstract
In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multimodal ICL in VLMs is limited by an 'inductive gap' (correct answers from flawed reasoning, failure to extract consistent rules across demonstrations), worsened by redundant visual tokens and skewed attention favoring the first image. It proposes a framework restructuring ICL as inductive-deductive: similarity-based visual token compression to filter redundant patches, dynamic attention rebalancing for equitable focus, and a CoT paradigm that analyzes examples, derives a generalizable rule, then applies it to the query. An auxiliary SFT+RL pipeline with verifiable rewards reinforces faithful citation and noise filtering. Evaluations on eight benchmarks (visual perception, logical reasoning, STEM, sarcasm detection) show consistent significant gains over standard ICL for open-source VLMs.
Significance. If the gains are shown to stem specifically from the inductive-deductive CoT enabling rule extraction (rather than from token compression or attention fixes), the work would meaningfully advance multimodal ICL by offering a principled separation of induction and deduction. The verifiable-reward RL component is a constructive element for controllable training. The result would highlight a path toward more reliable generalization from visual demonstrations, with potential impact on reasoning-heavy VLM applications.
major comments (2)
- The central claim that the framework equips models with 'genuine inductive capabilities' (abstract) is load-bearing but unsupported by isolating evidence. The three modules (token compression, attention rebalancing, inductive-deductive CoT) are bundled; without ablations that retain compression/rebalancing while removing the CoT (analyze-derive-apply) step, or direct probes of rule fidelity (e.g., consistency of derived rules across held-out demonstration sets or counterfactual query tests), benchmark gains cannot be attributed to rule extraction rather than noise reduction. The RL 'verifiable rewards' for citation and filtering are described but lack explicit rule-consistency metrics.
- [§5] §5 (Experiments): the abstract asserts 'consistent and significant improvements' across eight benchmarks for multiple VLMs, yet the manuscript supplies no statistical significance tests, run-to-run variance, or comparisons against strong baselines that apply only compression/rebalancing without the CoT. This weakens the cross-benchmark claim and leaves open whether the inductive-deductive component is necessary.
minor comments (2)
- The term 'inductive gap' is introduced without a precise operational definition or contrast to standard ICL failure modes; a short formalization or illustrative example in the introduction would improve clarity.
- The eight benchmarks are referenced in the abstract but not enumerated with citations or task descriptions in the provided text; this should be added to §1 or §5 for reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive review and for recognizing the potential of our inductive-deductive framework in advancing multimodal in-context learning. We address the major comments point by point below. We agree that additional ablations and statistical analyses will strengthen the claims and will incorporate these in the revised manuscript.
read point-by-point responses
-
Referee: The central claim that the framework equips models with 'genuine inductive capabilities' (abstract) is load-bearing but unsupported by isolating evidence. The three modules (token compression, attention rebalancing, inductive-deductive CoT) are bundled; without ablations that retain compression/rebalancing while removing the CoT (analyze-derive-apply) step, or direct probes of rule fidelity (e.g., consistency of derived rules across held-out demonstration sets or counterfactual query tests), benchmark gains cannot be attributed to rule extraction rather than noise reduction. The RL 'verifiable rewards' for citation and filtering are described but lack explicit rule-consistency metrics.
Authors: We thank the referee for highlighting this critical aspect. Our framework is designed such that the inductive-deductive CoT is the core mechanism for enabling rule extraction, with compression and rebalancing serving as supporting visual-level enhancements. However, we acknowledge that the current presentation bundles the components. In the revision, we will add dedicated ablations that apply token compression and attention rebalancing without the analyze-derive-apply CoT steps, allowing direct comparison to the full model. Furthermore, we will include direct probes of rule fidelity, such as consistency checks of derived rules on held-out demonstration sets and evaluations on counterfactual query tests. For the RL pipeline, we will augment the reporting with explicit rule-consistency metrics alongside the verifiable rewards for citation and filtering. revision: yes
-
Referee: §5 (Experiments): the abstract asserts 'consistent and significant improvements' across eight benchmarks for multiple VLMs, yet the manuscript supplies no statistical significance tests, run-to-run variance, or comparisons against strong baselines that apply only compression/rebalancing without the CoT. This weakens the cross-benchmark claim and leaves open whether the inductive-deductive component is necessary.
Authors: We agree that rigorous statistical validation and targeted baselines are essential for substantiating the cross-benchmark claims. The revised manuscript will include statistical significance tests (such as t-tests across multiple random seeds) and report run-to-run variance (mean ± std) for all key results. Additionally, we will introduce strong baselines that incorporate only the similarity-based visual token compression and dynamic attention rebalancing without the inductive-deductive CoT, to explicitly demonstrate the necessity of the reasoning paradigm. revision: yes
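The promised reporting (mean ± std over seeds plus a significance test) can be sketched minimally; the snippet below assumes per-seed scores for two systems and uses a standard paired t-statistic, since the manuscript's exact protocol is not specified:

```python
import math
import statistics as st

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return st.mean(scores), st.stdev(scores)

def paired_t(a, b):
    """Paired t-statistic for two systems evaluated on the same seeds
    (illustrative; degrees of freedom are len(a) - 1)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return st.mean(d) / (st.stdev(d) / math.sqrt(n))
```

Reporting t alongside mean ± std per benchmark would directly address the referee's run-to-run-variance concern; in practice one would likely reach for `scipy.stats.ttest_rel` rather than hand-rolling the statistic.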
Circularity Check
No circularity: empirical framework with independent evaluation
full rationale
The paper introduces an empirical framework consisting of three modules (visual token compression, attention rebalancing, and inductive-deductive CoT) plus an auxiliary SFT+RL pipeline, then reports benchmark improvements. No mathematical derivation, prediction, or first-principles result is claimed that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmark evaluations rather than tautological redefinitions or self-referential loops, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Models often produce correct answers from flawed reasoning in multimodal ICL.
- domain assumption: Redundant visual tokens obscure textual cues, and attention is skewed toward the initial image.
Reference graph
Works this paper leans on
- [1] Anthropic. Claude. https://claude.ai/, 2026.
- [2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- [3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [5] Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, and Benjamin Piwowarski. What makes multimodal in-context learning work? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539–1550, 2024.
- [6] Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, and Lei Li. The role of deductive and inductive reasoning in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16780–16790, 2025.
- [7] Huanqia Cai, Yijun Yang, and Winston Hu. MM-IQ: Benchmarking human-like abstraction and reasoning in multimodal models. arXiv preprint arXiv:2502.00698, 2025.
- [8] Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, Volker Tresp, and Jindong Gu. Can multimodal large language models truly perform multimodal in-context learning? In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6000–6010. IEEE, 2025.
- [9] Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu. True multimodal in-context learning needs attention to the visual context. arXiv preprint arXiv:2507.15807, 2025.
- [10] Kewei Cheng, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Binxuan Huang, Ruirui Li, Shiyang Li, Zheng Li, Yifan Gao, Xian Li, et al. Inductive or deductive? Rethinking the fundamental reasoning abilities of LLMs. arXiv preprint arXiv:2408.00114, 2024.
- [11] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024.
- [12] Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky. Towards multimodal in-context learning for vision and language models. In European Conference on Computer Vision, pages 250–267. Springer, 2024.
- [13] Elisabetta Fersini, Francesca Gasparini, Giulia Rizzi, Aurora Saibene, Berta Chulvi, Paolo Rosso, Alyssa Lees, and Jeffrey Sorensen. SemEval-2022 Task 5: Multimedia automatic misogyny identification. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 533–549, 2022.
- [14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [15] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025.
- [16] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
- [17] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
- [18] Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, and Xuchen Song. CSVQA: A Chinese multimodal benchmark for evaluating STEM reasoning capabilities of VLMs. arXiv preprint arXiv:2505.24120, 2025.
- [19] Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. LLaVE: Large language and vision embedding models with hardness-weighted contrastive learning. arXiv preprint arXiv:2503.04812, 2025.
- [20] Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.
- [21] Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, and Ruixiang Tang. M2IV: Towards efficient and fine-grained multimodal in-context learning via representation engineering. arXiv preprint arXiv:2504.04633, 2025.
- [22] Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, and Ruixiang Tang. CATP: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6619–6627, 2026.
- [23] Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, et al. Make LVLMs focus: Context-aware attention modulation for better multimodal in-context learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6610–6618, 2026.
- [24] Zhuowei Li, Zihao Xu, Ligong Han, Yunhe Gao, Song Wen, Di Liu, Hao Wang, and Dimitris N. Metaxas. Implicit in-context learning. arXiv preprint arXiv:2405.14660, 2024.
- [25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [26] Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. InfiGUIAgent: A multimodal generalist GUI agent with native reasoning and reflection. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2026.
- [27] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022.
- [28] Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with Gemini.
- [29] https://blog.google/products-and-platforms/products/gemini/gemini-3/, 2025.
- [30] Shraman Pramanick, Dimitar Dimitrov, Rituparna Mukherjee, Shivam Sharma, Md Shad Akhtar, Preslav Nakov, and Tanmoy Chakraborty. Detecting harmful memes and their targets. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2783–2796, 2021.
- [31] ByteDance Seed. Seed1.8 model card: Towards generalized real-world agency. arXiv preprint arXiv:2603.20633, 2026.
- [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [33] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [34] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- [35] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.
- [36] Shuyan Sun. Meta-analysis of Cohen's kappa. Health Services and Outcomes Research Methodology, 11(3):145–163, 2011.
- [37] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
- [38] Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315, 2025.
- [39] Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Identifying and mitigating position bias of multi-image vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10599–10609, 2025.
- [40] Haoyu Wang, Sihang Jiang, Xiangru Zhu, Yuyan Chen, Xiaojun Meng, Jiansheng Wei, Yitong Wang, and Yanghua Xiao. MMIFEvol: Towards evolutionary multimodal instruction following. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 26206–26214, 2026.
- [41] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [42] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- [43] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [44] Noam Wies, Yoav Levine, and Amnon Shashua. The learnability of in-context learning. Advances in Neural Information Processing Systems, 36:36637–36651, 2023.
- [45] Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. MMSearch-R1: Incentivizing LMMs to search. arXiv preprint arXiv:2506.20670, 2025.
- [46] Youze Xue, Dian Li, and Gang Liu. Improve multi-modal embedding learning via explicit hard negative gradient amplifying. arXiv preprint arXiv:2506.02020, 2025.
- [47] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-OneVision: Advancing generalized multimodal reasoning through cross-modal formalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2376–2385, 2025.
- [48] Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. CC-OCR: A comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21744–21754, 2025.
- [49] Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, and Jiaxing Huang. MM-DeepResearch: A simple and effective multimodal agentic search baseline. arXiv preprint arXiv:2603.01050, 2026.
- [50] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [51] Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs. arXiv preprint arXiv:2505.21327, 2025.
- [52] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
- [53] Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-Omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv preprint arXiv:2503.05379, 2025.
- [54] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. SWIFT: A scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025.
- [55] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
- [56] Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Wangbo Zhao, Jiajun Song, Chuanhao Li, Weidong Tang, et al. MDK12-Bench: A multi-discipline benchmark for evaluating reasoning in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28982–28990, 2026.
- [57] Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, and Xuming Hu. Don't just chase "highlighted tokens" in MLLMs: Revisiting visual holistic context retention. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
-
[58]
inductive correctness gap
coefficient of 0.78, demonstrating a high degree of alignment and confirming the validity of our automated evaluation pipeline. C Dataset Details VQAv2 [2]is a large-scale visual question answering benchmark. The dataset is built upon images from the Microsoft COCO dataset and contains open-ended questions that require joint understanding of visual conten...
-
[59]
Identify the core task, key visual elements, and the implicit logic required to solve it
Target Problem Analysis: First, carefully observe the Target Question and its image. Identify the core task, key visual elements, and the implicit logic required to solve it
-
[60]
Reference Case Evaluation: Analyze each reference case one by one. For each case, you must explicitly evaluate: • Visual & Textual Elements: What is shown in the image and asked in the text? • Solution Logic: What is the underlying rule, formula, or method used to solve this case? • Relevance Comparison: Compare its task type and solution logic with the T...
-
[61]
helpful". Ignore
Rule Induction: Synthesize a general problem-solving rule ONLY from the cases judged as "helpful". Ignore "unhelpful" cases completely. Use tags like "<|case 1|>" to cite them
-
[62]
I don’t know,
Final Answer: Apply the induced rule to solve the Target Question step-by-step. Use the following strict template for your output: <|begin_of_target_analysis|> [Core Task]: {What is the user asking?} [Key Visuals]: {What are the crucial elements in the target image?} [Required Logic]: {What kind of reasoning or method is likely needed?} <|end_of_target_an...
-
[63]
Decompose the standard answer into individual sub-answers (e.g., for a single answer: one sub-answer; for a list/sequence: each element is a sub-answer)
-
[64]
For each sub-answer, determine independently whether the student’s response contains a correct match (Correct or Wrong)
-
[65]
The final score = number of correct sub-answers / total number of sub-answers. Output Format Use the following XML tags: <sub_answers> List each sub-answer on a separate line in the format: [sub-answer content]: Correct or Wrong </sub_answers> <reason>1–3 concise sentences summarizing the grading result.</reason> independently, this prompt guarantees a hi...
-
[66]
The black triangles are consistently oriented to the left
Observe the first group of three stars. The black triangles are consistently oriented to the left
-
[67]
The black triangles are consistently oriented to the right
Observe the second group of three stars. The black triangles are consistently oriented to the right
-
[68]
Apply the extracted rule: The entire sequence is governed by a consistent geometric transformation applied to the entire star shape
-
[69]
Option B shows the black triangles oriented to the right, matching the second group
Evaluate the options: Option A shows the black triangles oriented to the left, matching the first group. Option B shows the black triangles oriented to the right, matching the second group. Option C shows the black triangles oriented to the left, but the overall shape is rotated clockwise compared to the first group. Option D shows the black triangles ori...
-
[70]
[Final Result]: B <|end_of_answer|> Figure 8: Case study on the MMIQ dataset
Conclusion: Option B is the only option that perfectly matches the geometric transformation required to complete the sequence. [Final Result]: B <|end_of_answer|> Figure 8: Case study on the MMIQ dataset. Vanilla Gemini-3.1-pro and Qwen3-VL fail to reach the correct answer, while our MMInduction induces the underlying rule from retrieved demonstrations and ...
-
[71]
**Analyze the diagrams (Stages A and B):** * **Figure A (Jia):** The cell has 4 chromosomes and 8 nuclear DNA molecules. Since the normal somatic cell of a fruit fly has 2n = 8 chromosomes, a cell with 4 chromosomes and 8 DNA molecules indicates that homologous chromosomes have separated, but sister chromatids have not. This corresponds to a secondary sper...
-
[72]
During Meiosis I, homologous chromosomes (X and Y) separate into different secondary spermatocytes
**Analyze the sex chromosomes (Option B):** * A male fruit fly has sex chromosomes XY. During Meiosis I, homologous chromosomes (X and Y) separate into different secondary spermatocytes. * Thus, a secondary spermatocyte will contain either one X or one Y chromosome. * In stage B (Anaphase II), the centromeres split. A cell that received the X chromosome ...
-
[73]
A crossover occurred between non-sister chromatids of homologous chromosomes, producing 4 chromatids for this bivalent: AB, Ab, aB, and ab
**Analyze the genotypes of sperms produced (Option C):** * The primary spermatocyte has the genotype AaBb. A crossover occurred between non-sister chromatids of homologous chromosomes, producing 4 chromatids for this bivalent: AB, Ab, aB, and ab. * During Meiosis I, the homologous centromeres separate. * If the centromere attached to the A allele goes to one ...
-
[74]
**Analyze the test-cross of a male fruit fly (Option D):** * A fundamental rule in genetics is that **male Drosophila do not normally undergo crossing over** during meiosis; their linked genes exhibit complete linkage. * Although the problem states that a rare crossover occurred in *one specific spermatogonium*, the vast majority of spermatogonia will ...
-
[75]
This corresponds to the prophase I stage of meiosis when homologous chromosomes pair up and undergo crossing over
**Analyze the Bar Graphs**: The first graph (Jia) shows 4 chromosomes and 8 DNA molecules, indicating that each chromosome consists of 2 sister chromatids. This corresponds to the prophase I stage of meiosis when homologous chromosomes pair up and undergo crossing over. The second graph (Yi) shows 8 chromosomes and 8 DNA molecules, indicating that each ch...
-
[76]
At this stage, the cell contains 4 chromosomes (2 pairs of homologues) and 4 X chromosomes (2 pairs of homologues)
**Determine the Stage**: Since the number of chromosomes doubles from 4 to 8, the cell must be undergoing metaphase I . At this stage, the cell contains 4 chromosomes (2 pairs of homologues) and 4 X chromosomes (2 pairs of homologues). The cell cannot be in metaphase II because M II would show 4 chromosomes (2 pairs of sister chromatids) and 4 DNA molecules
-
[77]
Statement A is correct
**Evaluate Option A**: Both stages (prophase I and metaphase I) involve 4 pairs of sister chromatids, so the number of chromatids is always 8. Statement A is correct
-
[78]
Statement B is correct
**Evaluate Option B**: During metaphase I, the cell may contain either 0 or 2 X chromosomes depending on whether the homologous X chromosomes are aligned on opposite poles. Statement B is correct
-
[79]
Since the male fruit fly has the genotype AaBb, the other possible sperm genotypes are ab and aB
**Evaluate Option C**: The sperm genotype is Ab. Since the male fruit fly has the genotype AaBb, the other possible sperm genotypes are ab and aB. Therefore, the genotype of a sperm from another secondary spermatocyte can indeed be either ab or aB. Statement C is correct
-
[80]
**Evaluate Option D**: The given offspring segregation ratio (45:5:5:45) is characteristic of a test cross between a heterozygous recessive individual (aaBb) and a homozygous recessive individual (aabb). This indicates that the male fruit fly is heterozygous for the B locus (AaBb). However, the question asks which statement is *incorrect*. Since options A...