AIR: Adaptive Interleaved Reasoning with Code in MLLMs
Pith reviewed 2026-06-26 09:04 UTC · model grok-4.3
The pith
Reinforcement learning with a group-constrained reward teaches MLLMs to adaptively interleave code execution during reasoning on numerical tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its three-component solution—a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy that uses a group-constrained reward function—enables MLLMs to perform adaptive interleaved reasoning with code on complex numerical computation tasks, producing an average 6.1 percentage point gain on evaluation benchmarks after RL training, a 9.9 percentage point gain on interleaved reasoning samples, and a tool-use success rate above 95 percent.
What carries the argument
The group-constrained reward function that shapes the adaptive tool-invocation strategy across interleaved reasoning trajectories during reinforcement learning.
If this is right
- MLLMs gain the ability to handle numerical computation problems that previously required only visual operations.
- Accuracy on interleaved reasoning samples rises by 9.9 percentage points after training.
- Overall tool-use success rate during reasoning exceeds 95 percent.
- Models shift from heuristic-driven visual manipulation to dynamic code invocation when computation is needed.
Where Pith is reading between the lines
- The same reward structure might allow models to learn when to call other external tools such as search or simulators beyond code.
- Extending the cold-start pipeline to include more diverse numerical patterns could further widen the performance gap on mixed perception-computation benchmarks.
- If the group constraint generalizes, future systems could apply similar grouping techniques to train dynamic tool selection in non-multimodal language models.
Load-bearing premise
The two-stage cold-start data construction pipeline and data filtering strategies produce training examples that allow the group-constrained reward to successfully teach adaptive tool invocation rather than overfitting to the curated set.
What would settle it
Applying the full RL training procedure to the described dataset and observing no average improvement of 6.1 percentage points on the evaluation benchmarks, or a tool-use success rate below 95 percent, would falsify the central claim.
Figures
read the original abstract
Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: https://github.com/CongHan0808/AIR.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AIR, a three-component framework to enable adaptive interleaved reasoning with code in MLLMs beyond visual-perception tasks. It includes a two-stage cold-start data construction pipeline, data filtering for RL curation, and an adaptive tool-invocation strategy based on a group-constrained reward function. After RL training, the authors claim an average 6.1 pp improvement on evaluation benchmarks (9.9 pp on interleaved reasoning samples) with tool-use success exceeding 95%.
Significance. If the performance gains are shown to be robust against appropriate baselines and controls, the work would meaningfully extend tool-use in MLLMs to numerical computation problems, moving beyond heuristic visual manipulation.
major comments (2)
- [Abstract] Abstract: The performance claims (6.1 pp average gain, 9.9 pp on interleaved samples, >95% tool-use success) are stated without any reference to baselines, statistical significance, data splits, ablation controls, or potential post-hoc selection, rendering it impossible to assess whether the reported improvements are supported by the experiments.
- [Abstract] The central claim that the group-constrained reward successfully teaches adaptive tool invocation (rather than overfitting to the curated set) rests on the two-stage cold-start pipeline and filtering strategies, yet no details on these components, hyperparameter sensitivity, or failure modes are supplied in the provided text.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on our submission. We address the two major comments regarding the abstract below, clarifying the experimental support present in the full manuscript while agreeing to strengthen the abstract for clarity.
read point-by-point responses
-
Referee: [Abstract] Abstract: The performance claims (6.1 pp average gain, 9.9 pp on interleaved samples, >95% tool-use success) are stated without any reference to baselines, statistical significance, data splits, ablation controls, or potential post-hoc selection, rendering it impossible to assess whether the reported improvements are supported by the experiments.
Authors: We acknowledge that the abstract, due to its brevity, does not explicitly reference these elements. The full manuscript provides the requested context: Section 4 details comparisons against multiple baselines (standard MLLMs without code interleaving and prior tool-use approaches), reports results on fixed held-out data splits, includes ablation studies on the reward components, and presents mean performance with standard deviations across three random seeds to support statistical reliability. No post-hoc selection of results is performed; all reported metrics follow the pre-defined evaluation protocol described in Section 4.1. To address the concern directly in the abstract, we will revise it to briefly note the baseline comparisons and evaluation setup. revision: yes
-
Referee: [Abstract] The central claim that the group-constrained reward successfully teaches adaptive tool invocation (rather than overfitting to the curated set) rests on the two-stage cold-start pipeline and filtering strategies, yet no details on these components, hyperparameter sensitivity, or failure modes are supplied in the provided text.
Authors: The abstract is space-constrained and therefore omits these specifics. The full manuscript describes the two-stage cold-start data construction pipeline in Section 3.1, the RL curation filtering criteria in Section 3.2, and the group-constrained reward formulation together with its adaptive invocation mechanism in Section 3.3. Hyperparameter sensitivity analyses appear in the appendix, and potential failure modes (including cases of non-adaptive behavior) are examined in Section 5 with qualitative examples. These elements collectively support that the observed gains arise from the training procedure rather than overfitting. We will add a concise reference to the pipeline and reward design in the revised abstract. revision: partial
Circularity Check
No significant circularity; empirical RL results stand on reported experiments
full rationale
The supplied abstract and description contain no equations, fitted parameters presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The central claims rest on a two-stage data pipeline, filtering, and a group-constrained reward applied in RL training, followed by measured accuracy gains (6.1 pp average, 9.9 pp on interleaved samples). These are presented as experimental outcomes rather than a derivation that reduces to its own inputs by construction. Without any mathematical chain or self-referential definition in the text, the derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reinforcement learning with a suitably designed reward function can teach models to decide when and how to invoke code tools adaptively.
Reference graph
Works this paper leans on
-
[1]
URL https://api.semanticscholar. org/CorpusID:275405676. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Vi- sual instruction tuning.ArXiv, abs/2304.08485,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
URL https://api.semanticscholar. org/CorpusID:258179774. Liu, Y ., Li, S., Cao, L., Xie, Y ., Zhou, M., Dong, H., Ma, X., Han, S., and Zhang, D. Superrl: Re- inforcement learning with supervision to boost lan- guage model reasoning.ArXiv, abs/2506.01096, 2025a. URL https://api.semanticscholar. org/CorpusID:279075961. Liu, Z., Sun, Z., Zang, Y ., wen Dong,...
-
[3]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
URL https://api.semanticscholar. org/CorpusID:264491155. Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., Zhang, K., Luo, P., Qiao, Y ., Zhang, Q., and Shao, W. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025. OpenAI. Learning to...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
URL https://api.semanticscholar. org/CorpusID:270870136. Qiu, H., Lan, X., Liu, F., Sun, X., Ruan, D., Shi, P., and Ma, L. Metis-rise: Rl incentivizes and sft enhances multimodal reasoning model learning.arXiv preprint arXiv:2506.13056, 2025. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ArXiv...
-
[5]
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
URL https://api.semanticscholar. org/CorpusID:267412607. Singh, J., Magazine, R., Pandya, Y ., and Nambi, A. U. Agentic reasoning and tool integration for llms via reinforcement learning.ArXiv, abs/2505.01441,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
URL https://api.semanticscholar. org/CorpusID:278327533. Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025. Team, Q. Qwen3-max: Just scale it, September 2025. Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., and Chen, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
URL https://api.semanticscholar. org/CorpusID:279410727. Wu, M., Yang, J., Jiang, J., Li, M., Yan, K., Yu, H., Zhang, M., Zhai, C., and Nahrstedt, K. Vtool-r1: Vlms learn to think with images via reinforcement learning on multi- modal tool use, 2025. URL https://arxiv.org/ abs/2505.19255. Xiao, Y ., Sun, E., Liu, T., and Wang, W. Log- icvista: Multimodal ...
-
[8]
URL https://api.semanticscholar. org/CorpusID:271050597. Xue, Z., Zheng, L., Liu, Q., Li, Y ., Zheng, X., Ma, Z., and An, B. Simpletir: End-to-end reinforce- ment learning for multi-turn tool-integrated reason- ing.ArXiv, abs/2509.02479, 2025. URL https: //api.semanticscholar.org/CorpusID: 281080825. Yang, S., Li, J., Lai, X., Yu, B., Zhao, H., and Jia, J...
-
[9]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
URL https://api.semanticscholar. org/CorpusID:280222768. Yu, Q., Zhang, Z., Zhu, R., Yuan, Y ., Zuo, X., Yue, Y ., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y ., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Dai, W., Song, Y ., Wei, X., Zhou, H., Liu, J., Ma, W., Zhang, Y .-Q....
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
URL https://api.semanticscholar. org/CorpusID:277104124. Yue, X. L.-C. T. Z., Lin, Z., Song, Y .-H., Wang, W., Ren, S.-Q., Gu, S., Li, S., Li, P., Zhao, L., Li, L., Bao, K., Tian, H., Zhang, H., Wang, G., Zhu, D., Cici, He, C., Ye, B., Shen, B., Zhang, Z., Jiang, Z.-A., Zheng, Z., Song, Z., Luo, Z., Yu, Y ., Wang, Y ., Tian, Y ., Tu, Y ., Yan, Y ., Huang,...
-
[11]
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
URL https://api.semanticscholar. org/CorpusID:279155294. Zhang, G., Geng, H., Yu, X., Yin, Z., Zhang, Z., Tan, Z., Zhou, H., Li, Z., Xue, X., Li, Y ., Zhou, Y ., Chen, Y ., Zhang, C., Fan, Y ., Wang, Z., Huang, S., Liao, Y ., Wang, H., Yang, M., Ji, H., Littman, M., Wang, J., Yan, S., Torr, P., and Bai, L. The landscape of agentic reinforcement learning f...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
org/CorpusID:273811803
URL https://api.semanticscholar. org/CorpusID:273811803. 11 AIR: Adaptive Interleaved Reasoning with Code in MLLMs A. System Prompt To activate the model’s tool-invocation capability, we designed a task-specific prompt. Build- ing on the commonly used prompts for Chain-of-Thought (CoT) data, we incorporated code- related instructions into the prompt desig...
-
[13]
Sum of squares: {sum_of_squares}
Identify the elements of the matrix A. 2. Square each element. 3. Sum all the squared elements. 4. Take the square root of the sum. The matrixAis given by: A= −2.2 0.6−4.4 −3.1 1.8−1.8 −7.3−6.9 0.5 First, we square each element of the matrix: (−2.2)2 = 4.84,0.6 2 = 0.36,(−4.4) 2 = 19.36, (−3.1)2 = 9.61,1.8 2 = 3.24,(−1.8) 2 = 3.24, (−7.3)2 = 53.29...
-
[14]
**Category 1:** - Item 1: 3 - Item 2: 2 - Item 3: 7 - The maximum value for Category 1 is 7
-
[15]
**Category 2:** - Item 1: 9 - Item 2: 5 - Item 3: 7 - The maximum value for Category 2 is 9
-
[16]
Product: {product}
**Category 3:** - Item 1: 6 - Item 2: 9 - Item 3: 1 - The maximum value for Category 3 is 9. Now, we need to find the product of the maximum values from each category: - Maximum value for Category 1: 7 - Maximum value for Category 2: 9 - Maximum value for Category 3: 9 The product is: 7×9×9 Let’s calculate this using Python: <code> ‘‘‘python # Maximum val...
-
[17]
Month with the highest average temperature after smoothing: {max_month}
Calculate the 3-month moving average for each month. 2. Identify the month with the highest smoothed temperature. Let’s start by calculating the 3-month moving average. The moving average for each month i is given by the average of the temperatures of monthi−1, monthi, and monthi+ 1. We will use Python to perform these calculations accurately. <code> ‘‘‘p...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.