
arxiv: 2606.23678 · v1 · pith:VQAKFJ3Unew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Pith reviewed 2026-06-26 09:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords MLLMs · interleaved reasoning · reinforcement learning · code execution · tool use · numerical computation · adaptive reasoning · multimodal models

The pith

Reinforcement learning with a group-constrained reward teaches MLLMs to adaptively interleave code execution during reasoning on numerical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to move MLLMs beyond fixed heuristics for visual tool use by training them to decide on their own when to invoke code for calculations that arise in multimodal problems. It builds this capability through a two-stage data pipeline that creates cold-start examples, filters them for RL suitability, and then applies reinforcement learning guided by a reward that penalizes incorrect or unnecessary tool calls within groups of trajectories. A sympathetic reader would care because existing methods cannot handle arithmetic or other numerical operations and therefore fail on many real-world perception-plus-computation questions. The reported outcome is that the trained models show higher accuracy precisely on samples that require interleaved language and code steps.

Core claim

The paper claims that its three-component solution—a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy that uses a group-constrained reward function—enables MLLMs to perform adaptive interleaved reasoning with code on complex numerical computation tasks, producing an average 6.1 percentage point gain on evaluation benchmarks after RL training, a 9.9 percentage point gain on interleaved reasoning samples, and a tool-use success rate above 95 percent.

What carries the argument

The group-constrained reward function that shapes the adaptive tool-invocation strategy across interleaved reasoning trajectories during reinforcement learning.
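
To make the idea of a group-level constraint concrete, here is a minimal sketch of how a reward could depend on the other trajectories sampled for the same prompt. It is not the paper's formulation: the `Rollout` fields, the `bonus` and `penalty` constants, and the `text_solved_threshold` rule are all hypothetical choices made for illustration. The only thing taken from the text above is the intent to discourage failed or unnecessary tool calls within a group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    correct: bool    # final answer matches the reference
    used_code: bool  # trajectory invoked the code tool at least once
    code_ok: bool    # every code call executed without error


def group_rewards(group: List[Rollout],
                  bonus: float = 0.5,
                  penalty: float = 0.5,
                  text_solved_threshold: float = 0.5) -> List[float]:
    """Hypothetical group-level reward; all constants are illustrative."""
    text_only = [r for r in group if not r.used_code]
    # Fraction of text-only rollouts that already reach the right answer.
    text_acc = (sum(r.correct for r in text_only) / len(text_only)
                if text_only else 0.0)
    code_needed = text_acc < text_solved_threshold

    rewards = []
    for r in group:
        reward = 1.0 if r.correct else 0.0  # accuracy term
        if r.used_code:
            if not r.code_ok:
                reward -= penalty           # failed tool call
            elif r.correct and code_needed:
                reward += bonus             # tool call that helped
            elif not code_needed:
                reward -= penalty           # tool call the group did not need
        rewards.append(reward)
    return rewards
```

In this sketch a trajectory's code term cannot be computed in isolation: the same successful tool call earns a bonus when text-only siblings fail and a penalty when they already succeed.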

If this is right

  • MLLMs gain the ability to handle numerical computation problems that prior tool-use approaches, limited to visual operations, could not address.
  • Accuracy on interleaved reasoning samples rises by 9.9 percentage points after training.
  • Overall tool-use success rate during reasoning exceeds 95 percent.
  • Models shift from heuristic-driven visual manipulation to dynamic code invocation when computation is needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward structure might allow models to learn when to call other external tools such as search or simulators beyond code.
  • Extending the cold-start pipeline to include more diverse numerical patterns could further widen the performance gap on mixed perception-computation benchmarks.
  • If the group constraint generalizes, future systems could apply similar grouping techniques to train dynamic tool selection in non-multimodal language models.

Load-bearing premise

The two-stage cold-start data construction pipeline and data filtering strategies produce training examples that allow the group-constrained reward to successfully teach adaptive tool invocation rather than overfitting to the curated set.

What would settle it

Applying the full RL training procedure to the described dataset and failing to reproduce the reported average improvement of 6.1 percentage points on the evaluation benchmarks, or observing a tool-use success rate below 95 percent, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.23678 by Cong Han, Haibo Qiu, Xiaohan Lan, Yujie Zhong.

Figure 1
Figure 1. Left: Overview of AIR. Upon receiving an image and a question, AIR generates textual reasoning and optionally invokes a code tool. The code undergoes execution in a sandbox, and its output is fed back into the reasoning process. This cycle of text reasoning and code execution can be repeated multiple times until the final answer is produced. Right: Illustration of reward design with group constraints. Afte…
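
The cycle described in the Figure 1 caption can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: the `generate` callable stands in for the MLLM, the `<code>` wrapper around a Python fence follows the appendix excerpts further down this page (its closing tag is assumed), the `<output>` tag and `max_rounds` limit are hypothetical, and `run_in_sandbox` simply runs code in-process rather than in a real isolated sandbox.

```python
import contextlib
import io
import re
from typing import Callable

FENCE = "`" * 3  # built programmatically so this example stays fence-safe
CODE_PATTERN = re.compile(
    r"<code>\s*" + FENCE + r"python\n(.*?)" + FENCE + r"\s*</code>",
    re.DOTALL,
)


def run_in_sandbox(code: str) -> str:
    """Stand-in for a sandbox: runs code in-process and captures stdout.

    Not isolated; a real system would use a restricted separate process.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def interleaved_reasoning(generate: Callable[[str], str],
                          prompt: str,
                          max_rounds: int = 4) -> str:
    """Alternate text generation and code execution until no code is emitted."""
    context = prompt
    for _ in range(max_rounds):
        step = generate(context)
        context += step
        match = CODE_PATTERN.search(step)
        if match is None:
            break  # no tool call in this step: treat it as the final answer
        output = run_in_sandbox(match.group(1))
        # Feed the execution result back so the next step can use it.
        context += f"\n<output>{output}</output>\n"
    return context
```

Whether to call the tool is left entirely to `generate`; the loop only reacts to what the model emits, which is the behaviour the caption describes as optional invocation.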
Figure 2
Figure 2. Top: SFT data construction pipeline; Bottom: RL data filtering method. SFT Data Construction Pipeline: For MMK12 (Meng et al., 2025) data, powerful MLLMs and LLMs generate text-only and text-code interleaved reasoning data (CoT-Text and CoT-Interleave). Cold-start Data is then produced after verifying code execution and result accuracy in a sandbox. RL Data Pipeline: For the ViRL39k (Wang et al., 2025a) da…
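
The verification step named in the Figure 2 caption (keep a cold-start sample only if its code executes and its result is accurate) might look roughly like the following. This is a sketch, not the released pipeline: the sample schema with `code` and `answer` keys, the numeric tolerance, and the in-process `execute` helper are assumptions introduced here.

```python
import contextlib
import io
import math
from typing import Dict, List, Optional


def execute(code: str) -> Optional[str]:
    """Run code and return its stdout, or None if it raises.

    In-process stand-in for a sandbox; not isolated.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception:
        return None
    return buffer.getvalue().strip()


def matches(output: str, reference: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, else as strings."""
    try:
        return math.isclose(float(output), float(reference), rel_tol=rel_tol)
    except ValueError:
        return output == reference.strip()


def filter_cold_start(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep samples whose code runs and whose output matches the answer."""
    kept = []
    for sample in samples:
        output = execute(sample["code"])
        if output is not None and matches(output, sample["answer"]):
            kept.append(sample)
    return kept
```

For example, a candidate whose code prints `567` against a reference answer of `567` is kept, while one that raises `ZeroDivisionError` or prints a wrong value is dropped.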
Figure 3
Figure 3. Impact of Reward Function Design on Code Rewards. Racc > 0 indicates that a correct result is the sole necessary condition. No Rcode means that the code reward does not provide supervision signals for RL, with only Rcode recorded. For other numerical parameters, the first and second parts of each row denote the lower and upper bounds of Pr and Pw, respectively. 4.3. Ablation Study and Key Insight The impa…
Figure 4
Figure 4. Correlation between Code Reward and Entropy across interleaved reasoning with code. Aggressive tool use (Red) leads to reward saturation at the cost of rising entropy. In contrast, judicious autonomous calling (Blue) maintains a low-entropy state with moderate rewards. The zero-calling baseline (Green) demonstrates the model’s performance in a purely textual/non-augmented context. Interleaved Reasoning 1 Qu…

Abstract

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: https://github.com/CongHan0808/AIR.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes AIR, a three-component framework to enable adaptive interleaved reasoning with code in MLLMs beyond visual-perception tasks. It includes a two-stage cold-start data construction pipeline, data filtering for RL curation, and an adaptive tool-invocation strategy based on a group-constrained reward function. After RL training, the authors claim an average 6.1 pp improvement on evaluation benchmarks (9.9 pp on interleaved reasoning samples) with tool-use success exceeding 95%.

Significance. If the performance gains are shown to be robust against appropriate baselines and controls, the work would meaningfully extend tool-use in MLLMs to numerical computation problems, moving beyond heuristic visual manipulation.

major comments (2)
  1. [Abstract] The performance claims (6.1 pp average gain, 9.9 pp on interleaved samples, >95% tool-use success) are stated without any reference to baselines, statistical significance, data splits, ablation controls, or potential post-hoc selection, rendering it impossible to assess whether the reported improvements are supported by the experiments.
  2. [Abstract] The central claim that the group-constrained reward successfully teaches adaptive tool invocation (rather than overfitting to the curated set) rests on the two-stage cold-start pipeline and filtering strategies, yet no details on these components, hyperparameter sensitivity, or failure modes are supplied in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our submission. We address the two major comments regarding the abstract below, clarifying the experimental support present in the full manuscript while agreeing to strengthen the abstract for clarity.

Point-by-point responses
  1. Referee: [Abstract] The performance claims (6.1 pp average gain, 9.9 pp on interleaved samples, >95% tool-use success) are stated without any reference to baselines, statistical significance, data splits, ablation controls, or potential post-hoc selection, rendering it impossible to assess whether the reported improvements are supported by the experiments.

    Authors: We acknowledge that the abstract, due to its brevity, does not explicitly reference these elements. The full manuscript provides the requested context: Section 4 details comparisons against multiple baselines (standard MLLMs without code interleaving and prior tool-use approaches), reports results on fixed held-out data splits, includes ablation studies on the reward components, and presents mean performance with standard deviations across three random seeds to support statistical reliability. No post-hoc selection of results is performed; all reported metrics follow the pre-defined evaluation protocol described in Section 4.1. To address the concern directly in the abstract, we will revise it to briefly note the baseline comparisons and evaluation setup.
    Revision: yes

  2. Referee: [Abstract] The central claim that the group-constrained reward successfully teaches adaptive tool invocation (rather than overfitting to the curated set) rests on the two-stage cold-start pipeline and filtering strategies, yet no details on these components, hyperparameter sensitivity, or failure modes are supplied in the provided text.

    Authors: The abstract is space-constrained and therefore omits these specifics. The full manuscript describes the two-stage cold-start data construction pipeline in Section 3.1, the RL curation filtering criteria in Section 3.2, and the group-constrained reward formulation together with its adaptive invocation mechanism in Section 3.3. Hyperparameter sensitivity analyses appear in the appendix, and potential failure modes (including cases of non-adaptive behavior) are examined in Section 5 with qualitative examples. These elements collectively support that the observed gains arise from the training procedure rather than overfitting. We will add a concise reference to the pipeline and reward design in the revised abstract.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical RL results stand on reported experiments

full rationale

The supplied abstract and description contain no equations, fitted parameters presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The central claims rest on a two-stage data pipeline, filtering, and a group-constrained reward applied in RL training, followed by measured accuracy gains (6.1 pp average, 9.9 pp on interleaved samples). These are presented as experimental outcomes rather than a derivation that reduces to its own inputs by construction. Because the text contains no mathematical chain or self-referential definition, there is no derivation to audit for circularity; the claims are checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions about reward design and data quality rather than new mathematical derivations; no free parameters, invented entities, or non-standard axioms are mentioned in the abstract.

axioms (1)
  • Domain assumption: Reinforcement learning with a suitably designed reward function can teach models to decide when and how to invoke code tools adaptively.
    The entire training pipeline depends on this assumption holding for the reported gains to occur.



Reference graph

Works this paper leans on

17 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    Visual Instruction Tuning

    URL https://api.semanticscholar.org/CorpusID:275405676. Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. ArXiv, abs/2304.08485,

  2. [2]

    org/CorpusID:258179774

    URL https://api.semanticscholar.org/CorpusID:258179774. Liu, Y., Li, S., Cao, L., Xie, Y., Zhou, M., Dong, H., Ma, X., Han, S., and Zhang, D. Superrl: Reinforcement learning with supervision to boost language model reasoning. ArXiv, abs/2506.01096, 2025a. URL https://api.semanticscholar.org/CorpusID:279075961. Liu, Z., Sun, Z., Zang, Y., wen Dong,...

  3. [3]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    URL https://api.semanticscholar.org/CorpusID:264491155. Meng, F., Du, L., Liu, Z., Zhou, Z., Lu, Q., Fu, D., Han, T., Shi, B., Wang, W., He, J., Zhang, K., Luo, P., Qiao, Y., Zhang, Q., and Shao, W. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025. OpenAI. Learning to...

  4. [4]

    org/CorpusID:270870136

    URL https://api.semanticscholar.org/CorpusID:270870136. Qiu, H., Lan, X., Liu, F., Sun, X., Ruan, D., Shi, P., and Ma, L. Metis-rise: Rl incentivizes and sft enhances multimodal reasoning model learning. arXiv preprint arXiv:2506.13056, 2025. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ArXiv...

  5. [5]

    Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

    URL https://api.semanticscholar.org/CorpusID:267412607. Singh, J., Magazine, R., Pandya, Y., and Nambi, A. U. Agentic reasoning and tool integration for llms via reinforcement learning. ArXiv, abs/2505.01441,

  6. [6]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    URL https://api.semanticscholar.org/CorpusID:278327533. Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025. Team, Q. Qwen3-max: Just scale it, September 2025. Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., and Chen, ...

  7. [7]

    org/CorpusID:279410727

    URL https://api.semanticscholar.org/CorpusID:279410727. Wu, M., Yang, J., Jiang, J., Li, M., Yan, K., Yu, H., Zhang, M., Zhai, C., and Nahrstedt, K. Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use, 2025. URL https://arxiv.org/abs/2505.19255. Xiao, Y., Sun, E., Liu, T., and Wang, W. Logicvista: Multimodal ...

  8. [8]

    org/CorpusID:271050597

    URL https://api.semanticscholar.org/CorpusID:271050597. Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., and An, B. Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. ArXiv, abs/2509.02479, 2025. URL https://api.semanticscholar.org/CorpusID:281080825. Yang, S., Li, J., Lai, X., Yu, B., Zhao, H., and Jia, J...

  9. [9]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    URL https://api.semanticscholar.org/CorpusID:280222768. Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Dai, W., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W., Zhang, Y.-Q....

  10. [10]

    org/CorpusID:277104124

    URL https://api.semanticscholar.org/CorpusID:277104124. Yue, X. L.-C. T. Z., Lin, Z., Song, Y.-H., Wang, W., Ren, S.-Q., Gu, S., Li, S., Li, P., Zhao, L., Li, L., Bao, K., Tian, H., Zhang, H., Wang, G., Zhu, D., Cici, He, C., Ye, B., Shen, B., Zhang, Z., Jiang, Z.-A., Zheng, Z., Song, Z., Luo, Z., Yu, Y., Wang, Y., Tian, Y., Tu, Y., Yan, Y., Huang,...

  11. [11]

    The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    URL https://api.semanticscholar.org/CorpusID:279155294. Zhang, G., Geng, H., Yu, X., Yin, Z., Zhang, Z., Tan, Z., Zhou, H., Li, Z., Xue, X., Li, Y., Zhou, Y., Chen, Y., Zhang, C., Fan, Y., Wang, Z., Huang, S., Liao, Y., Wang, H., Yang, M., Ji, H., Littman, M., Wang, J., Yan, S., Torr, P., and Bai, L. The landscape of agentic reinforcement learning f...

  12. [12]

    org/CorpusID:273811803

    URL https://api.semanticscholar.org/CorpusID:273811803. A. System Prompt: To activate the model’s tool-invocation capability, we designed a task-specific prompt. Building on the commonly used prompts for Chain-of-Thought (CoT) data, we incorporated code-related instructions into the prompt desig...

  13. [13]

    Sum of squares: {sum_of_squares}

    1. Identify the elements of the matrix A. 2. Square each element. 3. Sum all the squared elements. 4. Take the square root of the sum. The matrix A is given by: $A = \begin{pmatrix} -2.2 & 0.6 & -4.4 \\ -3.1 & 1.8 & -1.8 \\ -7.3 & -6.9 & 0.5 \end{pmatrix}$. First, we square each element of the matrix: $(-2.2)^2 = 4.84$, $0.6^2 = 0.36$, $(-4.4)^2 = 19.36$, $(-3.1)^2 = 9.61$, $1.8^2 = 3.24$, $(-1.8)^2 = 3.24$, $(-7.3)^2 = 53.29$...

  14. [14]

    **Category 1:** - Item 1: 3 - Item 2: 2 - Item 3: 7 - The maximum value for Category 1 is 7

  15. [15]

    **Category 2:** - Item 1: 9 - Item 2: 5 - Item 3: 7 - The maximum value for Category 2 is 9

  16. [16]

    Product: {product}

    **Category 3:** - Item 1: 6 - Item 2: 9 - Item 3: 1 - The maximum value for Category 3 is 9. Now, we need to find the product of the maximum values from each category: - Maximum value for Category 1: 7 - Maximum value for Category 2: 9 - Maximum value for Category 3: 9 The product is: 7×9×9 Let’s calculate this using Python: <code> ‘‘‘python # Maximum val...

  17. [17]

    Month with the highest average temperature after smoothing: {max_month}

    1. Calculate the 3-month moving average for each month. 2. Identify the month with the highest smoothed temperature. Let’s start by calculating the 3-month moving average. The moving average for each month i is given by the average of the temperatures of month i−1, month i, and month i+1. We will use Python to perform these calculations accurately. <code> ‘‘‘p...
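
Two of the worked examples excerpted above are truncated before their code finishes, so the short script below redoes their arithmetic as a sanity check. It is a reconstruction from the numbers visible in the excerpts (the 3×3 matrix and the three categories of item values), not the paper's own code; the variable names echo the `{sum_of_squares}` and `{product}` placeholders but are otherwise chosen here. The square root of the sum of squared entries is the Frobenius norm of the matrix.

```python
import math

# Matrix from the excerpt, read row by row.
A = [
    [-2.2, 0.6, -4.4],
    [-3.1, 1.8, -1.8],
    [-7.3, -6.9, 0.5],
]

# Steps 2-4 of the excerpt: square each element, sum, take the square root.
sum_of_squares = sum(x * x for row in A for x in row)
frobenius_norm = math.sqrt(sum_of_squares)
print(f"Sum of squares: {sum_of_squares:.2f}")
print(f"Frobenius norm: {frobenius_norm:.3f}")

# Item values per category from the excerpt; take each maximum, then multiply.
categories = [[3, 2, 7], [9, 5, 7], [6, 9, 1]]
maxima = [max(items) for items in categories]
product = math.prod(maxima)
print(f"Product: {product}")  # 7 * 9 * 9 = 567
```

The moving-average example is left out because the excerpt does not include its temperature data.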