iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models
Pith reviewed 2026-05-21 15:44 UTC · model grok-4.3
The pith
A proposer-solver loop with trajectory-aware rewards lets multimodal models improve their reasoning from unlabeled images alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
iReasoner augments standard outcome-level intrinsic rewards with an additional trajectory-aware signal that measures internal agreement across intermediate reasoning steps inside a Proposer-Solver loop; the combined reward is used to train the model on unlabeled images, yielding up to 2.1 point gains across diverse multimodal reasoning benchmarks under fully unsupervised post-training.
What carries the argument
The trajectory-aware signal: it scores internal agreement across the sequence of intermediate reasoning steps elicited by the Proposer-Solver loop, and thereby distinguishes different reasoning paths that reach the same final answer.
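A minimal sketch of how such a signal could be computed, offered as an illustration rather than the paper's implementation. It follows the passage quoted later on this page (step-wise similarity to prototypes, with higher weight on early steps). The choices below are assumptions: each reasoning step is already a numeric embedding, prototypes are per-step means, similarity is cosine, and weights decay geometrically. The names `step_prototypes`, `agreement_reward`, and `decay` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def step_prototypes(trajectories):
    """Assumed prototype: the mean embedding at each step index across trajectories."""
    n_steps = min(len(t) for t in trajectories)  # truncate to the shortest trajectory
    dim = len(trajectories[0][0])
    return [
        [sum(t[k][d] for t in trajectories) / len(trajectories) for d in range(dim)]
        for k in range(n_steps)
    ]

def agreement_reward(trajectory, prototypes, decay=0.5):
    """Weighted mean of step-wise similarity; weight decay**k favours early steps."""
    weights = [decay ** k for k in range(len(prototypes))]
    sims = [cosine(trajectory[k], prototypes[k]) for k in range(len(prototypes))]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)

# Three toy trajectories that end at the same final step but start differently.
traj_a = [[1.0, 0.0], [0.0, 1.0]]
traj_b = [[1.0, 0.0], [0.0, 1.0]]
traj_c = [[0.0, 1.0], [0.0, 1.0]]
protos = step_prototypes([traj_a, traj_b, traj_c])
reward_a = agreement_reward(traj_a, protos)
reward_c = agreement_reward(traj_c, protos)
```

In this toy setup `traj_c` ends at the same final step as the others but receives a lower reward because its first step departs from the group, which is the kind of separation the outcome-level reward alone cannot make.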
If this is right
- Reasoning paths become more explicitly constrained during self-play even without external supervision.
- Performance on visually grounded multimodal tasks rises after post-training on unlabeled images only.
- Models can separate multiple valid reasoning routes to the same answer using only internal consistency checks.
- Fully unsupervised post-training becomes viable for improving implicit reasoning in large multimodal models.
Where Pith is reading between the lines
- The same loop structure could be tested on text-only or audio-visual tasks to check whether trajectory rewards transfer beyond image-based reasoning.
- If the internal-agreement signal proves robust, it might reduce the volume of human-labeled chain-of-thought data needed for supervised fine-tuning.
- Combining trajectory rewards with outcome rewards from multiple independent solver runs could further stabilize the learning signal.
- Longer reasoning trajectories might amplify or dilute the benefit, suggesting a natural next experiment on tasks that require many steps.
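The suggestion about pooling outcome rewards over several independent solver runs can be sketched with a majority vote, a common label-free choice. This is an assumption for illustration; the page does not state how the paper computes its outcome-level reward, and `majority_outcome_rewards` is a hypothetical name.

```python
from collections import Counter

def majority_outcome_rewards(answers):
    """Return (majority answer, vote share, per-run 0/1 rewards) for a list of solver answers."""
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]  # ties resolve to the first-seen answer
    confidence = votes / len(answers)
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, confidence, rewards

# Four hypothetical solver runs on the same proposed question.
majority, confidence, rewards = majority_outcome_rewards(["B", "A", "B", "B"])
```

The vote share could serve as a stability indicator: a question where the runs split evenly yields a weak outcome signal, which is where a trajectory-level term would carry more of the load.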
Load-bearing premise
That agreement among a model's own intermediate reasoning steps provides a valid learning signal that actually improves reasoning quality rather than merely reinforcing superficial consistency.
What would settle it
Ablating the trajectory-aware component while keeping the outcome-level reward and measuring whether benchmark gains disappear or reverse on the same set of multimodal reasoning tasks.
Original abstract
Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes iReasoner, a self-evolving framework for large multimodal models that improves implicit reasoning via a Proposer-Solver loop over unlabeled images. It augments standard outcome-level intrinsic rewards with a trajectory-aware signal that rewards internal agreement across intermediate chain-of-thought steps, without ground-truth labels or external judges. The central empirical claim is that this yields up to +2.1 points on diverse multimodal reasoning benchmarks when applied as unsupervised post-training to Qwen2.5-VL-7B.
Significance. If the central result holds after proper validation, the work would be moderately significant for unsupervised self-improvement of LMMs. It attempts to address the limitation of prior self-play methods that only reward final outcomes by adding explicit constraints on reasoning trajectories. The open-sourcing of code supports reproducibility and could serve as a baseline for future reasoning-aware intrinsic supervision techniques.
major comments (2)
- [Abstract] Abstract: The claim that the trajectory-aware signal 'distinguishes reasoning paths leading to the same answer' is load-bearing for the +2.1 point improvement, yet no diagnostic is reported showing that higher internal agreement across CoT steps correlates with correctness on held-out labeled data rather than consistent-but-erroneous visual grounding (e.g., repeated misreading of text or spatial relations).
- [Method] Method section (Proposer-Solver loop description): The formulation of the trajectory-aware reward must be checked for whether it can reinforce spurious consistency; without an ablation or correlation analysis against external verification, it remains unclear if the signal supplies valid learning gradients for actual reasoning quality improvement.
minor comments (2)
- [Abstract] The abstract mentions 'diverse multimodal reasoning benchmarks' but does not list them or report per-benchmark deltas with error bars; adding this table would strengthen the presentation.
- [Method] Notation for the intrinsic reward components (outcome-level vs. trajectory-aware) should be defined more explicitly with equations to avoid ambiguity in how they are combined.
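One way the requested notation could look, written as a hedged sketch rather than the paper's actual definition. All symbols are assumptions introduced here: a trajectory $\tau = (s_1, \dots, s_T, a)$ with reasoning steps $s_t$ and final answer $a$, a mixing coefficient $\lambda$, per-step reference prototypes $p_t$, a similarity function $\mathrm{sim}$, and non-increasing step weights $w_t$.

```latex
\begin{align}
R(\tau) &= R_{\text{out}}(a) + \lambda \, R_{\text{traj}}(s_1, \dots, s_T), \\
R_{\text{traj}}(s_1, \dots, s_T) &= \frac{\sum_{t=1}^{T} w_t \, \mathrm{sim}(s_t, p_t)}{\sum_{t=1}^{T} w_t},
\qquad w_1 \ge w_2 \ge \dots \ge w_T > 0 .
\end{align}
```

Under this form, two trajectories with the same answer share $R_{\text{out}}(a)$ and differ only through $R_{\text{traj}}$, which makes explicit where the path-level distinction enters.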
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas to strengthen our presentation of the results. We address each major comment below and commit to revisions that include the suggested diagnostics and ablations.
- Referee: [Abstract] Abstract: The claim that the trajectory-aware signal 'distinguishes reasoning paths leading to the same answer' is load-bearing for the +2.1 point improvement, yet no diagnostic is reported showing that higher internal agreement across CoT steps correlates with correctness on held-out labeled data rather than consistent-but-erroneous visual grounding (e.g., repeated misreading of text or spatial relations).
  Authors: Thank you for highlighting this important point. The trajectory-aware signal is intended to provide finer-grained supervision by rewarding consistency in intermediate steps for paths that lead to the same final answer. While the manuscript does not include an explicit correlation study on held-out data, the performance improvements on multiple benchmarks suggest that the signal is capturing useful reasoning improvements rather than mere consistency in errors. To directly address this concern, we will add a new analysis in the revised manuscript that computes the correlation between the internal agreement score and correctness using a held-out labeled subset of the data. This will help demonstrate whether higher agreement indeed aligns with correct visual grounding.
  Revision: yes
- Referee: [Method] Method section (Proposer-Solver loop description): The formulation of the trajectory-aware reward must be checked for whether it can reinforce spurious consistency; without an ablation or correlation analysis against external verification, it remains unclear if the signal supplies valid learning gradients for actual reasoning quality improvement.
  Authors: We agree that verifying the reward does not reinforce spurious consistency is crucial. The current formulation rewards agreement across CoT steps only when the final answer matches, which we believe encourages coherent reasoning. However, to provide stronger evidence, we will include an ablation study in the revision that compares the full iReasoner reward against a baseline that uses only outcome rewards and against a variant with access to external verification on a subset. We will also report the correlation with external correctness metrics to confirm the validity of the learning gradients.
  Revision: yes
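The correlation diagnostic discussed in both responses could be as simple as a point-biserial correlation, which is the Pearson correlation between a continuous agreement score and a 0/1 correctness label. The sketch below is illustrative only: the scores and labels are made-up toy values, not results from the paper, and `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; with binary ys this equals the point-biserial correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / math.sqrt(var_x * var_y)

# Toy held-out subset: agreement score per sample and whether the answer was correct.
agreement_scores = [0.9, 0.8, 0.3, 0.2]
is_correct = [1, 1, 0, 0]
r = pearson(agreement_scores, is_correct)
```

A value near zero or below on real held-out data would support the referee's worry that the signal rewards consistent-but-wrong grounding; a clearly positive value would support the authors' reading.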
Circularity Check
No significant circularity; empirical gains are independently measured
The paper defines a trajectory-aware intrinsic reward over internal agreement between proposer and solver outputs in an unsupervised loop, then reports measured improvements of up to +2.1 points on external multimodal reasoning benchmarks. No equation or claim reduces the reported performance to a fitted parameter by construction, nor does any load-bearing premise collapse into a self-citation or prior ansatz from the same authors. The central mechanism is a testable hypothesis about agreement as a proxy signal, evaluated against held-out labeled benchmarks rather than tautologically equivalent to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Internal agreement across chain-of-thought trajectories correlates with improved reasoning quality in the absence of ground truth.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "trajectory-aware signal defined over intermediate reasoning steps... Intrinsic CoT Agreement Reward... step-wise similarity to these prototypes, with higher weight on early, grounding-heavy steps"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
  EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
- EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
  EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.