iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Manikandarajan Venmathimaran; Meghana Sunil; Muthu Subash Kavitha

arxiv: 2601.05877 · v3 · pith:KALDCBDNnew · submitted 2026-01-09 · 💻 cs.CL

Pith reviewed 2026-05-21 15:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords self-evolving models · large multimodal models · intrinsic rewards · trajectory-aware supervision · unsupervised post-training · chain-of-thought reasoning · proposer-solver loop

The pith

A proposer-solver loop with trajectory-aware rewards lets multimodal models improve their reasoning from unlabeled images alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current self-improvement methods for large multimodal models are limited because they reward only final answers, leaving the quality of intermediate reasoning steps weakly constrained. It introduces a trajectory-aware signal that checks internal agreement across the chain of reasoning steps generated in a proposer-solver loop over unlabeled images. This extra signal supplies learning targets that can separate good reasoning paths from poor ones even when no ground-truth labels or external judges are available. A sympathetic reader would care because better-constrained intermediate steps should produce more reliable visually grounded decisions rather than just higher final accuracy. The reported outcome is that, starting from Qwen2.5-VL-7B, the approach gains up to 2.1 points on multimodal reasoning benchmarks after fully unsupervised post-training.

Core claim

iReasoner augments standard outcome-level intrinsic rewards with a trajectory-aware signal that measures internal agreement across intermediate reasoning steps inside a Proposer-Solver loop. The combined reward is used to train the model on unlabeled images, yielding gains of up to 2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training.

What carries the argument

The trajectory-aware signal, which scores internal agreement across the sequence of intermediate reasoning steps elicited by the proposer-solver loop and can therefore distinguish different reasoning paths that reach the same final answer.
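
As a rough illustration of how such a signal could sit alongside an outcome-level reward, the Python sketch below scores several sampled reasoning trajectories for one image. Everything in it is an assumption made for illustration and is not taken from the paper's implementation: the function names, majority voting as the outcome reward, token-level Jaccard overlap as the step similarity, comparison against same-answer peers rather than learned prototypes, geometrically decaying step weights (`decay`), and the mixing coefficient `lam`.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two reasoning steps (illustrative stand-in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def outcome_reward(answers: list[str]) -> list[float]:
    """Assumed outcome-level reward: 1.0 if the answer matches the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def trajectory_reward(trajectories: list[list[str]], answers: list[str],
                      decay: float = 0.5) -> list[float]:
    """Assumed trajectory-level reward: weighted step-wise agreement with
    other trajectories that reach the same answer. Earlier steps get more weight."""
    rewards = []
    for i, (steps, ans) in enumerate(zip(trajectories, answers)):
        peers = [t for j, (t, a) in enumerate(zip(trajectories, answers))
                 if j != i and a == ans]
        if not peers or not steps:
            rewards.append(0.0)
            continue
        weights = [decay ** k for k in range(len(steps))]
        total = 0.0
        for k, step in enumerate(steps):
            # Compare step k only with peers that have a step at position k.
            sims = [jaccard(step, p[k]) for p in peers if k < len(p)]
            total += weights[k] * (sum(sims) / len(sims) if sims else 0.0)
        rewards.append(total / sum(weights))
    return rewards


def combined_reward(trajectories: list[list[str]], answers: list[str],
                    lam: float = 0.5) -> list[float]:
    """Outcome reward plus lam times the trajectory reward (hypothetical mixing rule)."""
    out = outcome_reward(answers)
    traj = trajectory_reward(trajectories, answers)
    return [o + lam * t for o, t in zip(out, traj)]
```

Under these assumptions, two trajectories that reach the same majority answer receive the same outcome reward but can receive different combined rewards, depending on how closely their intermediate steps agree with those of their peers.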

If this is right

Reasoning paths become more explicitly constrained during self-play even without external supervision.
Performance on visually grounded multimodal tasks rises after post-training on unlabeled images only.
Models can separate multiple valid reasoning routes to the same answer using only internal consistency checks.
Fully unsupervised post-training becomes viable for improving implicit reasoning in large multimodal models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same loop structure could be tested on text-only or audio-visual tasks to check whether trajectory rewards transfer beyond image-based reasoning.
If the internal-agreement signal proves robust, it might reduce the volume of human-labeled chain-of-thought data needed for supervised fine-tuning.
Combining trajectory rewards with outcome rewards from multiple independent solver runs could further stabilize the learning signal.
Longer reasoning trajectories might amplify or dilute the benefit, suggesting a natural next experiment on tasks that require many steps.

Load-bearing premise

That agreement among a model's own intermediate reasoning steps provides a valid learning signal that actually improves reasoning quality rather than merely reinforcing superficial consistency.

What would settle it

Ablating the trajectory-aware component while keeping the outcome-level reward and measuring whether benchmark gains disappear or reverse on the same set of multimodal reasoning tasks.

Original abstract

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes iReasoner, a self-evolving framework for large multimodal models that improves implicit reasoning via a Proposer-Solver loop over unlabeled images. It augments standard outcome-level intrinsic rewards with a trajectory-aware signal that rewards internal agreement across intermediate chain-of-thought steps, without ground-truth labels or external judges. The central empirical claim is that this yields up to +2.1 points on diverse multimodal reasoning benchmarks when applied as unsupervised post-training to Qwen2.5-VL-7B.

Significance. If the central result holds after proper validation, the work would be moderately significant for unsupervised self-improvement of LMMs. It attempts to address the limitation of prior self-play methods that only reward final outcomes by adding explicit constraints on reasoning trajectories. The open-sourcing of code supports reproducibility and could serve as a baseline for future reasoning-aware intrinsic supervision techniques.

major comments (2)

[Abstract] The claim that the trajectory-aware signal 'distinguishes reasoning paths leading to the same answer' is load-bearing for the +2.1 point improvement, yet no diagnostic is reported showing that higher internal agreement across CoT steps correlates with correctness on held-out labeled data rather than consistent-but-erroneous visual grounding (e.g., repeated misreading of text or spatial relations).
[Method] Method section (Proposer-Solver loop description): The formulation of the trajectory-aware reward must be checked for whether it can reinforce spurious consistency; without an ablation or correlation analysis against external verification, it remains unclear if the signal supplies valid learning gradients for actual reasoning quality improvement.

minor comments (2)

[Abstract] The abstract mentions 'diverse multimodal reasoning benchmarks' but does not list them or report per-benchmark deltas with error bars; adding a table with these details would strengthen the presentation.
[Method] Notation for the intrinsic reward components (outcome-level vs. trajectory-aware) should be defined more explicitly with equations to avoid ambiguity in how they are combined.
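
For illustration only, one form such a definition could take is sketched below. The symbols are assumed notation, not the paper's: $r_{\text{out}}$ is the outcome-level reward, $r_{\text{traj}}$ the trajectory-aware reward, $\lambda \ge 0$ a mixing coefficient, $s_k(\tau)$ an agreement score for step $k$ of trajectory $\tau$, and $w_k > 0$ a per-step weight.

```latex
r(\tau) = r_{\text{out}}(\tau) + \lambda \, r_{\text{traj}}(\tau),
\qquad
r_{\text{traj}}(\tau) = \frac{\sum_{k=1}^{K} w_k \, s_k(\tau)}{\sum_{k=1}^{K} w_k}
```

Writing the combination this way would make explicit whether the two components are added, gated, or normalized, which is the ambiguity the comment points at.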

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen our presentation of the results. We address each major comment below and commit to revisions that include the suggested diagnostics and ablations.

Point-by-point responses

Referee: [Abstract] The claim that the trajectory-aware signal 'distinguishes reasoning paths leading to the same answer' is load-bearing for the +2.1 point improvement, yet no diagnostic is reported showing that higher internal agreement across CoT steps correlates with correctness on held-out labeled data rather than consistent-but-erroneous visual grounding (e.g., repeated misreading of text or spatial relations).

Authors: Thank you for highlighting this important point. The trajectory-aware signal is intended to provide finer-grained supervision by rewarding consistency in intermediate steps for paths that lead to the same final answer. While the manuscript does not include an explicit correlation study on held-out data, the performance improvements on multiple benchmarks suggest that the signal is capturing useful reasoning improvements rather than mere consistency in errors. To directly address this concern, we will add a new analysis in the revised manuscript that computes the correlation between the internal agreement score and correctness using a held-out labeled subset of the data. This will help demonstrate whether higher agreement indeed aligns with correct visual grounding.

revision: yes

Referee: [Method] Method section (Proposer-Solver loop description): The formulation of the trajectory-aware reward must be checked for whether it can reinforce spurious consistency; without an ablation or correlation analysis against external verification, it remains unclear if the signal supplies valid learning gradients for actual reasoning quality improvement.

Authors: We agree that verifying the reward does not reinforce spurious consistency is crucial. The current formulation rewards agreement across CoT steps only when the final answer matches, which we believe encourages coherent reasoning. However, to provide stronger evidence, we will include an ablation study in the revision that compares the full iReasoner reward against a baseline that uses only outcome rewards and against a variant with access to external verification on a subset. We will also report the correlation with external correctness metrics to confirm the validity of the learning gradients.

revision: yes
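
The correlation diagnostic discussed in both responses could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' analysis: the function names are hypothetical, the inputs are assumed to be one agreement score and one correctness flag per held-out example, and the statistic is a plain Pearson correlation against 0/1 labels (equivalent to a point-biserial correlation).

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def agreement_vs_correctness(agreement: list[float], correct: list[bool]) -> float:
    """Correlate per-example agreement scores with 0/1 correctness labels."""
    labels = [1.0 if c else 0.0 for c in correct]
    return pearson(agreement, labels)
```

A value near zero or below on held-out labeled data would support the referee's concern about consistent-but-erroneous reasoning, while a clearly positive value would support the authors' reading; the sketch itself says nothing about which outcome holds.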

Circularity Check

0 steps flagged

No significant circularity; empirical gains are independently measured

Full rationale

The paper defines a trajectory-aware intrinsic reward over internal agreement across intermediate reasoning steps in an unsupervised Proposer-Solver loop, then reports measured improvements of up to +2.1 points on external multimodal reasoning benchmarks. No equation or claim reduces the reported performance to a fitted parameter by construction, nor does any load-bearing premise collapse into a self-citation or prior ansatz from the same authors. The central mechanism is a testable hypothesis about agreement as a proxy signal, evaluated against held-out labeled benchmarks rather than tautologically equivalent to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only view limits identification of exact free parameters; the framework rests on the domain assumption that internal trajectory agreement supplies useful supervision.

axioms (1)

Domain assumption: Internal agreement across chain-of-thought trajectories correlates with improved reasoning quality in the absence of ground truth.
This premise enables the unsupervised setting and is invoked when the abstract claims the trajectory-aware signal distinguishes reasoning paths.

pith-pipeline@v0.9.0 · 5732 in / 1258 out tokens · 68724 ms · 2026-05-21T15:44:16.967372+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: trajectory-aware signal defined over intermediate reasoning steps... Intrinsic CoT Agreement Reward... step-wise similarity to these prototypes, with higher weight on early, grounding-heavy steps

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV · 2026-04 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
cs.CV · 2026-05 · unverdicted · novelty 6.0

EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.