Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
Pith reviewed 2026-05-10 05:29 UTC · model grok-4.3
The pith
SPECTRA lets small vision-language models learn grounded agentic trajectories from environment interaction alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPECTRA bootstraps agentic capabilities in SVLMs via Coldstart Reinforcement Learning by enforcing Soft Structured Multi-turn Rollouts that direct agents to sequence tool-derived evidence before synthesis. A multi-objective reward signal simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. The framework further introduces Tool Instrumental Utility to quantify tool efficacy absent ground truth, yielding up to 5% higher task accuracy and 9% better tool efficiency across composite and MMMU-Pro benchmarks.
What carries the argument
Soft Structured Multi-turn Rollouts, a topological constraint that requires agents to explicitly sequence tool-derived visual evidence before any final synthesis or answer generation.
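The abstract names the constraint but not its formal definition. A minimal sketch of what a soft evidence-before-synthesis check could look like, assuming rollouts are sequences of typed steps; the step taxonomy and the penalty form here are our assumptions, not the paper's:

```python
# Hypothetical sketch: the paper does not publish the formal constraint,
# so the step types and the soft-penalty shape below are assumptions.

def structure_score(steps):
    """Return 1.0 if every answer step is preceded by tool-derived evidence,
    decaying toward 0.0 with each violation (the 'soft' part of the constraint)."""
    evidence_seen = 0
    violations = 0
    for kind in steps:  # kind in {"reason", "tool", "perceive", "answer"}
        if kind in ("tool", "perceive"):
            evidence_seen += 1
        elif kind == "answer" and evidence_seen == 0:
            violations += 1
    return 1.0 / (1.0 + violations)

# A grounded rollout scores 1.0; answering before gathering evidence is penalized.
assert structure_score(["reason", "tool", "perceive", "reason", "answer"]) == 1.0
assert structure_score(["reason", "answer"]) == 0.5
```

A reciprocal penalty rather than a hard rejection keeps the reward signal dense, which is one plausible reading of why the constraint is called "soft".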
If this is right
- SVLMs can improve visual grounding and tool orchestration purely through environmental interaction.
- Expensive supervised trajectory tuning becomes unnecessary for achieving measurable gains in agentic performance.
- Structured rollout constraints produce more reliable sequencing of evidence before reasoning.
- Tool efficacy can be measured and optimized even when ground-truth answers are unavailable.
Where Pith is reading between the lines
- The same cold-start structure could reduce dependence on human feedback when training agents in non-visual domains.
- Topological constraints on reasoning steps may substitute for some forms of preference data in other multimodal settings.
- Combining SPECTRA with very small amounts of supervision could be tested to see if further gains appear.
Load-bearing premise
The multi-objective reward signal and soft structured rollout constraint suffice for the model to discover effective grounded behaviors without external labels or supervision.
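The premise can be made concrete. Below is a hedged sketch of one plausible linear combination of the three objectives the abstract names; the actual terms and weight values live in the paper's Section 3 and may differ:

```python
# Illustrative only: SPECTRA's real reward terms and weights are defined in
# the paper's Section 3. The linear form and the weight values are assumptions.

def reward(correct, structure, tool_utility,
           w_c=1.0, w_s=0.5, w_u=0.5):
    """Combine the three objectives from the abstract into one scalar:
    task correctness, rollout structure, and tool utility."""
    return w_c * float(correct) + w_s * structure + w_u * tool_utility

# A correct answer from a well-structured, tool-grounded rollout dominates
# a correct answer with no grounding at all.
assert reward(True, 1.0, 0.8) > reward(True, 0.0, 0.0)
```

Note that with these (hypothetical) weights, a maximally structured but wrong rollout can tie a bare correct answer, which is exactly the tuning sensitivity the referee report below flags.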
What would settle it
Train an SVLM with SPECTRA and compare it head-to-head against the identical base model on the same visual tasks; if accuracy and tool efficiency show no gain or a decline, the central claim fails.
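One way to run that settlement test, sketched with synthetic per-item scores (the accuracy levels below are illustrative, not the paper's):

```python
import random

# Sketch of the decisive experiment: score the identical base SVLM with and
# without SPECTRA training on the same items, then check that the accuracy
# gap survives a paired bootstrap. All data here is synthetic.

random.seed(0)
items = 500
base    = [random.random() < 0.55 for _ in range(items)]  # hypothetical base model
spectra = [random.random() < 0.75 for _ in range(items)]  # hypothetical SPECTRA model

delta = (sum(spectra) - sum(base)) / items
pairs = list(zip(base, spectra))

n, wins = 2000, 0
for _ in range(n):
    resample = [random.choice(pairs) for _ in range(items)]
    if sum(s for _, s in resample) > sum(b for b, _ in resample):
        wins += 1

# If delta were <= 0, or the bootstrap win rate hovered near 0.5,
# the central claim fails by the review's own criterion.
print(f"accuracy gap = {delta:+.3f}, bootstrap win rate = {wins / n:.3f}")
```

Resampling items jointly (as pairs) keeps the comparison head-to-head on the same evaluation set, which is what "identical base model on the same visual tasks" requires.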
original abstract
Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Coldstart Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SPECTRA (Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment), a supervision-free cold-start reinforcement learning framework for improving agentic capabilities in Small Vision-Language Models (SVLMs). It introduces Soft Structured Multi-turn Rollouts as a topological constraint to enforce sequencing of tool-derived evidence before synthesis, a multi-objective reward combining task correctness, rollout structure, and tool utility, and the Tool Instrumental Utility (TIU) metric to quantify tool efficacy without ground truth. Evaluations on composite benchmarks and out-of-distribution MMMU-Pro data report gains of up to 5% in task accuracy and 9% in tool efficiency.
Significance. If the central claims hold after verification, the work would be moderately significant for the field of multimodal agentic systems. It addresses visual brittleness and tool orchestration in efficient SVLMs without relying on expensive supervised trajectory data, potentially enabling more scalable self-supervised learning from environmental interaction. The TIU metric and structured rollout topology are presented as novel, but their impact depends on demonstrating they are not reducible to reward engineering. The reported gains are modest, so broader adoption would require strong ablations and reproducibility evidence.
major comments (3)
- [Abstract] The specific claims of 'up to 5% task accuracy' and '9% tool efficiency' improvements are load-bearing for the central contribution, but no details on baselines, statistical tests, number of runs, or variance are given. Without these, it is impossible to assess whether the gains exceed noise or are attributable to SPECTRA rather than implementation choices.
- [Abstract; reward formulation (assumed §3)] The multi-objective reward is described as maximizing 'task correctness, rollout structure, and tool utility' with no explicit equations or weight values provided. Since the weights in the multi-objective signal are free parameters, the claim that agents 'self-discover robust behaviors without human preference labels' risks circularity: the structure and utility terms may be tuned to produce the reported outcomes.
- [Abstract] TIU is introduced as a novel metric to 'quantify tool efficacy in the absence of ground truth,' yet no definition, correlation analysis with actual task performance, or validation against held-out ground truth is referenced. This is central to the supervision-free claim and requires explicit formulation and empirical checks.
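The reporting the first major comment asks for can be sketched in a few lines, using illustrative per-seed numbers (the paper's actual runs are not visible here):

```python
from statistics import mean, stdev
from math import sqrt

# Sketch of run-level reporting: accuracy over several seeds with a Welch
# t-statistic against the baseline. All numbers below are invented examples.

baseline = [0.601, 0.594, 0.610, 0.598, 0.605]   # hypothetical seed runs
spectra  = [0.648, 0.655, 0.641, 0.652, 0.659]

def welch_t(a, b):
    """Welch's t-statistic for the difference of means of b over a."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(b) - mean(a)) / sqrt(va + vb)

print(f"baseline {mean(baseline):.3f} ± {stdev(baseline):.3f}")
print(f"SPECTRA  {mean(spectra):.3f} ± {stdev(spectra):.3f}")
print(f"Welch t = {welch_t(baseline, spectra):.1f}")
```

Even this minimal table (mean, sample standard deviation, and a t-statistic across seeds) would let a reader judge whether a 5% gain clears run-to-run noise.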
minor comments (2)
- [Title] The title 'Waking Up Blind' does not appear to match the abstract content focused on SPECTRA; clarify or align the title with the proposed framework.
- [Abstract] The abstract mentions 'composite and out-of-distribution (MMMU-Pro) benchmarks' but does not list the specific composite benchmarks or dataset sizes; add these for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the valuable feedback on our manuscript. We address each of the major comments below and will incorporate revisions to enhance the clarity and rigor of the abstract and method descriptions.
point-by-point responses
- Referee: [Abstract] The specific claims of 'up to 5% task accuracy' and '9% tool efficiency' improvements are load-bearing for the central contribution but provide no details on baselines, statistical tests, number of runs, or variance. Without this, it is impossible to assess whether the gains exceed noise or are attributable to SPECTRA rather than implementation choices.
Authors: We agree that the abstract would benefit from more context on the evaluation. The full manuscript details the baselines (including vanilla SVLM, standard RL without structured rollouts, and supervised methods), reports results over multiple runs with variance, and includes statistical tests in Section 4 and the associated tables. To address this directly in the abstract, we will revise it to note that the reported improvements are 'statistically significant across multiple runs with variance accounted for, as detailed in Section 4.' revision: yes
- Referee: [Abstract; reward formulation (assumed §3)] The multi-objective reward is described as maximizing 'task correctness, rollout structure, and tool utility' with no explicit equations or weight values provided. Since weights in the multi-objective signal are free parameters, the claim that agents 'self-discover robust behaviors without human preference labels' risks circularity; the structure and utility terms may be tuned to produce the reported outcomes.
Authors: The explicit formulation of the multi-objective reward, including the equations for each term and the fixed weight values, is provided in Section 3 of the manuscript. The weights are predetermined and fixed across experiments to avoid task-specific tuning, allowing the agent to discover behaviors through the environmental interaction signal. We will revise the manuscript to include the reward equation prominently in the abstract or introduction for better accessibility and to explicitly state that weights are not optimized on the reported benchmarks. revision: yes
- Referee: [Abstract] TIU is introduced as a novel metric to 'quantify tool efficacy in the absence of ground truth,' yet no definition, correlation analysis with actual task performance, or validation against held-out ground truth is referenced. This is central to the supervision-free claim and requires explicit formulation and empirical checks.
Authors: The Tool Instrumental Utility (TIU) metric is formally defined in Section 3.3, with supporting correlation analysis to task performance and validation on held-out ground truth data presented in Section 4 and the appendix. This demonstrates its reliability as a supervision-free proxy. We will update the abstract to include a concise definition of TIU and a reference to the empirical validation results to strengthen the presentation of this contribution. revision: yes
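The validation step the authors describe, correlating TIU with held-out correctness, might be sketched as follows. The TIU values below are invented placeholders, since the metric's actual definition in Section 3.3 is not visible here; only the shape of the check is illustrated:

```python
from statistics import mean

# Sketch of the validation the rebuttal cites: correlate a ground-truth-free
# tool-utility score with held-out task correctness. TIU values are invented.

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

tiu_scores = [0.82, 0.35, 0.91, 0.40, 0.77, 0.15]   # hypothetical per-rollout TIU
correct    = [1,    0,    1,    0,    1,    0]       # held-out ground truth

r = pearson(tiu_scores, correct)
# A high r on held-out data would support TIU as a supervision-free proxy
# for tool efficacy; a weak r would undercut the central claim.
print(f"TIU vs correctness: r = {r:.2f}")
```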
Circularity Check
No significant circularity in derivation chain
full rationale
The abstract describes SPECTRA as a supervision-free framework using a multi-objective reward and topological constraints for self-discovery of behaviors. However, no equations, specific reward formulations, rollout definitions, or derivation steps are provided in the accessible text. Without quoted paper content exhibiting reductions (e.g., a fitted parameter renamed as prediction or self-citation load-bearing the central claim), no circular steps can be identified per the strict criteria. The framework is presented as self-contained against benchmarks like MMMU-Pro, with claims of 5% accuracy and 9% efficiency gains treated as empirical outcomes rather than definitional.
Axiom & Free-Parameter Ledger
free parameters (1)
- weights in multi-objective reward signal
axioms (2)
- domain assumption: reinforcement learning from environmental interaction alone can bootstrap robust agentic behaviors in SVLMs
- ad hoc to paper: topological constraints on multi-turn rollouts enforce grounding of reasoning in visual observations
invented entities (2)
- Tool Instrumental Utility (TIU): no independent evidence
- SPECTRA framework: no independent evidence
Reference graph
Works this paper leans on
- [1] On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral. Preprint, arXiv:2512.04220.
- [2] A Diagram Is Worth a Dozen Images. Preprint, arXiv:1603.07396.
- [3] LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. Preprint, arXiv:2311.05437.
- [4] GAIA: A Benchmark for General AI Assistants. Preprint, arXiv:2311.12983.
- [5] Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. Preprint, arXiv:2210.01241.
- [6] Proximal Policy Optimization Algorithms. Preprint, arXiv:1707.06347.
- [7] MMInA: Benchmarking Multihop Multimodal Internet Agents. Preprint, arXiv:2404.09992.
- [8] AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. Preprint, arXiv:2407.18901.
- [9] Dataset note: grade-school, textbook-style scientific diagrams paired with multiple-choice questions that evaluate diagram understanding and visual reasoning.
- [10-15] Figure 9, the structured rollout prompt used in SPECTRA for training and inference: the agent thinks inside <think_reasoning>...</think_reasoning> tags; calls tools by emitting <tool_call>{"name": <function-name>, "arguments": <args-json-object>}</tool_call> (available tools: captioning_tool, ocr_tool, detection_...); after each tool_response, reflects once inside <think_perception>...</think_perception> tags; then resumes reasoning inside <think_reasoning> tags; and must always end with <answer>\boxed{(option)answer}</answer>, even if uncertain.
- [16-18] Qualitative rollout excerpts on food-web and human-respiratory-system diagram questions.
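The rollout format quoted in entries [10]-[15] is machine-parseable. A minimal sketch of extracting the tool calls, assuming the JSON schema shown in those fragments (tag names come from the page; everything else here is an assumption):

```python
import json
import re

# Sketch of a parser for the Figure 9 rollout format. The <tool_call> tag and
# {"name": ..., "arguments": ...} schema are quoted from the prompt fragments;
# the parsing strategy itself is an assumption.

TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(rollout: str):
    """Return the (name, arguments) pairs the agent emitted, in order."""
    calls = []
    for raw in TOOL_CALL.findall(rollout):
        obj = json.loads(raw)
        calls.append((obj["name"], obj.get("arguments", {})))
    return calls

rollout = (
    "<think_reasoning>Need the diagram labels first.</think_reasoning>"
    '<tool_call>{"name": "ocr_tool", "arguments": {"region": "full"}}</tool_call>'
)
assert extract_tool_calls(rollout) == [("ocr_tool", {"region": "full"})]
```

A structure reward like the one SPECTRA describes presumably runs a check of this kind over each rollout before scoring it.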
discussion (0)