pith. machine review for the scientific record.

arxiv: 2603.15432 · v3 · submitted 2026-03-16 · 💻 cs.CV

Recognition: no theorem link

Gym-V: A Unified Vision Environment System for Agentic Vision Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords Gym-V · vision agents · reinforcement learning · observation scaffolding · procedural generation · cross-domain transfer · agentic VLMs · visual environments

The pith

Observation scaffolding via captions and rules determines vision agent learning success more than RL algorithm choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Gym-V supplies a single platform of 179 procedurally generated visual environments across 10 domains with adjustable difficulty, replacing the previously fragmented collection of incompatible toolkits. Experiments on the platform show that adding textual captions and explicit game rules to the agent's observations enables successful training where none occurred before, while swapping one reinforcement learning algorithm for another produces smaller or negligible gains. The same infrastructure demonstrates that training across many task categories produces broad generalization, that training on only one category often produces negative transfer to others, and that allowing multi-turn interaction magnifies both the positive and negative effects.

Core claim

Gym-V is a unified platform of 179 procedurally generated visual environments spanning 10 domains. Controlled experiments on it establish that observation scaffolding—specifically the addition of captions and game rules—is the decisive factor that allows reinforcement learning to succeed at all, whereas the particular RL algorithm chosen matters far less. Training on diverse task categories yields positive cross-domain transfer while narrow training produces negative transfer, and multi-turn interaction amplifies every one of these outcomes.

What carries the argument

Gym-V itself: a single platform of 179 procedurally generated visual environments with controllable difficulty across 10 domains, supplying standardized observation formats, reward signals, and interaction loops for agent training and evaluation.
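To make that interface concrete, here is a minimal sketch, assuming a Gym-style reset/step API in which captions and rules are optional observation fields. Every name here (ScaffoldedEnv, describe, rules_text) is hypothetical; the paper's actual Gym-V API is not reproduced on this page.

```python
# Minimal sketch of a Gym-style reset/step loop with observation scaffolding.
# All names and fields are hypothetical stand-ins, not Gym-V's real API.
from dataclasses import dataclass, field


@dataclass
class Observation:
    image: bytes                  # rendered frame of the environment
    caption: str | None = None    # textual description (scaffolding)
    rules: str | None = None      # explicit game rules (scaffolding)


@dataclass
class StepResult:
    obs: Observation
    reward: float                 # from a built-in verifier
    done: bool
    info: dict = field(default_factory=dict)


class ScaffoldedEnv:
    """Wraps a base environment and injects captions/rules into observations."""

    def __init__(self, base_env, use_caption: bool = True, use_rules: bool = True):
        self.base = base_env
        self.use_caption = use_caption
        self.use_rules = use_rules

    def reset(self, seed: int | None = None) -> Observation:
        return self._scaffold(self.base.reset(seed=seed))

    def step(self, action: str) -> StepResult:
        result = self.base.step(action)
        result.obs = self._scaffold(result.obs)
        return result

    def _scaffold(self, obs: Observation) -> Observation:
        if self.use_caption:
            obs.caption = self.base.describe()   # hypothetical caption hook
        if self.use_rules:
            obs.rules = self.base.rules_text()   # hypothetical rules hook
        return obs
```

The wrapper framing matters because it turns scaffolding into a toggle that can be ablated independently of the environment and the RL algorithm, which is exactly the comparison the paper's central claim rests on.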

If this is right

  • Training pipelines should allocate more effort to generating accurate captions and stating rules than to hyperparameter sweeps over RL algorithms (see the ablation sketch after this list).
  • Agents trained on mixed task categories will transfer more reliably than agents trained on single categories.
  • Multi-turn interaction loops will magnify both the benefits of good scaffolding and the harms of narrow training.
  • New environments can be added to the platform by following the same procedural generation rules to extend the test suite without breaking existing comparisons.
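To see what acting on the first bullet looks like in practice, here is a hypothetical ablation grid crossing scaffolding levels with the RL algorithms named in Figure 4 (GRPO, GSPO, SAPO); run_training is a placeholder, not the paper's training code.

```python
# Hypothetical ablation grid: scaffolding levels crossed with RL algorithms.
# If the paper's claim holds, variation across scaffolds should dominate
# variation across algorithms.
from itertools import product

SCAFFOLDS = ("image_only", "image_caption", "image_caption_rules")
ALGORITHMS = ("GRPO", "GSPO", "SAPO")  # algorithms named in Figure 4
SEEDS = range(5)                       # independent runs per condition


def run_training(scaffold: str, algorithm: str, seed: int) -> float:
    """Placeholder: train an agent under one condition, return final reward."""
    raise NotImplementedError


def ablation_grid() -> dict[tuple[str, str], list[float]]:
    return {
        (scaffold, algo): [run_training(scaffold, algo, seed) for seed in SEEDS]
        for scaffold, algo in product(SCAFFOLDS, ALGORITHMS)
    }
```

Budgeted this way, the scaffolding-versus-algorithm comparison becomes a single controlled sweep rather than an afterthought of hyperparameter tuning.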

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Vision-language models intended for agents may gain more from improved captioning modules than from further RL algorithm refinements.
  • The procedural generation approach could be extended to generate environments whose visual statistics match specific real-world domains for targeted testing.
  • Negative transfer observed under narrow training suggests that curriculum design for vision agents must deliberately mix categories rather than sequence them.

Load-bearing premise

The 179 procedurally generated environments capture the essential difficulties faced by real-world vision agents so that results obtained inside them will generalize to actual deployed systems.

What would settle it

Finding a set of real-world vision tasks where changing the RL algorithm produces larger performance gains than adding or removing captions and rules would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.15432 by Fanqing Meng, Jiaheng Zhang, Jiaqi Liao, Jiawei Gu, Lingxiao Du, Linjie Li, Mengkang Hu, Michael Qizhe Shieh, Xiangyan Liu, Zichen Liu, Zijian Wu, Ziqi Zhao.

Figure 1. Overview of Gym-V. Top: 105 single-turn and 74 multi-turn environments across 10 categories. Bottom: a unified reset/step interface shared by interactive environments, offline datasets, and evaluation benchmarks, with built-in verifiers checking correctness automatically.
Figure 2. Evaluation fidelity of Gym-V against official pipelines.
Figure 3. Accuracy (×100) vs. difficulty level on six representative environments from different categories.
Figure 4. Training reward curves for GRPO, GSPO, and SAPO across 12 single-turn (rows 1–3) …
Figure 5. Training reward curves for context modeling (top row) and rules injection (bottom row) …
Figure 6. Training reward curves comparing image only (green) vs. image + caption (red) on two single-turn and two multi-turn environments. Adding textual captions yields a substantial and consistent improvement across all four environments. The benefit is most pronounced on perception-heavy tasks such as longest_path, where the caption narrows the visual grounding gap early in training, and on multi-turn games (mineswe…
Figure 7. Training reward curves comparing three prompt conditions across four multi-turn games.
Figure 8. Sampled observations from 8 retro arcade environments over 200 VLM agent steps. Despite …
Figure 9. Algorithmic. Left: BinaryMatrix — the agent performs an algorithmic operation (e.g., counting connected components) on a rendered binary grid; output is a single integer or transformed grid. Right: RottenOranges — the agent determines the minimum time for all oranges to rot via BFS propagation on a grid; output is a single integer.
Figure 10. Cognition. Left: RubiksCube-QA — the agent observes a rendered 3D Rubik's Cube and answers questions about face colors after rotations. Right: OddOneOutPoly — the agent identifies which polygon differs from the rest based on a visual property (e.g., symmetry, number of sides).
Figure 11. Geometry. Left: LargestIsland — the agent observes a binary grid (blue = water, green = land) and must find the maximum area of a 4-directionally connected island; output is a single integer. Right: VisibleLine — given N lines y = Ax + B plotted on a 2D plane, the agent identifies which lines are visible from y = +∞ (i.e., lie on the upper envelope); output is the space-separated indices of visible lines …
Figure 12. Graphs. Left: ShortestPath — the agent computes the shortest path between two nodes in a weighted graph; output is the node sequence. Right: LongestPath — the agent finds the longest simple path in a graph; output is the path length.
Figure 13. Logic. Left: Thermometers — the agent fills thermometer-shaped regions in a grid to satisfy row/column sum constraints. Right: CircuitLogic — the agent traces a Boolean logic circuit and determines the output given input values.
Figure 14. Puzzles. Left: KnightSwap — the agent swaps the positions of two sets of knights on a small chessboard using legal knight moves; output is the move sequence. Right: TowerOfHanoi — the agent produces the move sequence to transfer all discs to the target peg; output is a sequence of (source, target) moves.
Figure 15. Games. Left: Sokoban — the agent pushes boxes onto goal positions by issuing directional commands (up/down/left/right) in a multi-turn grid puzzle. Right: Chess — the agent plays against a built-in opponent, outputting moves in UCI notation (e.g., e2e4) each turn.
Figure 16. Spatial. Left: DoorKey (MiniGrid, 2D) — the agent navigates a top-down grid-world to find a key, unlock a door, and reach the goal. Right: CollectHealth (MiniWorld, 3D) — the agent navigates a first-person 3D room to collect health items; actions include move_forward and turn.
Figure 17. Temporal. Left: StreetsOfRage2 — a side-scrolling beat-em-up requiring continuous combat inputs. Right: GoldenAxe — a hack-and-slash arcade game.
Figure 18. Per-environment zero-shot evaluation heatmap across all evaluated environments and 9 …
Original abstract

As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, to support systematic research on agentic vision models. Using this platform, the authors report that observation scaffolding (via captions and game rules) is more decisive for training success than the choice of RL algorithm, that captions and rules can determine whether learning succeeds at all, and that cross-domain transfer from diverse task categories generalizes better than narrow training (which can produce negative transfer), with multi-turn interaction amplifying these effects. Gym-V is released as open infrastructure for training and evaluation.

Significance. If the empirical claims are substantiated with full methods and statistics, Gym-V would address a clear gap in standardized benchmarks for vision agents, enabling reproducible comparisons that are currently fragmented. The reported dominance of scaffolding over algorithm choice, if robust, would usefully redirect attention toward environment design and reward verifiability in VLM training pipelines.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central claim that observation scaffolding is more decisive than RL algorithm choice is presented without any reported sample sizes, number of independent runs, error bars, or statistical tests. This absence makes it impossible to assess whether the ranking is reliable or sensitive to random seeds.
  2. [§3 (Environment Design)] The 179 procedurally generated environments are treated as representative of real vision-agent challenges, yet no description is given of how procedural generation handles stochastic textures, lighting variation, or long-tail object distributions. If these factors are fixed or simplified, the reported dominance of scaffolding may not survive restoration of realistic image statistics, directly undermining the load-bearing assumption for the main claim.
  3. [§4.3 (Transfer Experiments)] The statements on broad generalization from diverse training and negative transfer from narrow training lack quantitative definitions (e.g., exact performance deltas, how task categories were partitioned, and confidence intervals). Without these, the transfer results cannot be evaluated for robustness.
minor comments (2)
  1. [Abstract] The abbreviations VLM and RL are used without expansion on first use in the abstract or early sections.
  2. [Figures] Figure captions for environment examples should explicitly state the controllable difficulty parameters shown in each panel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional rigor will strengthen the manuscript. We address each major point below and will incorporate revisions to improve statistical reporting, environment descriptions, and quantitative transfer metrics.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim that observation scaffolding is more decisive than RL algorithm choice is presented without any reported sample sizes, number of independent runs, error bars, or statistical tests. This absence makes it impossible to assess whether the ranking is reliable or sensitive to random seeds.

    Authors: We agree that the central claim requires explicit statistical support to be fully convincing. In the revised manuscript we will report the exact sample sizes (5 independent runs per condition using distinct random seeds), include error bars (standard error) on all performance plots in §4, and add statistical tests (paired t-tests with p-values) comparing scaffolding conditions against algorithm variants. These additions will directly address sensitivity to seeds and allow readers to evaluate the reliability of the observed ranking (see the statistical sketch after these responses). revision: yes

  2. Referee: [§3 (Environment Design)] The 179 procedurally generated environments are treated as representative of real vision-agent challenges, yet no description is given of how procedural generation handles stochastic textures, lighting variation, or long-tail object distributions. If these factors are fixed or simplified, the reported dominance of scaffolding may not survive restoration of realistic image statistics, directly undermining the load-bearing assumption for the main claim.

    Authors: The procedural generation in Gym-V is deliberately parameterized to control difficulty through object count, placement, and basic visual attributes across the 10 domains. We acknowledge that the current §3 does not detail stochastic texture and lighting variation. In revision we will expand the section to specify that each domain samples from randomized texture palettes (hue/saturation jitter) and lighting directions (multiple light sources with angle variation), while long-tail object distributions are approximated via expanded per-domain asset libraries (see the rendering-randomization sketch after these responses). We maintain that the platform's value lies in verifiable rewards and controllable scaffolding rather than full photorealism; the scaffolding dominance is demonstrated under these controlled statistics, which we will now make explicit. revision: partial

  3. Referee: [§4.3 (Transfer Experiments)] The statements on broad generalization from diverse training and negative transfer from narrow training lack quantitative definitions (e.g., exact performance deltas, how task categories were partitioned, and confidence intervals). Without these, the transfer results cannot be evaluated for robustness.

    Authors: We agree that precise quantitative definitions are required. In the revision we will explicitly state that task categories are partitioned according to the 10 domains listed in Table 1, report exact performance deltas (e.g., mean success-rate improvement of diverse training over narrow baselines), and include 95% confidence intervals computed across the 5 independent runs for all transfer metrics (the interval computation is sketched after these responses). These details will be added to §4.3 and the associated tables/figures. revision: yes
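The statistical reporting promised in responses 1 and 3 is routine to implement; a minimal sketch follows, assuming NumPy and SciPy. The per-seed rewards and transfer deltas are invented placeholders, not results from the paper, and the five-seed setup mirrors the protocol the authors commit to above.

```python
# Sketch of the promised reporting: SEM error bars, a paired t-test between
# a scaffolding condition and an algorithm variant, and a 95% t-interval
# for transfer deltas. All numbers below are hypothetical placeholders.
import numpy as np
from scipy import stats

# Final mean rewards per seed (n=5) for two conditions on shared seeds.
scaffolded = np.array([0.71, 0.68, 0.74, 0.70, 0.69])  # image + caption + rules
algo_swap = np.array([0.52, 0.55, 0.50, 0.54, 0.51])   # alternative RL algorithm

for name, x in [("scaffolded", scaffolded), ("algo_swap", algo_swap)]:
    sem = x.std(ddof=1) / np.sqrt(len(x))  # standard error of the mean
    print(f"{name}: {x.mean():.3f} ± {sem:.3f} (SEM, n={len(x)})")

# Paired t-test is valid because both conditions share the same seeds.
t_stat, p_value = stats.ttest_rel(scaffolded, algo_swap)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# 95% confidence interval for per-seed transfer deltas
# (diverse training minus narrow baseline), via a t-interval.
deltas = np.array([0.12, 0.09, 0.15, 0.11, 0.10])
lo, hi = stats.t.interval(0.95, df=len(deltas) - 1,
                          loc=deltas.mean(), scale=stats.sem(deltas))
print(f"mean delta = {deltas.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With n = 5 the t-interval is wide; that width, reported honestly, is exactly what the referee is asking to see.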
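Response 2's texture and lighting randomization is likewise easy to make concrete. A sketch under assumed parameter ranges, using only the standard library; the base palette entry and jitter bounds are illustrative choices, not values from Gym-V.

```python
# Illustrative hue/saturation jitter and light-direction sampling of the
# kind described in the rebuttal. All ranges here are assumptions.
import colorsys
import math
import random


def sample_visual_params(rng: random.Random,
                         hue_jitter: float = 0.05,
                         sat_jitter: float = 0.20):
    """Sample one jittered palette color and one light direction."""
    base_h, base_s, base_v = 0.58, 0.60, 0.90  # assumed base palette entry
    h = (base_h + rng.uniform(-hue_jitter, hue_jitter)) % 1.0
    s = min(max(base_s + rng.uniform(-sat_jitter, sat_jitter), 0.0), 1.0)
    rgb = colorsys.hsv_to_rgb(h, s, base_v)

    # Light direction: random azimuth, elevation kept above 30 degrees.
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = rng.uniform(math.pi / 6.0, math.pi / 2.0)
    light = (math.cos(azimuth) * math.cos(elevation),
             math.sin(azimuth) * math.cos(elevation),
             math.sin(elevation))
    return rgb, light


rng = random.Random(42)  # seeded for reproducible environment instances
color, light_dir = sample_visual_params(rng)
```

Seeding the generator per environment instance is what lets the same randomization serve both reproducibility and the visual variation the referee asks about.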

Circularity Check

0 steps flagged

No circularity: purely empirical claims from controlled experiments

Full rationale

The paper introduces the Gym-V platform and reports direct experimental outcomes showing that observation scaffolding (captions and rules) determines training success more than RL algorithm choice. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central findings are framed as observations from runs on the 179 procedurally generated environments, without any reduction of results to prior inputs by construction or self-referential definitions. This is a standard empirical contribution whose validity rests on the experiments themselves rather than any circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no explicit free parameters, axioms, or invented entities in the abstract; it relies on standard assumptions of reinforcement learning and procedural generation without stating new postulates.

pith-pipeline@v0.9.0 · 5499 in / 1019 out tokens · 37888 ms · 2026-05-15T10:00:33.440431+00:00 · methodology

