pith. machine review for the scientific record.

arxiv: 2605.02730 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI

Recognition: 3 theorem links

Perceptual Flow Network for Visually Grounded Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Perceptual Flow Network · visually grounded reasoning · large vision-language models · variational reinforcement learning · visual hallucination · self-conditioned generation

The pith

Perceptual Flow Network improves visually grounded reasoning by decoupling perception from reasoning and shaping the perceptual process with variational reinforcement learning rather than rigid expert priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that large vision-language models suffer from language bias and hallucinations because standard training does not properly constrain visual trajectories. Existing approaches add geometric priors from visual experts as supervision, but these are biased toward geometric accuracy rather than reasoning usefulness. PFlowNet addresses this by separating perception from reasoning into a self-conditioned generation process. It then combines multi-dimensional rewards with vicinal geometric shaping through variational reinforcement learning to encourage perceptual behaviors oriented toward reasoning while maintaining visual reliability. The result is presented as delivering both a theoretical performance guarantee and higher empirical accuracy on visual reasoning benchmarks.
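To make the mechanism concrete, the sketch below shows one way such a combination could be wired up, assuming a KL-regularized policy-gradient reading of "variational reinforcement learning"; the reward components, the Gaussian form of the vicinal term, and every name and weight (vicinal_shaping, multi_dim_reward, sigma, beta) are illustrative assumptions, not the paper's definitions.

```python
import torch

def vicinal_shaping(pred_box, expert_box, sigma=0.1):
    # Soft "vicinal" credit: instead of demanding exact alignment with the
    # expert's box, score the prediction against a Gaussian neighbourhood of
    # that box, so near-misses that still support the answer earn reward.
    dist = torch.norm(pred_box - expert_box, dim=-1)
    return torch.exp(-dist ** 2 / (2 * sigma ** 2))

def multi_dim_reward(answer_correct, pred_box, expert_box, format_ok,
                     weights=(1.0, 0.5, 0.1)):
    # Illustrative multi-dimensional reward: task outcome, grounding quality
    # via the vicinal term, and output-format compliance.
    w_ans, w_ground, w_fmt = weights
    return (w_ans * answer_correct
            + w_ground * vicinal_shaping(pred_box, expert_box)
            + w_fmt * format_ok)

def variational_rl_loss(logp_policy, logp_reference, reward, beta=0.05):
    # One common reading of a "variational" RL objective: maximise the shaped
    # reward while penalising KL drift from a reference (pre-RL) policy,
    # estimated per sample here as logp_policy - logp_reference.
    advantage = reward - reward.mean()
    kl = logp_policy - logp_reference
    return -(advantage.detach() * logp_policy - beta * kl).mean()
```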

Core claim

PFlowNet decouples perception from reasoning to create a self-conditioned generation process, then integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning. This produces reasoning-oriented perceptual behaviors while preserving visual reliability and yields a provable performance guarantee along with new state-of-the-art scores on V* Bench and MME-RealWorld-lite.

What carries the argument

The self-conditioned generation process in PFlowNet, which decouples perception from reasoning and applies vicinal geometric shaping through variational reinforcement learning to avoid rigid alignment with expert priors.
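A minimal sketch of what a decoupled, self-conditioned rollout could look like; perceive, reason, and PerceptionStep are hypothetical interfaces standing in for whatever machinery PFlowNet actually uses, and the two-stage split is the point being illustrated, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class PerceptionStep:
    regions: list    # candidate boxes the model chose to look at
    evidence: list   # short grounded descriptions of those regions

def self_conditioned_rollout(perceive, reason, image, question):
    # Stage 1: perception. The model proposes where to look and what it sees,
    # rather than being regressed onto an external expert's annotations.
    perception = perceive(image, question)          # -> PerceptionStep
    # Stage 2: reasoning, conditioned on the model's own perceptual output,
    # which is what makes the generation process "self-conditioned".
    answer = reason(image, question, perception)
    return perception, answer
```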

If this is right

  • The approach delivers a provable performance guarantee for the resulting model.
  • It reaches new state-of-the-art performance of 90.6 percent on V* Bench.
  • It reaches new state-of-the-art performance of 67.0 percent on MME-RealWorld-lite.
  • It enables reasoning-oriented perceptual behaviors while keeping visual outputs reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling strategy might allow separate tuning of perception modules in other multimodal systems without retraining the entire model.
  • Variational reinforcement learning for shaping perceptual flows could extend to balancing competing objectives in non-visual language tasks.
  • The method suggests a route to reduce hallucinations by prioritizing reasoning utility over strict geometric matching in additional visual benchmarks.

Load-bearing premise

That geometric priors from visual experts are suboptimal for reasoning utility and that vicinal geometric shaping via variational reinforcement learning will produce superior perceptual behaviors without reducing visual reliability.
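Written schematically, the premise contrasts a regression target with a neighbourhood: a rigid prior penalizes any deviation from the expert box, while vicinal shaping only asks the prediction to land in a useful vicinity of it and couples that with a reasoning reward. The notation below (expert box b_e, similarity kernel k, weight lambda, width sigma) is the reviewer's shorthand, not the paper's.

```latex
% Schematic contrast only -- not the paper's notation.
% Rigid expert alignment: penalise any deviation from the expert box b_e.
\mathcal{L}_{\mathrm{rigid}}(b) \;=\; \lVert b - b_e \rVert_1
% Vicinal shaping (in the spirit of vicinal risk minimisation): credit a
% smoothed neighbourhood of b_e via a similarity kernel k, combined with a
% task-level reasoning reward r_{\mathrm{reason}}.
\mathcal{R}_{\mathrm{vic}}(b) \;=\; r_{\mathrm{reason}}
  \;+\; \lambda \, \mathbb{E}_{b' \sim \mathcal{N}(b_e,\, \sigma^2 I)}
  \big[\, k(b, b') \big]
```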

What would settle it

An experiment in which models trained with rigid geometric priors from visual experts achieve higher accuracy than PFlowNet on the V* Bench or MME-RealWorld-lite benchmarks would undermine the central claim.
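A minimal sketch of that settling experiment, assuming nothing beyond two checkpoints trained on the same backbone with identical decoding settings; load_model and eval_fn are placeholders for whichever checkpoints and benchmark harness are actually used.

```python
def settle_experiment(load_model, eval_fn,
                      benchmarks=("V* Bench", "MME-RealWorld-lite")):
    # Head-to-head with everything but the training signal held fixed:
    # one model aligned rigidly to expert geometric priors, one trained
    # with PFlowNet-style vicinal shaping.
    models = {
        "rigid_prior": load_model("rigid_prior"),
        "pflownet": load_model("pflownet"),
    }
    scores = {name: {bench: eval_fn(model, bench) for bench in benchmarks}
              for name, model in models.items()}
    # The central claim is undermined if the rigid-prior model scores higher
    # on either benchmark.
    undermined = any(scores["rigid_prior"][b] > scores["pflownet"][b]
                     for b in benchmarks)
    return scores, undermined
```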

Original abstract

Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Perceptual Flow Network (PFlowNet) to address limitations in Large Vision-Language Models (LVLMs) where standard MLE optimization leads to language bias and hallucination. It observes that geometric priors from visual experts are suboptimal for reasoning utility due to bias toward geometric precision. PFlowNet decouples perception from reasoning via a self-conditioned generation process and integrates multi-dimensional rewards with vicinal geometric shaping using variational reinforcement learning. This is claimed to produce reasoning-oriented perceptual behaviors while preserving visual reliability, delivering a provable performance guarantee and new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Significance. If the claimed provable guarantee can be rigorously established under clearly stated assumptions and the SOTA empirical results are reproducible with proper controls for backbone choice and hyperparameter tuning, the work would offer a meaningful alternative to rigid expert-prior alignment in grounded reasoning tasks. The decoupling of perception and reasoning plus the variational RL formulation with vicinal shaping could influence methods for reducing hallucinations in LVLMs, provided the guarantee applies to downstream reasoning utility rather than only the surrogate objective.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'PFlowNet delivers a provable performance guarantee' is load-bearing for the paper's novelty yet provides no statement of what is proven (e.g., convergence rate, bound on hallucination rate, or optimality of the decoupled flow), no assumptions (e.g., bounded reward variance, Lipschitz continuity of the shaping term, or properties of the self-conditioned distribution), and no proof sketch or derivation. This prevents evaluation of whether the guarantee supports attribution of the reported SOTA numbers to the proposed mechanism.
  2. [Abstract] Abstract: The superiority claim rests on eschewing rigid alignment with expert geometric priors in favor of variational RL with vicinal shaping and multi-dimensional rewards, but the abstract supplies no indication of how the guarantee reduces to the fitted parameters or how the empirical results on V* Bench and MME-RealWorld-lite isolate the effect of the proposed shaping from backbone or tuning choices.
minor comments (1)
  1. [Abstract] Abstract: The term 'vicinal geometric shaping' is introduced without a brief definition or reference to its precise formulation, which may hinder immediate understanding of the method's novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our theoretical and empirical contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'PFlowNet delivers a provable performance guarantee' is load-bearing for the paper's novelty yet provides no statement of what is proven (e.g., convergence rate, bound on hallucination rate, or optimality of the decoupled flow), no assumptions (e.g., bounded reward variance, Lipschitz continuity of the shaping term, or properties of the self-conditioned distribution), and no proof sketch or derivation. This prevents evaluation of whether the guarantee supports attribution of the reported SOTA numbers to the proposed mechanism.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the theoretical claim. The full manuscript contains the detailed analysis in Section 4, which establishes a bound on the expected reasoning utility of the self-conditioned perceptual flow. The proof relies on standard variational RL convergence arguments under the assumptions of bounded reward variance and Lipschitz continuity of the vicinal shaping term. We will revise the abstract to include a concise statement of the proven guarantee, the key assumptions, and a pointer to the proof section (a schematic form of such a bound is sketched after these responses). revision: yes

  2. Referee: [Abstract] Abstract: The superiority claim rests on eschewing rigid alignment with expert geometric priors in favor of variational RL with vicinal shaping and multi-dimensional rewards, but the abstract supplies no indication of how the guarantee reduces to the fitted parameters or how the empirical results on V* Bench and MME-RealWorld-lite isolate the effect of the proposed shaping from backbone or tuning choices.

    Authors: The abstract is a high-level summary; the reduction of the guarantee to the learned parameters is derived explicitly in the variational objective of Section 4. For the empirical results, Section 5.3 reports controlled ablations that isolate the vicinal shaping and multi-dimensional reward components while holding the backbone model and hyperparameter settings fixed. We will add a sentence to the abstract noting that the reported SOTA numbers are supported by these ablations and the theoretical analysis. revision: yes
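The first response appeals to a bound proved under bounded reward variance and a Lipschitz vicinal shaping term. As a reading aid only, a guarantee of that general shape might look as follows; this is a reconstruction of the kind of statement being described, not the paper's theorem.

```latex
% Schematic form only -- not the paper's actual statement.
% Assumptions echoed from the response: reward variance at most \sigma^2,
% vicinal shaping term L-Lipschitz. A typical result would lower-bound the
% expected reasoning utility U of the learned flow \pi_\theta after T updates:
\mathbb{E}_{\tau \sim \pi_\theta}\big[\, U(\tau) \big]
  \;\ge\; \max_{\pi} \, \mathbb{E}_{\tau \sim \pi}\big[\, U(\tau) \big]
  \;-\; \mathcal{O}\!\left( \frac{L \,\sigma}{\sqrt{T}} \right)
```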

Circularity Check

0 steps flagged

No circularity detected; claims rest on asserted guarantee without self-referential reduction

full rationale

The abstract asserts a 'provable performance guarantee' and SOTA results from decoupling perception, multi-dimensional rewards, and variational RL with vicinal shaping, but supplies no equations, derivations, or self-citations that reduce the guarantee or empirical claims to fitted inputs or prior author results by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is described as building on (then modifying) external geometric priors, which is independent of the target claims. This is the common honest case of a self-contained high-level description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract-only review surfaces no explicit free parameters or axioms; the only invented entity is the proposed network itself.

invented entities (1)
  • Perceptual Flow Network (PFlowNet) · no independent evidence
    purpose: Decouples perception from reasoning to enable self-conditioned generation and reasoning-oriented perceptual behaviors
    New architecture introduced to address limitations of geometric priors.

pith-pipeline@v0.9.0 · 5505 in / 1103 out tokens · 26450 ms · 2026-05-08T18:37:37.373842+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
