pith. machine review for the scientific record.

arxiv: 2605.02730 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI

Recognition: 3 theorem links

Perceptual Flow Network for Visually Grounded Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Perceptual Flow Network · visually grounded reasoning · large vision-language models · variational reinforcement learning · visual hallucination · self-conditioned generation

The pith

Perceptual Flow Network improves visually grounded reasoning by decoupling perception from reasoning and shaping the perceptual process with variational reinforcement learning rather than rigid expert priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that large vision-language models suffer from language bias and hallucinations because standard training does not properly constrain visual trajectories. Existing approaches add geometric priors from visual experts as supervision, but these are biased toward geometric accuracy rather than reasoning usefulness. PFlowNet addresses this by separating perception from reasoning into a self-conditioned generation process. It then combines multi-dimensional rewards with vicinal geometric shaping through variational reinforcement learning to encourage perceptual behaviors oriented toward reasoning while maintaining visual reliability. The result is presented as delivering both a theoretical performance guarantee and higher empirical accuracy on visual reasoning benchmarks.
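To make the mechanism concrete, the sketch below shows one way such a combination could be wired up, assuming a KL-regularized policy-gradient reading of "variational reinforcement learning"; the reward components, the Gaussian form of the vicinal term, and every name and weight (vicinal_shaping, multi_dim_reward, sigma, beta) are illustrative assumptions, not the paper's definitions.

```python
import torch

def vicinal_shaping(pred_box, expert_box, sigma=0.1):
    # Soft "vicinal" credit: instead of demanding exact alignment with the
    # expert's box, score the prediction against a Gaussian neighbourhood of
    # that box, so near-misses that still support the answer earn reward.
    dist = torch.norm(pred_box - expert_box, dim=-1)
    return torch.exp(-dist ** 2 / (2 * sigma ** 2))

def multi_dim_reward(answer_correct, pred_box, expert_box, format_ok,
                     weights=(1.0, 0.5, 0.1)):
    # Illustrative multi-dimensional reward: task outcome, grounding quality
    # via the vicinal term, and output-format compliance.
    w_ans, w_ground, w_fmt = weights
    return (w_ans * answer_correct
            + w_ground * vicinal_shaping(pred_box, expert_box)
            + w_fmt * format_ok)

def variational_rl_loss(logp_policy, logp_reference, reward, beta=0.05):
    # One common reading of a "variational" RL objective: maximise the shaped
    # reward while penalising KL drift from a reference (pre-RL) policy,
    # estimated per sample here as logp_policy - logp_reference.
    advantage = reward - reward.mean()
    kl = logp_policy - logp_reference
    return -(advantage.detach() * logp_policy - beta * kl).mean()
```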

Core claim

PFlowNet decouples perception from reasoning to create a self-conditioned generation process, then integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning. This produces reasoning-oriented perceptual behaviors while preserving visual reliability and yields a provable performance guarantee along with new state-of-the-art scores on V* Bench and MME-RealWorld-lite.

What carries the argument

The self-conditioned generation process in PFlowNet, which decouples perception from reasoning and applies vicinal geometric shaping through variational reinforcement learning to avoid rigid alignment with expert priors.
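A minimal sketch of what a decoupled, self-conditioned rollout could look like; perceive, reason, and PerceptionStep are hypothetical interfaces standing in for whatever machinery PFlowNet actually uses, and the two-stage split is the point being illustrated, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class PerceptionStep:
    regions: list    # candidate boxes the model chose to look at
    evidence: list   # short grounded descriptions of those regions

def self_conditioned_rollout(perceive, reason, image, question):
    # Stage 1: perception. The model proposes where to look and what it sees,
    # rather than being regressed onto an external expert's annotations.
    perception = perceive(image, question)          # -> PerceptionStep
    # Stage 2: reasoning, conditioned on the model's own perceptual output,
    # which is what makes the generation process "self-conditioned".
    answer = reason(image, question, perception)
    return perception, answer
```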

If this is right

  • The approach delivers a provable performance guarantee for the resulting model.
  • It reaches new state-of-the-art performance of 90.6 percent on V* Bench.
  • It reaches new state-of-the-art performance of 67.0 percent on MME-RealWorld-lite.
  • It enables reasoning-oriented perceptual behaviors while keeping visual outputs reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling strategy might allow separate tuning of perception modules in other multimodal systems without retraining the entire model.
  • Variational reinforcement learning for shaping perceptual flows could extend to balancing competing objectives in non-visual language tasks.
  • The method suggests a route to reduce hallucinations by prioritizing reasoning utility over strict geometric matching in additional visual benchmarks.

Load-bearing premise

That geometric priors from visual experts are suboptimal for reasoning utility and that vicinal geometric shaping via variational reinforcement learning will produce superior perceptual behaviors without reducing visual reliability.
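Written schematically, the premise contrasts a regression target with a neighbourhood: a rigid prior penalizes any deviation from the expert box, while vicinal shaping only asks the prediction to land in a useful vicinity of it and couples that with a reasoning reward. The notation below (expert box b_e, similarity kernel k, weight lambda, width sigma) is the reviewer's shorthand, not the paper's.

```latex
% Schematic contrast only -- not the paper's notation.
% Rigid expert alignment: penalise any deviation from the expert box b_e.
\mathcal{L}_{\mathrm{rigid}}(b) \;=\; \lVert b - b_e \rVert_1
% Vicinal shaping (in the spirit of vicinal risk minimisation): credit a
% smoothed neighbourhood of b_e via a similarity kernel k, combined with a
% task-level reasoning reward r_{\mathrm{reason}}.
\mathcal{R}_{\mathrm{vic}}(b) \;=\; r_{\mathrm{reason}}
  \;+\; \lambda \, \mathbb{E}_{b' \sim \mathcal{N}(b_e,\, \sigma^2 I)}
  \big[\, k(b, b') \big]
```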

What would settle it

An experiment in which models trained with rigid geometric priors from visual experts achieve higher accuracy than PFlowNet on the V* Bench or MME-RealWorld-lite benchmarks would undermine the central claim.
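A minimal sketch of that settling experiment, assuming nothing beyond two checkpoints trained on the same backbone with identical decoding settings; load_model and eval_fn are placeholders for whichever checkpoints and benchmark harness are actually used.

```python
def settle_experiment(load_model, eval_fn,
                      benchmarks=("V* Bench", "MME-RealWorld-lite")):
    # Head-to-head with everything but the training signal held fixed:
    # one model aligned rigidly to expert geometric priors, one trained
    # with PFlowNet-style vicinal shaping.
    models = {
        "rigid_prior": load_model("rigid_prior"),
        "pflownet": load_model("pflownet"),
    }
    scores = {name: {bench: eval_fn(model, bench) for bench in benchmarks}
              for name, model in models.items()}
    # The central claim is undermined if the rigid-prior model scores higher
    # on either benchmark.
    undermined = any(scores["rigid_prior"][b] > scores["pflownet"][b]
                     for b in benchmarks)
    return scores, undermined
```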

Original abstract

Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Perceptual Flow Network (PFlowNet) to address limitations in Large Vision-Language Models (LVLMs) where standard MLE optimization leads to language bias and hallucination. It observes that geometric priors from visual experts are suboptimal for reasoning utility due to bias toward geometric precision. PFlowNet decouples perception from reasoning via a self-conditioned generation process and integrates multi-dimensional rewards with vicinal geometric shaping using variational reinforcement learning. This is claimed to produce reasoning-oriented perceptual behaviors while preserving visual reliability, delivering a provable performance guarantee and new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Significance. If the claimed provable guarantee can be rigorously established under clearly stated assumptions and the SOTA empirical results are reproducible with proper controls for backbone choice and hyperparameter tuning, the work would offer a meaningful alternative to rigid expert-prior alignment in grounded reasoning tasks. The decoupling of perception and reasoning plus the variational RL formulation with vicinal shaping could influence methods for reducing hallucinations in LVLMs, provided the guarantee applies to downstream reasoning utility rather than only the surrogate objective.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'PFlowNet delivers a provable performance guarantee' is load-bearing for the paper's novelty yet provides no statement of what is proven (e.g., convergence rate, bound on hallucination rate, or optimality of the decoupled flow), no assumptions (e.g., bounded reward variance, Lipschitz continuity of the shaping term, or properties of the self-conditioned distribution), and no proof sketch or derivation. This prevents evaluation of whether the guarantee supports attribution of the reported SOTA numbers to the proposed mechanism.
  2. [Abstract] Abstract: The superiority claim rests on eschewing rigid alignment with expert geometric priors in favor of variational RL with vicinal shaping and multi-dimensional rewards, but the abstract supplies no indication of how the guarantee reduces to the fitted parameters or how the empirical results on V* Bench and MME-RealWorld-lite isolate the effect of the proposed shaping from backbone or tuning choices.
minor comments (1)
  1. [Abstract] Abstract: The term 'vicinal geometric shaping' is introduced without a brief definition or reference to its precise formulation, which may hinder immediate understanding of the method's novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our theoretical and empirical contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'PFlowNet delivers a provable performance guarantee' is load-bearing for the paper's novelty yet provides no statement of what is proven (e.g., convergence rate, bound on hallucination rate, or optimality of the decoupled flow), no assumptions (e.g., bounded reward variance, Lipschitz continuity of the shaping term, or properties of the self-conditioned distribution), and no proof sketch or derivation. This prevents evaluation of whether the guarantee supports attribution of the reported SOTA numbers to the proposed mechanism.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the theoretical claim. The full manuscript contains the detailed analysis in Section 4, which establishes a bound on the expected reasoning utility of the self-conditioned perceptual flow. The proof relies on standard variational RL convergence arguments under the assumptions of bounded reward variance and Lipschitz continuity of the vicinal shaping term. We will revise the abstract to include a concise statement of the proven guarantee, the key assumptions, and a pointer to the proof section (a schematic form of such a bound is sketched after these responses). revision: yes

  2. Referee: [Abstract] Abstract: The superiority claim rests on eschewing rigid alignment with expert geometric priors in favor of variational RL with vicinal shaping and multi-dimensional rewards, but the abstract supplies no indication of how the guarantee reduces to the fitted parameters or how the empirical results on V* Bench and MME-RealWorld-lite isolate the effect of the proposed shaping from backbone or tuning choices.

    Authors: The abstract is a high-level summary; the reduction of the guarantee to the learned parameters is derived explicitly in the variational objective of Section 4. For the empirical results, Section 5.3 reports controlled ablations that isolate the vicinal shaping and multi-dimensional reward components while holding the backbone model and hyperparameter settings fixed. We will add a sentence to the abstract noting that the reported SOTA numbers are supported by these ablations and the theoretical analysis. revision: yes
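The first response appeals to a bound proved under bounded reward variance and a Lipschitz vicinal shaping term. As a reading aid only, a guarantee of that general shape might look as follows; this is a reconstruction of the kind of statement being described, not the paper's theorem.

```latex
% Schematic form only -- not the paper's actual statement.
% Assumptions echoed from the response: reward variance at most \sigma^2,
% vicinal shaping term L-Lipschitz. A typical result would lower-bound the
% expected reasoning utility U of the learned flow \pi_\theta after T updates:
\mathbb{E}_{\tau \sim \pi_\theta}\big[\, U(\tau) \big]
  \;\ge\; \max_{\pi} \, \mathbb{E}_{\tau \sim \pi}\big[\, U(\tau) \big]
  \;-\; \mathcal{O}\!\left( \frac{L \,\sigma}{\sqrt{T}} \right)
```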

Circularity Check

0 steps flagged

No circularity detected; claims rest on asserted guarantee without self-referential reduction

full rationale

The abstract asserts a 'provable performance guarantee' and SOTA results from decoupling perception, multi-dimensional rewards, and variational RL with vicinal shaping, but supplies no equations, derivations, or self-citations that reduce the guarantee or empirical claims to fitted inputs or prior author results by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is described as building on (then modifying) external geometric priors, which is independent of the target claims. This is the common honest case of a self-contained high-level description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract-only review surfaces no explicit free parameters or axioms; the only invented entity is the proposed network itself.

invented entities (1)
  • Perceptual Flow Network (PFlowNet) · no independent evidence
    purpose: Decouples perception from reasoning to enable self-conditioned generation and reasoning-oriented perceptual behaviors
    New architecture introduced to address limitations of geometric priors.

pith-pipeline@v0.9.0 · 5505 in / 1103 out tokens · 26450 ms · 2026-05-08T18:37:37.373842+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
