Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains
Pith reviewed 2026-05-16 09:52 UTC · model grok-4.3
The pith
Lang2Act reports gains of over 4% in VLM visual perception by letting models self-generate linguistic toolchains through two-stage RL training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lang2Act enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains, achieving performance improvements of over 4%.
Load-bearing premise
That self-emergent actions discovered in the first RL stage form a stable, reusable linguistic toolbox that the second stage can reliably exploit without external fixed tools or loss of visual fidelity.
Original abstract
Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lang2Act, a two-stage RL framework for VLMs in VRAG tasks. Stage 1 optimizes the model to self-explore and collect high-quality actions into a reusable linguistic toolbox; stage 2 optimizes the model to exploit those tools for fine-grained visual perception and reasoning. The method claims to avoid visual information loss from external fixed tools and reports performance gains exceeding 4%.
Significance. If the self-emergent linguistic toolchains prove stable across seeds and causally responsible for the gains (rather than arising from extra RL steps alone), the work would offer a meaningful step toward autonomous tool discovery in VLMs, reducing reliance on hand-crafted external modules while preserving visual fidelity. The two-stage separation of exploration and exploitation is a clean design choice that could generalize to other perception-reasoning pipelines.
major comments (2)
- [Abstract] The central performance claim of 'over 4%' gains is stated without any reference to baselines, datasets, metrics, error bars, or statistical tests, making it impossible to assess whether the linguistic-toolchain mechanism is responsible for the improvement.
- [Abstract / Method] The description of stage 1 ('optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox') and stage 2 ('further optimizes VLMs to exploit these linguistic tools') supplies no evidence on action consistency across random seeds, action diversity, or an ablation that removes the toolbox while retaining the second-stage RL updates; without such controls the >4% gain could be explained by additional policy-gradient steps alone.
minor comments (1)
- The GitHub link is given but the abstract contains no statement on code release, seed reporting, or hyperparameter sensitivity, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our results. We will revise the abstract to include specific baselines, datasets, metrics, and statistical details. We will also add the requested controls and ablations in the experiments section to isolate the contribution of the self-emergent linguistic toolbox.
Point-by-point responses
- Referee: [Abstract] The central performance claim of 'over 4%' gains is stated without any reference to baselines, datasets, metrics, error bars, or statistical tests, making it impossible to assess whether the linguistic-toolchain mechanism is responsible for the improvement.
  Authors: We agree that the abstract is currently underspecified. In the revision we will explicitly name the baselines (standard VRAG pipelines without linguistic toolchains), the evaluation datasets, and the primary metrics, and we will report mean performance with standard deviation across three random seeds. This will make the >4% claim directly interpretable and allow readers to judge whether the mechanism is responsible.
  Revision: yes
- Referee: [Abstract / Method] The description of stage 1 ('optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox') and stage 2 ('further optimizes VLMs to exploit these linguistic tools') supplies no evidence on action consistency across random seeds, action diversity, or an ablation that removes the toolbox while retaining the second-stage RL updates; without such controls the >4% gain could be explained by additional policy-gradient steps alone.
  Authors: We acknowledge the need for these controls. The revised manuscript will include: (1) action-consistency statistics (Jaccard overlap of generated toolchains) across three seeds, (2) diversity metrics (unique action types and entropy), and (3) an ablation that runs the identical second-stage RL schedule without access to the collected linguistic toolbox. The ablation will be reported alongside the full Lang2Act results so that any remaining gain can be attributed to the toolbox rather than extra gradient steps.
  Revision: yes
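The consistency and diversity statistics named in this response can be computed in a few lines. The sketch below is illustrative only and is not taken from the Lang2Act codebase. It assumes each seed's toolbox is a set of tool names and an action trace is a list of tool names; the function names (`jaccard`, `mean_pairwise_jaccard`, `action_entropy`) and the per-seed data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| of two collections of tool names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty toolboxes are identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(toolboxes):
    """Average Jaccard overlap over all unordered pairs of seeds."""
    pairs = list(combinations(toolboxes, 2))
    if not pairs:
        raise ValueError("need at least two toolboxes to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical tool-usage distribution."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical toolboxes from three seeds (illustrative data, not paper results).
seed_toolboxes = [
    {"locate_table_row", "read_cell_value", "subtract_values"},
    {"locate_table_row", "read_cell_value", "map_bar_to_axis"},
    {"locate_table_row", "read_cell_value", "subtract_values"},
]
overlap = mean_pairwise_jaccard(seed_toolboxes)

# Hypothetical action trace from one rollout.
trace = ["locate_table_row", "read_cell_value", "locate_table_row", "subtract_values"]
entropy = action_entropy(trace)
unique_actions = len(set(trace))

print(f"mean pairwise Jaccard: {overlap:.3f}")
print(f"unique action types: {unique_actions}, entropy: {entropy:.2f} bits")
```

A mean pairwise overlap near 1 would suggest the discovered toolbox is stable across seeds, while a low entropy would indicate that the policy concentrates on a small number of tools.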
Circularity Check
No significant circularity: standard two-stage RL with empirical validation
The paper presents a two-stage RL training procedure for VLMs to discover and exploit linguistic actions. No equations, derivations, or fitted parameters are shown that reduce to their own inputs by construction. The method relies on standard policy optimization rather than self-definitional loops, imported uniqueness theorems, or ansatzes smuggled via self-citation. Performance gains are reported from experiments, not from renaming known results or treating fitted inputs as predictions. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward and exploration hyperparameters
axioms (1)
- domain assumption: Reinforcement learning can optimize VLMs to explore high-quality self-emergent actions and then exploit them for reasoning
invented entities (1)
- self-emergent linguistic toolchains: no independent evidence
Forward citations
Cited by 1 Pith paper
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
  CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
Reference graph
Works this paper leans on
- [1] Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zha...
- [2] DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others. 2025. Dapo: An open-source llm reinforcement learning system at scale. ArXiv preprint, abs/2503.14476. Shi Yu, Cha...
- [3] Structure Awareness: Tools must reflect layout (rows, columns, axes)
- [4] Atomic Data Extraction: Locate region first, then extract data
- [5] Analytical Calculation: Define precise math tools (subtract, rank_values).
  ### CURRENT TOOL POOL
  {tool_pool_text}
  ### GUIDELINES: DESIGNING DOCUMENT TOOLS
  - Tables/Grids: Navigate rows/columns (e.g., locate_table_row, read_cell_value).
  - Charts/Graphs: Map visuals to values (e.g., map_bar_to_axis).
  - Reasoning: Define specific logic tools for calculation/...
- [6] New Definitions: DEFINE_TOOL: name || args || desc
- [7]
- [8] End: END_OF_TOOLS
  ### EXAMPLE (Structure & Math)
  DESC: Found ’Q3 Revenue’, read value, compared to ’Q2’, calculated growth.
  OUTPUT:
  DEFINE_TOOL: subtract_values || val1, val2 || Calculate difference.
  <tool name="locate_table_row" args="row ’Q3 Revenue’">Row 4</tool>
  <tool name="read_cell_value" args="Row 4, col ’Amount’">$150M</tool>
  <tool name="subtract_v...
- [9] In <think>, analyze all {num_images} images and state which one(s) contain relevant evidence
- [10] In <description>, focus only on the selected images and describe your reasoning process using the tools below
- [11] In <answer>, provide only the final, concise answer grounded in visual evidence.
  Available Tools for <description>:
  – <tool name="locate_visual_element" args="Image k: structural hint"> Locate specific visual elements or regions based on structural hints. </tool>
  – <tool name="read_text_element" args="Image k: locator/region"> Read and transcribe visible ...
- [12] Coherence (Reasoning Logic & Fluency)
  Definition: Evaluates whether the chain of thought is logically sound, structured, and easy to follow. (1 Star: Disjointed/Confusing; 3 Stars: Readable but with leaps; 5 Stars: Perfectly smooth/Logical)
- [13] Non-Hallucination (Faithfulness)
  Definition: Evaluates whether the response contains fabricated information. (1 Star: Major fabrications; 3 Stars: Minor errors; 5 Stars: Entirely truthful)
- [14] Factual Consistency (Alignment with Gold Answer)
  Definition: Evaluates whether the model’s final conclusion aligns with the Gold Answer. (1 Star: Contradictory; 3 Stars: Partially consistent; 5 Stars: Fully consistent).
  Output Format: Please strictly follow this format:
  Coherence: [Score]
  Non-Hallucination: [Score]
  Factual Consistency: [Score]
  Average: [Av...
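The prompt fragments quoted above define two plain-text formats: the tool markup (`DEFINE_TOOL: name || args || desc` and `<tool name="..." args="...">result</tool>`) and the judge's three-line score output. As a rough sketch of how such text could be parsed for analysis, the regular expressions and function names below are assumptions made for illustration, not the paper's implementation. The sample scores are invented, and the sample trace uses only the two tool calls that appear in full in the excerpt, written with ASCII apostrophes.

```python
import re

# Patterns mirror the formats quoted in the excerpts; parts the excerpts
# truncate are deliberately not covered.
TOOL_CALL = re.compile(r'<tool name="([^"]+)" args="([^"]*)">(.*?)</tool>', re.DOTALL)
DEFINE_TOOL = re.compile(r"DEFINE_TOOL:\s*(\w+)\s*\|\|\s*(.*?)\s*\|\|\s*(.*)")
SCORE = re.compile(r"(Coherence|Non-Hallucination|Factual Consistency):\s*\[?(\d)\]?")


def parse_definitions(text):
    """Return (name, args, description) for every DEFINE_TOOL line."""
    return [(n, a, d.strip()) for n, a, d in DEFINE_TOOL.findall(text)]


def parse_tool_calls(text):
    """Return (name, args, result) for every complete <tool>...</tool> call."""
    return [(n, a, r.strip()) for n, a, r in TOOL_CALL.findall(text)]


def parse_scores(text):
    """Return the three rubric scores and their recomputed average."""
    scores = {name: int(value) for name, value in SCORE.findall(text)}
    if len(scores) != 3:
        raise ValueError("expected exactly three rubric scores")
    # Recompute the average instead of trusting the judge's own arithmetic.
    return scores, sum(scores.values()) / 3


sample_trace = """DEFINE_TOOL: subtract_values || val1, val2 || Calculate difference.
<tool name="locate_table_row" args="row 'Q3 Revenue'">Row 4</tool>
<tool name="read_cell_value" args="Row 4, col 'Amount'">$150M</tool>
"""

# Hypothetical judge output (scores invented for the example).
sample_judgement = """Coherence: 4
Non-Hallucination: 5
Factual Consistency: 3
Average: 4.0
"""

definitions = parse_definitions(sample_trace)
calls = parse_tool_calls(sample_trace)
scores, average = parse_scores(sample_judgement)

print(definitions)
print(calls)
print(scores, average)
```

Extracting tool names this way would also yield the per-seed tool sets needed for overlap or diversity statistics.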