arxiv: 2602.13235 · v2 · submitted 2026-01-29 · 💻 cs.AI · cs.CV

Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Pith reviewed 2026-05-16 09:52 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords: visual, vlms, lang2act, linguistic, perception, reasoning, capabilities, external

The pith

Lang2Act raises VLM visual-perception performance by over 4% through two-stage RL training that lets models self-generate linguistic toolchains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models process both images and text but often lose key details when they hand off images to separate fixed tools like croppers before reasoning. Lang2Act instead lets the model invent its own set of actions, described in plain language, that act as tools for zooming in or examining specific parts of an image. Training uses reinforcement learning in two phases. The first phase rewards the model for finding high-quality actions that build a shared toolbox. The second phase teaches the model to pick and apply those tools effectively when answering questions about new images. Because the tools stay inside the model's own language space, no external step strips away visual information. Experiments report more than 4% better results on visual tasks compared to prior methods that rely on rigid external engines.

Core claim

Lang2Act enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains, achieving performance improvements of over 4%.

Load-bearing premise

That self-emergent actions discovered in the first RL stage form a stable, reusable linguistic toolbox that the second stage can reliably exploit without external fixed tools or loss of visual fidelity.

Figures

Figures reproduced from arXiv: 2602.13235 by Chunyi Peng, Ge Yu, Shuo Wang, Yu Gu, Yukun Yan, Yuqi Xiong, Zhenghao Liu, Zhipeng Xu, Zulong Chen.

Figure 1: Comparison between VRAG-RL and the Lang2Act framework.
Excerpt: …eration (Ram et al., 2023). To extend the benefits of RAG to visual documents, existing methods (Yu et al., 2024; Faysse et al., 2024) typically adopt an end-to-end Visual Retrieval-Augmented Generation (VRAG) modeling paradigm. This design avoids the error propagation introduced by text-based RAG pipelines that rely on optical character recognitio…
Figure 2: The overview architecture of Lang2Act.
Excerpt: …set $T_{\text{pool}}^{(i-1)}$. And then the tool pool is updated using the produced tool set $S_i$:
$$T_{\text{pool}}^{(i)} = T_{\text{pool}}^{(i-1)} \cup S_i, \tag{5}$$
where $T_{\text{pool}}^{(i)}$ denotes the tool pool after processing trajectory $\tau_i$, and the pool is initialized as $T_{\text{pool}}^{(0)} = \emptyset$. After processing all $n$ trajectories, we get the final tool set $T_{\text{pool}}^{(n)}$ and select the top-K most frequently used tools from t…
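The pool update in Eq. (5) and the top-K selection can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `build_tool_pool`, the list-of-names representation of each tool set, and the choice to measure "most frequently used" by counting how often a tool name appears across the per-trajectory tool sets are all assumptions, since the excerpt is truncated before it defines the frequency.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def build_tool_pool(
    tool_sets: Iterable[Iterable[str]], k: int
) -> Tuple[Set[str], List[str]]:
    """Accumulate T_pool^(i) = T_pool^(i-1) ∪ S_i, then pick the top-K tools.

    Assumption: usage frequency is the number of times a tool name
    appears across the per-trajectory tool sets S_i.
    """
    pool: Set[str] = set()        # T_pool^(0) = ∅
    usage: Counter = Counter()
    for s_i in tool_sets:         # one S_i per trajectory τ_i
        names = list(s_i)
        pool |= set(names)        # Eq. (5): union with the new tool set
        usage.update(names)
    # Sort by count (descending), then by name, so ties are deterministic.
    ranked = sorted(usage.items(), key=lambda kv: (-kv[1], kv[0]))
    top_k = [name for name, _ in ranked[:k]]
    return pool, top_k


if __name__ == "__main__":
    sets = [
        ["locate_table_row", "read_cell_value"],
        ["locate_table_row", "subtract_values"],
        ["locate_table_row", "read_cell_value", "map_bar_to_axis"],
    ]
    pool, top = build_tool_pool(sets, k=2)
    print(sorted(pool))
    print(top)
```

Breaking ties by name is a design choice made here so that repeated runs give the same toolbox; the paper may rank ties differently.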
Figure 3: Quantitative analysis of image perception
Figure 4: Case Study on SlideVQA. The red box indicates the ground truth region of the given image.
Figure 5: Distribution of sample difficulty before and…
Figure 6: Large language model–based evaluation of response quality under successful attention perception. All…
Figure 7: Prompt of Vanilla.
Excerpt: Action RL Prompt. System Prompt: You are a specialized AI assistant for visual question answering based on multiple provided document images. Your task is to answer the user’s question by carefully analyzing all images. Your response must strictly follow this format: <think>...</think> <description>...</description> <answer>...</answer> Guidance: - You have exactly {num_images} image(s).…
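The Action RL prompt requires a response of the form `<think>...</think> <description>...</description> <answer>...</answer>`. A format check of this kind is a common ingredient of RL reward functions, so a minimal sketch is given here. The name `parse_response` is hypothetical, and the strictness rule (the three sections in order, with only whitespace between them) is an assumption; the source does not say how the authors validate the format.

```python
import re
from typing import Dict, Optional

_TAGS = ("think", "description", "answer")


def parse_response(text: str) -> Optional[Dict[str, str]]:
    """Return the three sections, or None if the format is violated.

    Assumption: sections must appear exactly once, in order, with only
    whitespace between and around them.
    """
    pattern = r"\s*".join(rf"<{t}>(.*?)</{t}>" for t in _TAGS)
    match = re.fullmatch(rf"\s*{pattern}\s*", text, flags=re.DOTALL)
    if match is None:
        return None
    return {tag: body.strip() for tag, body in zip(_TAGS, match.groups())}


if __name__ == "__main__":
    reply = (
        "<think>Image 2 is relevant.</think>\n"
        "<description>Row 4 shows $150M.</description>\n"
        "<answer>$150M</answer>"
    )
    print(parse_response(reply))
```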
Figure 8: Prompt of Action RL
Figure 9: Prompt of Tools Curation
Figure 10: Prompt of Lang2Act
Figure 11: Prompt of EVisRAG for evidence-structured visual question answering.
Figure 12: Prompt of Tree-of-Thoughts (ToT)
Figure 13: Prompt of Graph-of-Thoughts (GOT).
Excerpt: VRAG-RL Prompt. System Prompt: Answer the given question. You must conduct reasoning inside <think> and </think> first every time you get new information. After reasoning, if you lack knowledge, you may call a search engine via <search> query </search>. When an image is retrieved, you may crop it using <bbox>[x1, y1, x2, y2]</bbox>. Repeat as needed. If no further knowle…
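For contrast with Lang2Act's in-language tools, the external actions named in the VRAG-RL prompt (`<search>` and `<bbox>`) could be extracted from a model turn roughly as follows. This is a sketch under assumptions: `parse_action` is a hypothetical helper, the box is assumed to be a JSON-style list with `x1 < x2` and `y1 < y2`, and search is given priority when both tags appear. The prompt is truncated in the source, so any further actions it defines are not handled.

```python
import json
import re
from typing import List, Optional, Tuple, Union

Action = Tuple[str, Union[str, List[float]]]


def _is_number(value: object) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_action(text: str) -> Optional[Action]:
    """Return ("search", query) or ("bbox", [x1, y1, x2, y2]); else None."""
    search = re.search(r"<search>(.*?)</search>", text, flags=re.DOTALL)
    if search:
        return ("search", search.group(1).strip())
    bbox = re.search(r"<bbox>(.*?)</bbox>", text, flags=re.DOTALL)
    if bbox is None:
        return None
    try:
        coords = json.loads(bbox.group(1))
    except json.JSONDecodeError:
        return None
    if isinstance(coords, list) and len(coords) == 4 and all(map(_is_number, coords)):
        x1, y1, x2, y2 = coords
        # Assumption: (x1, y1) is the top-left corner, (x2, y2) the bottom-right.
        if x1 < x2 and y1 < y2:
            return ("bbox", coords)
    return None


if __name__ == "__main__":
    print(parse_action("<think>zoom in</think><bbox>[10, 20, 110, 220]</bbox>"))
    print(parse_action("<search> Q3 revenue slide </search>"))
```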
Figure 14: Prompt of VRAG-RL
Figure 15: Prompt of PixelReasoner.
Excerpt: VLM Judge Prompt. System Prompt: You are an expert evaluation system for a question answering chatbot. You will be given one evaluation item. You will see a query, a reference answer, and a generated answer. Your task is to evaluate the correctness of the generated answer. Your response MUST be exactly one line, formatted as <judge>True</judge> if the generated answer is correct, …
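The one-line judge verdict can be turned into an accuracy score with a small parser. The sketch below is illustrative only: the prompt is truncated before it states the negative label, so `<judge>False</judge>` is an assumption, as are the helper names `parse_judge` and `accuracy` and the choice to count malformed verdicts as incorrect.

```python
import re
from typing import Iterable, Optional


def parse_judge(line: str) -> Optional[bool]:
    """Map a verdict line to True/False, or None if it is malformed.

    Assumption: the negative verdict is <judge>False</judge>.
    """
    match = re.fullmatch(r"\s*<judge>(True|False)</judge>\s*", line)
    if match is None:
        return None
    return match.group(1) == "True"


def accuracy(verdicts: Iterable[str]) -> float:
    """Fraction judged True; malformed verdicts count as incorrect."""
    parsed = [parse_judge(v) for v in verdicts]
    if not parsed:
        return 0.0
    return sum(1 for p in parsed if p is True) / len(parsed)


if __name__ == "__main__":
    outputs = ["<judge>True</judge>", "<judge>False</judge>", "unsure"]
    print(accuracy(outputs))
```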
Figure 16: Prompt of the automatic judge for single-item evaluation.
Figure 17: Prompt used for the automatic evaluation of reasoning coherence, faithfulness, and factual consistency.
Original abstract

Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Lang2Act, a two-stage RL framework for VLMs in VRAG tasks. Stage 1 optimizes the model to self-explore and collect high-quality actions into a reusable linguistic toolbox; stage 2 optimizes the model to exploit those tools for fine-grained visual perception and reasoning. The method claims to avoid visual information loss from external fixed tools and reports performance gains exceeding 4%.

Significance. If the self-emergent linguistic toolchains prove stable across seeds and causally responsible for the gains (rather than arising from extra RL steps alone), the work would offer a meaningful step toward autonomous tool discovery in VLMs, reducing reliance on hand-crafted external modules while preserving visual fidelity. The two-stage separation of exploration and exploitation is a clean design choice that could generalize to other perception-reasoning pipelines.

major comments (2)
  1. [Abstract] The central performance claim of 'over 4%' gains is stated without any reference to baselines, datasets, metrics, error bars, or statistical tests, making it impossible to assess whether the linguistic-toolchain mechanism is responsible for the improvement.
  2. [Abstract / Method] The description of stage 1 ('optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox') and stage 2 ('further optimizes VLMs to exploit these linguistic tools') supplies no evidence on action consistency across random seeds, action diversity, or an ablation that removes the toolbox while retaining the second-stage RL updates; without such controls the >4% gain could be explained by additional policy-gradient steps alone.
minor comments (1)
  1. The abstract links to the GitHub repository and states that code and data are available, but it says nothing about seed reporting or hyperparameter sensitivity, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We will revise the abstract to include specific baselines, datasets, metrics, and statistical details. We will also add the requested controls and ablations in the experiments section to isolate the contribution of the self-emergent linguistic toolbox.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim of 'over 4%' gains is stated without any reference to baselines, datasets, metrics, error bars, or statistical tests, making it impossible to assess whether the linguistic-toolchain mechanism is responsible for the improvement.

    Authors: We agree that the abstract is currently underspecified. In the revision we will explicitly name the baselines (standard VRAG pipelines without linguistic toolchains), the evaluation datasets, the primary metrics, and report mean performance with standard deviation across three random seeds. This will make the >4% claim directly interpretable and allow readers to judge whether the mechanism is responsible. revision: yes

  2. Referee: [Abstract / Method] The description of stage 1 ('optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox') and stage 2 ('further optimizes VLMs to exploit these linguistic tools') supplies no evidence on action consistency across random seeds, action diversity, or an ablation that removes the toolbox while retaining the second-stage RL updates; without such controls the >4% gain could be explained by additional policy-gradient steps alone.

    Authors: We acknowledge the need for these controls. The revised manuscript will include: (1) action-consistency statistics (Jaccard overlap of generated toolchains) across three seeds, (2) diversity metrics (unique action types and entropy), and (3) an ablation that runs the identical second-stage RL schedule without access to the collected linguistic toolbox. The ablation will be reported alongside the full Lang2Act results so that any remaining gain can be attributed to the toolbox rather than extra gradient steps. revision: yes
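The consistency and diversity statistics named in the rebuttal (Jaccard overlap of toolchains across seeds, unique action types, and entropy) are standard quantities. A minimal sketch follows, assuming each seed's toolbox is a set of tool names and each run's actions are a flat list of action-type strings. The function names and the use of the mean pairwise Jaccard as the summary are assumptions, not the authors' protocol.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(toolboxes: Sequence[Set[str]]) -> float:
    """Average Jaccard overlap over all pairs of seeds."""
    pairs = list(combinations(toolboxes, 2))
    if not pairs:
        raise ValueError("need at least two toolboxes")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def action_entropy(actions: Iterable[str]) -> float:
    """Shannon entropy (bits) of the empirical action-type distribution."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    seeds = [
        {"locate_table_row", "read_cell_value"},
        {"locate_table_row", "read_cell_value"},
        {"locate_table_row"},
    ]
    print(mean_pairwise_jaccard(seeds))
    print(action_entropy(["locate_table_row", "read_cell_value"] * 2))
```

A high mean overlap would support the premise that stage 1 yields a stable toolbox; a low one would suggest the discovered tools depend heavily on the seed.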

Circularity Check

0 steps flagged

No significant circularity: standard two-stage RL with empirical validation

full rationale

The paper presents a two-stage RL training procedure for VLMs to discover and exploit linguistic actions. No equations, derivations, or fitted parameters are shown that reduce to their own inputs by construction. The method relies on standard policy optimization rather than self-definitional loops, imported uniqueness theorems, or ansatzes smuggled via self-citation. Performance gains are reported from experiments, not from renaming known results or treating fitted inputs as predictions. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that RL can discover and stabilize useful linguistic actions without external supervision, plus standard VLM and RL background.

free parameters (1)
  • RL reward and exploration hyperparameters
    Tuned values control action discovery in stage one and tool usage in stage two; exact values not stated in abstract.
axioms (1)
  • domain assumption: Reinforcement learning can optimize VLMs to explore high-quality self-emergent actions and then exploit them for reasoning
    Invoked to justify the two-stage training framework.
invented entities (1)
  • self-emergent linguistic toolchains (no independent evidence)
    purpose: Reusable set of language-described actions that replace fixed external visual tools
    New construct introduced to maintain visual information inside the model.

pith-pipeline@v0.9.0 · 5551 in / 1231 out tokens · 29501 ms · 2026-05-16T09:52:42.465536+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zha...

  2. [2]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others. 2025. Dapo: An open-source llm reinforcement learning system at scale. ArXiv preprint, abs/2503.14476. Shi Yu, Cha...

  3. [3]

    Structure Awareness: Tools must reflect layout (rows, columns, axes)

  4. [4]

    Atomic Data Extraction: Locate region first, then extract data

  5. [5]

    ### CURRENT TOOL POOL {tool_pool_text} ### GUIDELINES: DESIGNING DOCUMENT TOOLS - Tables/Grids: Navigate rows/columns (e.g., locate_table_row, read_cell_value)

    Analytical Calculation: Define precise math tools (subtract, rank_values). ### CURRENT TOOL POOL {tool_pool_text} ### GUIDELINES: DESIGNING DOCUMENT TOOLS - Tables/Grids: Navigate rows/columns (e.g., locate_table_row, read_cell_value). - Charts/Graphs: Map visuals to values (e.g., map_bar_to_axis). - Reasoning: Define specific logic tools for calculation/...

  6. [6]

    New Definitions: DEFINE_TOOL: name || args || desc

  7. [7]

    " args=

    Applications: <tool name="..." args="...">reasoning</tool>

  8. [8]

    locate_table_row

    End: END_OF_TOOLS ### EXAMPLE (Structure & Math) DESC: Found ’Q3 Revenue’, read value, compared to ’Q2’, calculated growth. OUTPUT: DEFINE_TOOL: subtract_values || val1, val2 || Calculate difference. <tool name="locate_table_row" args="row ’Q3 Revenue’">Row 4</tool> <tool name="read_cell_value" args="Row 4, col ’Amount’">$150M</tool> <tool name="subtract_v...

  9. [9]

    In <think>, analyze all {num_images} images and state which one(s) contain relevant evidence

  10. [10]

    In <description>, focus only on the selected images and describe your reasoning process using the tools below

  11. [11]

    locate_visual_element

    In <answer>, provide only the final, concise answer grounded in visual evidence. Available Tools for <description>: – <tool name="locate_visual_element" args="Image k: structural hint"> Locate specific visual elements or regions based on structural hints. </tool> – <tool name="read_text_element" args="Image k: locator/region"> Read and transcribe visible ...

  12. [12]

    (1 Star: Disjointed/Confusing; 3 Stars: Readable but with leaps; 5 Stars: Perfectly smooth/Logical)

    Coherence (Reasoning Logic & Fluency) Definition: Evaluates whether the chain of thought is logically sound, structured, and easy to follow. (1 Star: Disjointed/Confusing; 3 Stars: Readable but with leaps; 5 Stars: Perfectly smooth/Logical)

  13. [13]

    (1 Star: Major fabrications; 3 Stars: Minor errors; 5 Stars: Entirely truthful)

    Non-Hallucination (Faithfulness) Definition: Evaluates whether the response contains fabricated information. (1 Star: Major fabrications; 3 Stars: Minor errors; 5 Stars: Entirely truthful)

  14. [14]

    (1 Star: Contradictory; 3 Stars: Partially consistent; 5 Stars: Fully consistent)

    Factual Consistency (Alignment with Gold Answer) Definition: Evaluates whether the model’s final conclusion aligns with the Gold Answer. (1 Star: Contradictory; 3 Stars: Partially consistent; 5 Stars: Fully consistent). Output Format: Please strictly follow this format: Coherence: [Score] Non-Hallucination: [Score] Factual Consistency: [Score] Average: [Av...
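The tool-curation output format quoted in entries 6 to 8 above (`DEFINE_TOOL: name || args || desc`, `<tool name="..." args="...">reasoning</tool>`, and the `END_OF_TOOLS` terminator) is regular enough to parse mechanically. The sketch below is one possible reading, not the authors' parser: the `ToolDef` and `ToolCall` dataclasses and `parse_tool_output` are hypothetical names, and it assumes one `DEFINE_TOOL` per line, since line breaks are lost in the extracted example.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToolDef:
    name: str
    args: str
    desc: str


@dataclass
class ToolCall:
    name: str
    args: str
    result: str


_CALL = re.compile(r'<tool name="([^"]*)" args="([^"]*)">(.*?)</tool>', re.DOTALL)
_DEF = re.compile(r"^DEFINE_TOOL:\s*(.*)$", re.MULTILINE)


def parse_tool_output(text: str) -> Tuple[List[ToolDef], List[ToolCall]]:
    """Extract tool definitions and tool applications from curated output."""
    # Ignore anything after the END_OF_TOOLS marker.
    text = text.split("END_OF_TOOLS", 1)[0]
    defs: List[ToolDef] = []
    for m in _DEF.finditer(text):
        parts = [p.strip() for p in m.group(1).split("||")]
        if len(parts) == 3:          # name || args || desc
            defs.append(ToolDef(*parts))
    calls = [ToolCall(n, a, r.strip()) for n, a, r in _CALL.findall(text)]
    return defs, calls


if __name__ == "__main__":
    sample = (
        "DEFINE_TOOL: subtract_values || val1, val2 || Calculate difference.\n"
        '<tool name="locate_table_row" args="row Q3 Revenue">Row 4</tool>\n'
        '<tool name="read_cell_value" args="Row 4, col Amount">$150M</tool>\n'
        "END_OF_TOOLS"
    )
    definitions, applications = parse_tool_output(sample)
    print(definitions)
    print([call.name for call in applications])
```

The names of the parsed applications could feed a frequency count when selecting the top-K tools for the shared pool.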