pith. sign in

arxiv: 2604.27366 · v1 · submitted 2026-04-30 · 💻 cs.CV

Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving

Pith reviewed 2026-05-07 09:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous drivingvision language actiontrajectory refinementcritic modelclosed-loop evaluationsynthetic datasetBench2Drive
0
0 comments X

The pith

A vision-language-action model judges its own rough driving trajectory and refines it before acting, raising closed-loop success on driving benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CriticVLA, a two-stage framework that first uses a VLA model to generate an initial trajectory from multimodal inputs and then applies the same model's critic capability to evaluate that trajectory and perform single-step optimization. This addresses the limitation in prior VLA driving methods that map inputs directly to controls without explicit judgment or refinement. The approach is supported by training on a large synthetic dataset of 12.9 million annotated trajectories. If the critic reliably improves trajectory quality, the result is more robust driving behavior in complex, closed-loop environments where small errors compound.

Core claim

CriticVLA extends VLAs from pure action generation to a judge-then-drive process: an initial rough trajectory is produced, after which multimodal evaluation by the VLA critic identifies issues and guides single-step refinement, producing higher-quality behaviors than direct-mapping baselines.

What carries the argument

The two-stage CriticVLA pipeline, in which the VLA critic performs multimodal evaluation followed by single-step optimization on an initial trajectory.

If this is right

  • Driving systems gain an internal mechanism for catching and correcting trajectory errors before execution.
  • VLA models trained on large synthetic trajectory sets can extend their role beyond mapping to include reliable judgment.
  • Closed-loop performance improves particularly in scenarios that previously exposed weaknesses in direct VLA control.
  • The framework provides a concrete way to leverage existing critic capabilities demonstrated in other LLM domains for vehicle control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar judge-then-act loops could apply to other embodied tasks such as robotic manipulation where initial plans benefit from self-evaluation.
  • The reliance on synthetic data suggests a path to scale critic training without exhaustive real-world labeling, provided domain gaps remain small.
  • If the refinement step generalizes, hybrid systems might combine this critic stage with existing rule-based safety layers for added redundancy.

Load-bearing premise

That a VLA model trained as a critic on synthetic trajectory data can produce evaluations and refinements that improve actual closed-loop driving performance.

What would settle it

Run the full CriticVLA pipeline versus a version without the critic refinement stage on the same Bench2Drive scenarios and measure whether the success rate gap disappears.

Figures

Figures reproduced from arXiv: 2604.27366 by Hao Yang, Jianing Huang, Lijin Yang, Shu Liu, Zhongzhan Huang.

Figure 1
Figure 1. Figure 1: Conceptual diagram of the VLA-based autonomous driving model (ADM). (a) The common paradigm of existing VLA￾based ADM. (b) Our CriticVLA: in Stage-1 a VLA-based ADM generates a rough trajectory A without language output, in Stage-2 the same VLA is reused to generate judgment and refinement to obtain final action A ′ . 2025; Zhang et al., 2025; Jiang et al., 2025a). Leveraging large-scale pretraining and un… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed CriticVLA framework for autonomous driving. Stage-1: we encode the image, language instruction, and rough trajectory queries as input to an LLM to get a rough action without language output. Stage-2: the rough action is then additionally encoded and processed by LLM to generate language output as critics. Refined action is then generated from the critics and refinement trajectory q… view at source ↗
Figure 3
Figure 3. Figure 3: The template of critique cr. where I = (V, L) denote the visual and language inputs, and = (LoRA , DETR ) denote the learnable generator parameters from LoRA and DETR, respectively, as shown in view at source ↗
Figure 4
Figure 4. Figure 4: The Visualization of CriticVLA’s refinement effect. The two blue points represent the current and next target points, respectively: (a) and (b) demonstrate the refinement of route waypoints leading to successful correction of steering to the highway entrance, (c) and (d) illustrate the adjustment of speed waypoints that control the future speed to avoid collision. The gray boxes display generated critiques view at source ↗
Figure 5
Figure 5. Figure 5 view at source ↗
Figure 6
Figure 6. Figure 6: The distribution of Lipschitz constant LQ and LC in different scenarios. trajectories for analysis. First, regarding the part of Assumption 3.2 related to the Q function: for an action A, Q(A) can be defined as the proportion of untriggered risks. If A triggers no risks, then Q(A) = 1. In the experimental setting of this paper, d(·, ·) adopts the Euclidean norm. As shown in Fig. (6), the distributions of L… view at source ↗
Figure 7
Figure 7. Figure 7: The template of prompt and answer. Input Preparation Multi-view RGB camera images are captured and preprocessed via dynamic resizing. Meanwhile, GPS and IMU sensors provide vehicle localization and orientation data. A route planner generates target points based on the global route plan, which are transformed into ego-centric coordinates along with the current vehicle speed to form the input feature set. Mo… view at source ↗
Figure 8
Figure 8. Figure 8: Performance improvement of our method relative to the Stage-1 Model across different scenarios. The x-axis represents scenario categories. The y-axis denotes the per-scenario success rate. D. Core Set Details The standard Bench2Drive benchmark comprises 220 distinct routes, spanning 44 scenario categories with 5 variations per category. While comprehensive, we observed that a subset of these routes present… view at source ↗
Figure 9
Figure 9. Figure 9: More visualization results of proposed CriticVLA. The proposed CriticVLA exhibits excellent correction effects on both speed and direction, effectively rectifying issues such as direction deviation, delayed braking, or inadequate deceleration during vehicle operation. Additionally, the presented failed cases indicate that although CriticVLA intentionally performs corrective operations like avoiding other v… view at source ↗
Figure 10
Figure 10. Figure 10: Examples of CriticDrive Dataset: collision and direction risk. Left panel: driving scenario from both the egocentric front-view and the simplified Bird’s-Eye View bounding box locations. Green dots represent the expert ground truth trajectory, while cyan dots indicate the rough trajectory. The future positions of surrounding dynamic actors are marked in blue. Crucially, potential safety hazards are visual… view at source ↗
Figure 11
Figure 11. Figure 11: Examples of CriticDrive Dataset: collision and direction risk. Left panel: driving scenario from both the egocentric front-view and the simplified Bird’s-Eye View bounding box locations. Green dots represent the expert ground truth trajectory, while cyan dots indicate the rough trajectory. The future positions of surrounding dynamic actors are marked in blue. Crucially, potential safety hazards are visual… view at source ↗
Figure 12
Figure 12. Figure 12: Examples of CriticDrive Dataset: speed and direction risk. Left panel: driving scenario from both the egocentric front-view and the simplified Bird’s-Eye View bounding box locations. Green dots represent the expert ground truth trajectory, while cyan dots indicate the rough trajectory. The future positions of surrounding dynamic actors are marked in blue. Crucially, potential safety hazards are visualized… view at source ↗
Figure 13
Figure 13. Figure 13: Examples of CriticDrive Dataset: no risk detected. Left panel: driving scenario from both the egocentric front-view and the simplified Bird’s-Eye View bounding box locations. Green dots represent the expert ground truth trajectory, while cyan dots indicate the rough trajectory. The future positions of surrounding dynamic actors are marked in blue. Crucially, potential safety hazards are visualized in red,… view at source ↗
read the original abstract

Recent advances in vision language action (VLA) models have shown remarkable potential for autonomous driving by directly mapping multimodal inputs to control signals. However, previous VLA-based methods have not explicitly exploited the critic capability of VLAs to refine driving decisions, even though such capability has been well demonstrated in other LLM-based domains, thereby limiting their performance in complex closed-loop scenarios. In this work, we present a theoretically inspired two-stage framework, CriticVLA, which extends the role of VLAs from acting to judging. CriticVLA first generates a rough trajectory and then refines it through multimodal evaluation and single-step optimization guided by a VLA-based critic, yielding higher-quality driving behaviors. To support this process, we construct a large-scale synthetic dataset of 12.9 million annotated trajectories covering diverse driving scenarios, which enhances the critic's reasoning and refinement abilities. Extensive closed-loop experiments on the Bench2Drive benchmark show that CriticVLA significantly surpasses state-of-the-art baselines, achieving a 73.33% total success rate and delivering about 30% improvement in challenging scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents CriticVLA, a two-stage vision-language-action framework for autonomous driving. In the first stage, a VLA model generates an initial rough trajectory from multimodal inputs. In the second stage, a VLA-based critic evaluates and refines the trajectory through multimodal assessment and single-step optimization. To enable this, the authors build a synthetic dataset comprising 12.9 million annotated trajectories. Closed-loop evaluations on the Bench2Drive benchmark demonstrate that CriticVLA achieves a 73.33% success rate, surpassing state-of-the-art methods with approximately 30% gains in challenging scenarios.

Significance. If the reported performance gains hold under rigorous validation, this work could advance VLA applications in autonomous driving by explicitly leveraging critic capabilities that have succeeded in other LLM domains. The large-scale synthetic dataset is a notable resource contribution. The closed-loop focus directly targets limitations of prior VLA methods, with potential implications for more reliable decision-making in complex scenarios. The theoretically motivated two-stage design is a strength.

major comments (2)
  1. [§5 (Experimental Results)] §5 (Experimental Results): The headline claims of 73.33% total success rate and ~30% improvement in challenging scenarios are presented without details on the number of evaluation runs, standard deviations, statistical significance tests, or exact baseline implementations and hyperparameter settings. This information is required to assess whether the closed-loop gains on Bench2Drive are robust or potentially confounded by evaluation protocol choices.
  2. [§3 (Dataset and Critic Training)] §3 (Dataset and Critic Training): The 12.9 million trajectory dataset is central to enabling the critic's refinement capability, yet the manuscript provides insufficient description of how critic annotations (e.g., quality labels or refinement targets) were generated, filtered, or validated for accuracy. Without this, it is difficult to evaluate the reliability of the assumption that the VLA critic can consistently produce better trajectories in closed-loop settings.
minor comments (2)
  1. [Figure 2] Figure 2: The diagram of the two-stage pipeline would benefit from explicit arrows or labels indicating the flow of multimodal inputs to the critic and the single-step optimization step.
  2. [§2] Notation in §2: The distinction between the actor VLA and critic VLA could be clarified with consistent variable naming (e.g., distinguishing π_actor from π_critic) to avoid reader confusion when describing the refinement process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point-by-point below. Where the manuscript is missing required details, we will revise accordingly to strengthen the presentation of our results and methods.

read point-by-point responses
  1. Referee: [§5 (Experimental Results)] §5 (Experimental Results): The headline claims of 73.33% total success rate and ~30% improvement in challenging scenarios are presented without details on the number of evaluation runs, standard deviations, statistical significance tests, or exact baseline implementations and hyperparameter settings. This information is required to assess whether the closed-loop gains on Bench2Drive are robust or potentially confounded by evaluation protocol choices.

    Authors: We agree that additional statistical details are needed to fully substantiate the reported performance. In the revised manuscript, we will explicitly state the number of independent evaluation runs conducted, report standard deviations alongside the 73.33% success rate and scenario-specific improvements, include statistical significance tests (such as paired t-tests) comparing CriticVLA against baselines, and provide complete hyperparameter settings plus implementation details for all baselines in the main text or supplementary material. These additions will allow readers to rigorously evaluate the robustness of the closed-loop gains. revision: yes

  2. Referee: [§3 (Dataset and Critic Training)] §3 (Dataset and Critic Training): The 12.9 million trajectory dataset is central to enabling the critic's refinement capability, yet the manuscript provides insufficient description of how critic annotations (e.g., quality labels or refinement targets) were generated, filtered, or validated for accuracy. Without this, it is difficult to evaluate the reliability of the assumption that the VLA critic can consistently produce better trajectories in closed-loop settings.

    Authors: We acknowledge that the current description of the dataset annotation process is insufficient. In the revision, we will expand Section 3 with a detailed account of how critic annotations were produced, including the specific methods for generating quality labels and refinement targets (via a combination of rule-based simulation metrics and VLA-guided assessment), the filtering criteria used to ensure data quality and diversity, and the validation steps such as expert review on sampled subsets to confirm annotation accuracy. This will better justify the reliability of the critic training and its closed-loop benefits. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper describes a two-stage CriticVLA pipeline that extends existing VLA models by adding an explicit critic stage for trajectory generation followed by multimodal refinement. The central claims rest on a newly constructed 12.9-million-trajectory synthetic dataset and closed-loop evaluation on the external Bench2Drive benchmark. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are present in the provided text. The framework is presented as building on prior VLA capabilities without reducing any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the critic stage provides meaningful improvements without introducing instability in closed-loop control.

axioms (1)
  • domain assumption Vision language action models possess a critic capability that can be leveraged for trajectory refinement in driving tasks
    Stated as extending from other LLM domains but not demonstrated in the abstract for this context.

pith-pipeline@v0.9.0 · 5498 in / 1307 out tokens · 60872 ms · 2026-05-07T09:48:04.458687+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references

  1. [1]

    reduce speed from 7.4 m/s to 3.5 m/s

    using learnable queries. We utilize two types of learnable queries within the LLM’s structure. First are 512 vision queries, which are used for latent multimodal feature extraction and grounding within the LLM’s attention mechanism. Second are 16 trajectory queries, which interact with the contextualized hidden states and are directly mapped to the output...

  2. [2]

    Semantic Analysis Phase:The model first performs an autoregressive decoding pass using the Language Critic Head, generating the complete risk assessment and action recommendations

  3. [3]

    Straight

    Refinement Phase:Using the generated analysis as context, the model employs the Query-based Delta Adapters to directly predict refined trajectories. Training data.The CriticVLA refinement stage is trained exclusively on the CriticDrive dataset, which totals 3,081,848 annotated driving frames. This dataset is logically partitioned into three components: Gr...

  4. [4]

    For dynamic actors, we utilize their future bounding boxes derived from the simulator’s ground truth

    Candidate Selection:We identify candidate objects O (vehicles, pedestrians, static meshes) within the ego-vehicle’s future field of view. For dynamic actors, we utilize their future bounding boxes derived from the simulator’s ground truth

  5. [5]

    Spatiotemporal Matching:For a selected object O at a future timestep tcrash, we calculate the precise position pobj. We then derive a constant collision speedv crash required for the ego-vehicle to reachp obj exactly att crash: vcrash = ∥pego −p obj∥ −δ saf ety ∆tcrash (33) whereδ saf ety accounts for the physical extent of the vehicle and the object

  6. [6]

    maintain speed

    Route Interpolation:We modify the geometric route to intersect with the target. The trajectory is reconstructed by interpolating between the current ego position and the collision point pobj, and then returning to the original route. The ego-vehicle is then simulated along this collision path at speed vcrash, guaranteeing a spatiotemporal overlap with the...