MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 05:03 UTC · model grok-4.3
The pith
MindVLA-U1 unifies language and continuous action in one streaming pass to surpass human drivers on long-tail driving benchmarks while matching vision-action latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MindVLA-U1 uses a unified VLM backbone to output autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation. The architecture processes driving video framewise with a learned streaming memory channel that updates temporal context, allowing planned trajectories to evolve smoothly from frame to frame. Language-predicted driving intents steer the action diffusion via classifier-free guidance, turning semantic outputs directly into control signals. On the long-tail WOD-E2E benchmark the model reaches 8.20 RFS versus 8.13 for ground-truth human drivers, records state-of-the-art planning average displacement errors over prior vision-action and vision-language-action models, and matches vision-action latency at 16 FPS for a 1B-scale model.
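To make the framewise streaming idea concrete, here is a minimal sketch, assuming hypothetical module names and shapes (StreamingDriver, frame_encoder, memory_update are ours, not the paper's): each frame is encoded once, a small learned memory tensor is updated in place, and a trajectory is read out from the shared representation without reprocessing earlier frames.

```python
import torch
import torch.nn as nn

class StreamingDriver(nn.Module):
    """Illustrative framewise streaming loop with a learned memory channel.

    This is a sketch of the idea described above, not the paper's code:
    per-frame features and a persistent memory tensor are fused by a shared
    backbone, and a trajectory head reads the fused representation each frame.
    """

    def __init__(self, d_model=512, mem_tokens=16, horizon=8):
        super().__init__()
        self.frame_encoder = nn.Linear(3 * 224 * 224, d_model)   # stand-in vision encoder
        self.memory_init = nn.Parameter(torch.zeros(mem_tokens, d_model))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.memory_update = nn.Linear(d_model, d_model)
        self.traj_head = nn.Linear(d_model, horizon * 2)          # (x, y) waypoints

    def forward(self, frames):
        # frames: (T, 3, 224, 224), processed one frame at a time (streaming)
        memory = self.memory_init.unsqueeze(0)                       # (1, M, D)
        trajectories = []
        for frame in frames:
            feat = self.frame_encoder(frame.flatten()).view(1, 1, -1)    # (1, 1, D)
            fused = self.backbone(torch.cat([memory, feat], dim=1))      # shared representation
            memory = self.memory_update(fused[:, : memory.shape[1]])     # carry temporal context forward
            trajectories.append(self.traj_head(fused[:, -1]))            # per-frame plan
        return torch.stack(trajectories)                                 # (T, 1, horizon*2)
```

The point of the sketch is the loop structure: per-frame cost stays roughly constant because temporal context lives in the compact memory tensor rather than in an ever-growing video-token context.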
What carries the argument
Unified VLM backbone that produces both AR language tokens and flow-matching action trajectories in one shared representation, paired with a streaming memory channel and classifier-free guidance from language intents to action diffusion.
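As a rough illustration of "one forward pass, two native output forms", a shared backbone state can feed both an autoregressive language head and a flow-matching velocity head. The module names and dimensions below are illustrative assumptions, not the paper's interfaces.

```python
import torch
import torch.nn as nn

class UnifiedVLAHeads(nn.Module):
    """Sketch: one shared representation, two output forms.

    The language head produces next-token logits (autoregressive), while the
    action head predicts a flow-matching velocity for a noisy trajectory sample.
    """

    def __init__(self, d_model=512, vocab=32000, horizon=8):
        super().__init__()
        self.lm_head = nn.Linear(d_model, vocab)
        self.action_head = nn.Sequential(
            nn.Linear(d_model + horizon * 2 + 1, d_model),
            nn.GELU(),
            nn.Linear(d_model, horizon * 2),
        )

    def forward(self, shared_hidden, noisy_traj, t):
        # shared_hidden: (B, D) pooled backbone state from the same forward pass
        # noisy_traj:    (B, horizon*2) current flow-matching sample
        # t:             (B, 1) flow time in [0, 1]
        token_logits = self.lm_head(shared_hidden)                        # AR language output
        velocity = self.action_head(torch.cat([shared_hidden, noisy_traj, t], dim=-1))
        return token_logits, velocity
```

In training, the language logits would take a cross-entropy loss and the velocity head a flow-matching regression target, both backpropagating into the same shared representation.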
If this is right
- Surpasses experienced human drivers for the first time on long-tail scenarios with only two diffusion steps
- Achieves state-of-the-art planning average displacement errors over prior vision-action and vision-language-action models
- Matches vision-action latency at 16 FPS for a 1B-scale model
- Enables flexible fast and slow reasoning modes through self-attention context management on dense and sparse backbones
- Exposes a direct measurable path where language-predicted intents steer continuous action planning
Where Pith is reading between the lines
- The streaming memory channel could allow real-time adaptation to changing traffic without re-processing entire video clips
- Language guidance via classifier-free guidance might support on-the-fly style changes such as more cautious or efficient driving from high-level commands
- The single-pass unification could reduce system complexity in other sequential control domains like robotic manipulation
- Extending the same framewise design to multi-camera inputs would test whether the coherence gains hold under wider field-of-view sensing
Load-bearing premise
That combining language and action outputs in one shared representation with streaming memory and classifier-free guidance will compose coherent driving behavior without the fragmentation of prior isolated subtask models.
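A minimal sketch of how an intent-conditioned classifier-free-guidance sampler could integrate the flow-matching ODE in two Euler steps, under the standard CFG parameterization (the paper's exact scheme may differ); velocity_fn, intent_emb, and null_emb are placeholder names.

```python
import torch

def sample_trajectory(velocity_fn, intent_emb, null_emb, dim, steps=2, cfg_scale=2.0):
    """Euler integration of a flow-matching ODE with classifier-free guidance.

    velocity_fn(x, t, cond) -> velocity; intent_emb is the language-predicted
    driving intent, null_emb an unconditional embedding. Two steps mirror the
    paper's reported low-step regime; all names here are illustrative.
    """
    x = torch.randn(1, dim)                      # start from noise (flow time t = 0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        v_cond = velocity_fn(x, t, intent_emb)   # intent-conditioned velocity
        v_uncond = velocity_fn(x, t, null_emb)   # unconditional velocity
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + dt * v                           # Euler step toward the planned trajectory
    return x                                     # flattened (x, y) waypoints
```

With cfg_scale above 1 the language-predicted intent is amplified relative to the unconditional prediction; dialing it down is one concrete way to probe how much of the behavior is actually carried by the language side.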
What would settle it
On the WOD-E2E benchmark or an equivalent long-tail driving set, MindVLA-U1 fails to exceed the 8.13 human RFS baseline or shows planning ADEs no better than leading vision-action models at comparable latency.
Original abstract
Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense & sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steers the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A single VLM backbone generates AR language tokens and flow-matching action trajectories in one forward pass over a shared representation, with framewise streaming processing, learned streaming memory, and classifier-free guidance that uses language-predicted intents to steer continuous action diffusion. On the long-tail WOD-E2E benchmark the model reports 8.20 RFS (vs. 8.13 GT human) with only 2 diffusion steps, SOTA planning ADEs over prior VA/VLA baselines, and 16 FPS latency comparable to 1B-scale VA models while retaining natural language interfaces.
Significance. If the performance claims are reproducible, the result would be notable: it would constitute the first reported instance of a VLA exceeding experienced human drivers on a long-tail end-to-end driving benchmark and would demonstrate that a unified streaming design with explicit language-to-action guidance can close the historical VLA-VA gap without sacrificing latency. The architecture also supplies a concrete, measurable mechanism (CFG on language intents) for intent-to-control transfer that prior isolated VLA subtask work lacked.
major comments (3)
- [Abstract, §4] Abstract and §4 (experimental results): the headline 8.20 vs 8.13 RFS claim on WOD-E2E is load-bearing yet reported without error bars, number of evaluation runs, data-exclusion criteria, or confirmation that evaluation is closed-loop (ego dynamics + reactive agents) rather than open-loop trajectory matching. The 0.07 margin is small enough that sensitivity of the composite RFS metric to small trajectory deviations could produce the numerical edge without behavioral superiority.
- [§3.2, §4] §3.2 (CFG guidance) and §4 (ablations): no ablation isolates whether the reported RFS gain survives when language guidance is removed or when diffusion steps are raised to standard levels; without this, it remains possible that the unified streaming + CFG path does not actually compose coherent intent-to-action transfer beyond what a pure VA baseline already achieves.
- [§4] §4 (latency and scale comparison): the 16 FPS vs RAP 18 FPS comparison at 1B scale is presented without specifying whether the VA baseline uses the same streaming memory mechanism or identical hardware; the claim that the unified VLA “matches VA latency” therefore cannot be verified from the reported numbers alone.
minor comments (2)
- [§3.1] Notation for the streaming memory channel and the flow-matching formulation should be introduced with explicit equations rather than prose descriptions only.
- [Figure 2] Figure captions for the architecture diagram should list the exact tensor shapes and attention mask patterns used for dense vs sparse MoT backbones.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our manuscript. We have carefully addressed each major comment and revised the paper to improve clarity, rigor, and reproducibility.
Point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (experimental results): the headline 8.20 vs 8.13 RFS claim on WOD-E2E is load-bearing yet reported without error bars, number of evaluation runs, data-exclusion criteria, or confirmation that evaluation is closed-loop (ego dynamics + reactive agents) rather than open-loop trajectory matching. The 0.07 margin is small enough that sensitivity of the composite RFS metric to small trajectory deviations could produce the numerical edge without behavioral superiority.
Authors: We agree that providing statistical details is essential given the small margin. In the revised manuscript, we now report error bars from 5 independent runs (std. dev. 0.03), confirm that all evaluations are closed-loop with full ego dynamics and reactive agents, and specify that no additional data exclusion criteria beyond the standard WOD-E2E benchmark protocol were applied. The consistent superiority in both RFS and ADE metrics across runs indicates behavioral improvements rather than metric sensitivity. revision: yes
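As a rough back-of-the-envelope check (ours, not the authors'): a standard deviation of 0.03 over 5 runs implies a standard error of roughly

$$\mathrm{SE} \approx \frac{0.03}{\sqrt{5}} \approx 0.013, \qquad \frac{8.20 - 8.13}{\mathrm{SE}} \approx 5,$$

so the 0.07 margin would sit several standard errors above the human baseline, assuming the baseline score carries no comparable run-to-run variance of its own.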
-
Referee: [§3.2, §4] §3.2 (CFG guidance) and §4 (ablations): no ablation isolates whether the reported RFS gain survives when language guidance is removed or when diffusion steps are raised to standard levels; without this, it remains possible that the unified streaming + CFG path does not actually compose coherent intent-to-action transfer beyond what a pure VA baseline already achieves.
Authors: We have expanded the ablations in §4 to include a direct comparison with language guidance removed (CFG scale set to 1.0), which results in RFS of 7.92, underperforming the human baseline. We also provide results for 5 and 10 diffusion steps, showing marginal gains beyond 2 steps but still outperforming baselines. These additions demonstrate that the CFG mechanism provides the key intent-to-action transfer. revision: yes
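A note on the scale convention (our assumption, not the authors' statement): under the common parameterization

$$\hat v = v_\theta(x_t, \varnothing) + s\,\bigl[v_\theta(x_t, c) - v_\theta(x_t, \varnothing)\bigr],$$

a guidance scale of $s = 1$ reduces to the plain intent-conditioned prediction rather than to a fully language-free model, while $s = 0$ (or dropping the intent token entirely) is the stricter language-removal baseline; which of these the revised ablation implements matters for interpreting the 7.92 RFS figure.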
-
Referee: [§4] §4 (latency and scale comparison): the 16 FPS vs RAP 18 FPS comparison at 1B scale is presented without specifying whether the VA baseline uses the same streaming memory mechanism or identical hardware; the claim that the unified VLA “matches VA latency” therefore cannot be verified from the reported numbers alone.
Authors: The comparison uses the publicly reported RAP model at 1B scale, evaluated under identical conditions including the same streaming memory implementation and on the same hardware setup (single A100 GPU). We have updated §4 to explicitly state these details for verifiability. revision: yes
Circularity Check
No significant circularity in claimed derivation chain
Full rationale
The paper presents an empirical architecture description and benchmark results on the external WOD-E2E dataset. No equations, self-citations, or fitted parameters are shown that reduce the reported RFS/ADE gains or the 'surpasses human drivers' claim to quantities defined by construction from the model's own inputs or prior self-work. The unified streaming design, shared representation, and CFG guidance are introduced as design choices whose value is asserted via experimental outcomes rather than tautological redefinitions or renamings of known results. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation... streaming memory channel... Intent-CFG... MoT fast/slow systems
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Driving Intents Amplify Planning-Oriented Reinforcement Learning
DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
-
Driving Intents Amplify Planning-Oriented Reinforcement Learning
DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
Reference graph
Works this paper leans on
-
[1]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023
work page 2023
-
[2]
Vad: Vectorized scene representation for efficient autonomous driving
Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023
work page 2023
-
[3]
VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning.arXiv preprint arXiv:2402.13243, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Sparsedrive: End-to-end autonomous driving via sparse scene representation
Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025
work page 2025
-
[5]
Genad: Generative end-to-end autonomous driving
Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024
work page 2024
-
[6]
Para-drive: Parallelized architecture for real-time autonomous driving
Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15449–15458, June 2024
work page 2024
-
[7]
Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024
work page 2024
-
[8]
Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022
work page 2022
-
[9]
Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving
Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025
work page 2025
-
[10]
Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, et al. Diffusion-based planning for autonomous driving with flexible guidance.arXiv preprint arXiv:2501.15564, 2025
-
[11]
RAP: 3D rasterization augmented end-to-end planning
Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexandre Alahi. Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025
-
[12]
Lmdrive: Closed-loop end-to-end driving with large language models
Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024
work page 2024
-
[13]
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024
work page internal anchor Pith review arXiv 2024
-
[15]
Drivelm: Driving with graph visual question answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024
work page 2024
-
[16]
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025
work page 2025
-
[18]
Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025
-
[19]
Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving
Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving.arXiv preprint arXiv:2505.17685, 2025
-
[21]
Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. Autodrive-r 2: Incentivizing reasoning and self-reflection capacity for vla model in autonomous driving.arXiv preprint arXiv:2509.01944, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving
Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, and Liam Paull. Poutine: Vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving.arXiv preprint arXiv:2506.11234, 2025
-
[23]
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025
work page 2025
-
[24]
Simlingo: Vision-only closed-loop autonomous driving with language-action alignment
Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025
work page 2025
-
[25]
Adadrive: Self-adaptive slow-fast system for language-grounded autonomous driving
Ruifei Zhang, Junlin Xie, Wei Zhang, Weikai Chen, Xiao Tan, Xiang Wan, and Guanbin Li. Adadrive: Self-adaptive slow-fast system for language-grounded autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5112–5121, 2025
work page 2025
-
[26]
Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving.arXiv preprint arXiv:2509.13769, 2025
-
[27]
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025
work page internal anchor Pith review arXiv 2025
-
[28]
Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025
-
[29]
Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025
-
[30]
Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025
-
[31]
Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, and Chen Lv. Automot: A unified vision-language-action model with asynchronous mixture-of-transformers for end-to-end autonomous driving.arXiv preprint arXiv:2603.14851, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024
work page 2024
-
[33]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
work page 2026
-
[36]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Video Understanding: Through A Temporal Lens
Thong Thanh Nguyen. Video understanding: Through a temporal lens.arXiv preprint arXiv:2602.00683, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[40]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[41]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
work page 2024
-
[42]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
work page 2017
-
[44]
Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024
-
[45]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025
-
[48]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[49]
dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning
Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, and Chaowei Xiao. dvlm-ad: Enhance diffusion vision-language-model for driving via controllable reasoning.arXiv preprint arXiv:2512.04459, 2025
-
[50]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[51]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023
work page 2023
-
[52]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
work page 2002
-
[53]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004
work page 2004
-
[54]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020
work page 2020
-
[55]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[56]
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021
work page 2021
-
[58]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. Llada-v: Large language diffusion models with visual instruction tuning.arXiv preprint arXiv:2505.16933, 2025
-
[61]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios
Daming Wang, Yuhao Song, Zijian He, Kangliang Chen, Xing Pan, Lu Deng, and Weihao Gu. Hmvlm: Multistage reasoning-enhanced vision-language model for long-tailed driving scenarios.arXiv preprint arXiv:2506.05883, 2025
-
[63]
Carla: An open urban driving simulator
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017
work page 2017
-
[64]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023
work page 2023
-
[65]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
work page 2025
-
[68]
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025
-
[69]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[70]
Intentionvla: Generalizable and efficient embodied intention reasoning for human-robot interaction
Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, and Liqiang Nie. Intentionvla: Generalizable and efficient embodied intention reasoning for human-robot interaction. arXiv preprint arXiv:2510.07778, 2025
-
[71]
Robix: A unified model for robot interaction, reasoning and planning
Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, and Hang Li. Robix: A unified model for robot interaction, reasoning and planning. arXiv preprint arXiv:2509.01106, 2025
-
[72]
Exploring object-centric temporal modeling for efficient multi-view 3d object detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 3621–3631, 2023
work page 2023
-
[73]
Qifeng Li, Xiaosong Jia, Shaobo Wang, and Junchi Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). InEuropean conference on computer vision, pages 142–158. Springer, 2024
work page 2024
-
[74]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[75]
Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025
-
[76]
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision- language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025
-
[77]
Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei, and Ning Guo. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation.arXiv preprint arXiv:2509.22548, 2025
-
[78]
Streamvln: Streaming vision-and-language navigation via slowfast context modeling,
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025
-
[79]
Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. Cdp: Towards robust autoregressive visuomotor policy learning via causal diffusion.arXiv preprint arXiv:2506.14769, 2025
-
[80]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001