hub Canonical reference

Vla-adapter: An effective paradigm 10 for tiny-scale vision-language-action model

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al · 2025 · arXiv 2509.09372

Canonical reference. 83% of citing Pith papers cite this work as background.

18 Pith papers citing it

Background 83% of classified citations

read on arXiv browse 18 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 1

citation-polarity summary

background 5 baseline 1

representative citing papers

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-04-14 · unverdicted · novelty 7.0

HazardArena shows VLA models trained on safe data frequently produce unsafe actions in semantically risky but visually similar settings, and a training-free Safety Option Layer reduces those failures with little performance cost.

DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

cs.RO · 2026-03-27 · conditional · novelty 7.0

DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

Towards Generalizable Robotic Manipulation in Dynamic Environments

cs.CV · 2026-03-16 · unverdicted · novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.

Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

cs.RO · 2026-04-14 · unverdicted · novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

cs.CV · 2026-04-06 · conditional · novelty 6.0

E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

cs.RO · 2026-02-22 · unverdicted · novelty 6.0

OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.

ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

cs.CV · 2025-11-22 · conditional · novelty 6.0

ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x speedup with comparable or better performance on embodied tasks.

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

cs.RO · 2025-11-18 · unverdicted · novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

cs.RO · 2026-05-17 · unverdicted · novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

cs.CV · 2026-04-21 · unverdicted · novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

cs.RO · 2026-04-16 · unverdicted · novelty 5.0

The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

cs.LG · 2025-11-24 · unverdicted · novelty 5.0

AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.

JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

cs.RO · 2026-04-22 · unverdicted · novelty 4.0

JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

cs.RO · 2026-03-04

citing papers explorer

Showing 18 of 18 citing papers.

RotVLA: Rotational Latent Action for Vision-Language-Action Model cs.RO · 2026-05-13 · unverdicted · none · ref 18
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models cs.RO · 2026-04-14 · unverdicted · none · ref 17
HazardArena shows VLA models trained on safe data frequently produce unsafe actions in semantically risky but visually similar settings, and a training-free Safety Option Layer reduces those failures with little performance cost.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching cs.RO · 2026-03-27 · conditional · none · ref 23
DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
Towards Generalizable Robotic Manipulation in Dynamic Environments cs.CV · 2026-03-16 · unverdicted · none · ref 52
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 55
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 17
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models cs.RO · 2026-05-06 · unverdicted · none · ref 18
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models cs.RO · 2026-04-14 · unverdicted · none · ref 61
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes cs.CV · 2026-04-06 · conditional · none · ref 62
E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation cs.RO · 2026-02-22 · unverdicted · none · ref 49
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models cs.CV · 2025-11-22 · conditional · none · ref 24
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x speedup with comparable or better performance on embodied tasks.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models cs.RO · 2025-11-18 · unverdicted · none · ref 67
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization cs.RO · 2026-05-17 · unverdicted · none · ref 170
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model cs.CV · 2026-04-21 · unverdicted · none · ref 70
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems cs.RO · 2026-04-16 · unverdicted · none · ref 27
The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention cs.LG · 2025-11-24 · unverdicted · none · ref 49
AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy cs.RO · 2026-04-22 · unverdicted · none · ref 36
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies cs.RO · 2026-03-04 · unreviewed · ref 43

Vla-adapter: An effective paradigm 10 for tiny-scale vision-language-action model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer