PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
hub Mixed citations
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
Mixed citation behavior. Most common role is background (58%).
abstract
Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, our \model{} is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
HazardArena shows VLA models trained on safe data frequently produce unsafe actions in semantically risky but visually similar settings, and a training-free Safety Option Layer reduces those failures with little performance cost.
TAP uses two-stage pretraining on unlabeled data to learn physical competence before language grounding, matching 1M-expert models with far less labeled data and showing robustness on real robots.
VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.
Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
MPCoT improves long-horizon VLA performance on LIBERO and CALVIN by initializing M latent hypotheses, refining them over K steps, and aggregating via a reward-trained path scorer while preserving the original 8-step action interface and generating zero reasoning tokens.
3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.
Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's
DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
citing papers explorer
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.