CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Jie He; Liqiang Nie; Renshan Zhang; Rui Shao; Wei Li

arxiv: 2508.21046 · v3 · pith:75WZHTFZnew · submitted 2025-08-28 · 💻 cs.CV · cs.RO

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Wei Li , Renshan Zhang , Rui Shao , Jie He , Liqiang Nie This is my paper

classification 💻 cs.CV cs.RO

keywords cogvlaroutingactionvision-language-actionattentioncognition-alignedextensivefold

0 comments

read the original abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment.We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming a instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Point Tracking Improves World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
cs.RO 2026-04 unverdicted novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0

HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency ...
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
cs.RO 2026-04 unverdicted novelty 6.0

VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
cs.RO 2026-02 unverdicted novelty 6.0

OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulat...
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
cs.LG 2025-11 unverdicted novelty 5.0

AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
cs.RO 2025-08 unverdicted novelty 5.0

This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.