DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
VLABench: A large-scale benchmark for language-conditioned robotics manipu- lation with long-horizon reasoning tasks
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
citing papers explorer
-
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
-
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.