Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
hub Canonical reference
Toward general-purpose robots via foundation models: A survey and meta-analysis
Canonical reference. 83% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 6representative citing papers
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.
DexWild co-trains dexterous robot policies on in-the-wild human hand interactions recorded with a low-cost system and limited robot data, achieving 68.5% success in unseen environments and 5.8x better cross-embodiment generalization.
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.