Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
Embodied ai: From llms to world models
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 4polarities
background 4representative citing papers
E3VS-Bench supplies 99 3D Gaussian Splatting scenes and 2,014 episodes to test whether embodied agents can use unrestricted 5-DoF viewpoint control to answer questions that depend on fine-grained visual details visible only from specific angles.
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
Chatbot AI systems often fail complex needs while projecting authority, contributing to deskilling, labor displacement, economic concentration, and high environmental costs, so alternative pluralistic and task-specific designs are needed.
citing papers explorer
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
E3VS-Bench supplies 99 3D Gaussian Splatting scenes and 2,014 episodes to test whether embodied agents can use unrestricted 5-DoF viewpoint control to answer questions that depend on fine-grained visual details visible only from specific angles.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
-
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
-
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
-
What if AI systems weren't chatbots?
Chatbot AI systems often fail complex needs while projecting authority, contributing to deskilling, labor displacement, economic concentration, and high environmental costs, so alternative pluralistic and task-specific designs are needed.