Embodied AI Agents: Modeling the World

Alessandro Lazaric; Andrea Madotto; Arjun Majumdar; Asli Celikyilmaz; Basile Terver; Delong Chen; Emmanuel Dupoux; Florian Metze; Franziska Meier; Hervé Jégou

arxiv: 2506.22355 · v3 · pith:P3QXPPJPnew · submitted 2025-06-27 · 💻 cs.AI

Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Louis-Philippe Morency, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, Paden Tomasello, Jitendra Malik

classification 💻 cs.AI

keywords agents, world, embodied, learn, physical, environments, interact, modeling

This paper describes our research on AI agents embodied in visual, virtual, or physical forms, enabling them to interact with both users and their environments. These agents, which include virtual avatars, wearable devices, and robots, are designed to perceive, learn, and act within their surroundings, which brings them closer than disembodied agents to the way humans learn from and interact with their environments. We propose that the development of world models is central to the reasoning and planning of embodied AI agents, allowing these agents to understand and predict their environment and to understand user intentions and social contexts, thereby enhancing their ability to perform complex tasks autonomously. World modeling encompasses the integration of multimodal perception, planning through reasoning for action and control, and memory to create a comprehensive understanding of the physical world. Beyond the physical world, we also propose to learn the mental world model of users to enable better human-agent collaboration.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coding Agent Is Good As World Simulator
cs.AI 2026-05 unverdicted novelty 7.0

A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.
Graph World Models: Concepts, Taxonomy, and Future Directions
cs.AI 2026-04 unverdicted novelty 7.0

The paper unifies emerging graph-based world models under a new paradigm and proposes a taxonomy organized by spatial, physical, and logical relational inductive biases.
Analytic Concept-Centric Memory for Agentic Embodied Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

Proposes a structured concept-centric memory system for embodied agents that connects object, scene, transition, and skill memories to support coarse-to-fine retrieval and improve task performance over baselines.
SCOPE: Evolving Symbolic World for Planning in Open-Ended Environments
cs.AI 2026-06 unverdicted novelty 6.0

SCOPE is a self-adaptive symbolic planning framework that refines plans and evolves symbolic world models via simulator feedback and distilled knowledge to improve long-horizon planning in open-ended embodied environments.
COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
cs.AI 2026-06 unverdicted novelty 6.0

COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.
Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
cs.CV 2026-05 unverdicted novelty 6.0

A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
cs.GR 2026-05 unverdicted novelty 6.0

A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
cs.GR 2026-05 unverdicted novelty 6.0

SCRIPT presents a scalable diffusion policy with JAST-DiT architecture, nonlinear history conditioning, and RLHR post-training that claims to outperform prior methods on text alignment, motion quality, and physical re...
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans
cs.CL 2026-05 unverdicted novelty 6.0

LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE...
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
cs.RO 2026-05 unverdicted novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
cs.RO 2026-05 unverdicted novelty 6.0

Sentinel-VLA adds metacognitive status monitoring to VLA models for on-demand reasoning and error recovery, reporting over 30% higher real-world task success than prior SOTA.
Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
cs.RO 2026-05 unverdicted novelty 6.0

Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...
Source-Modality Monitoring in Vision-Language Models
cs.CL 2026-04 unverdicted novelty 6.0

Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
AgentComm: Semantic Communication for Embodied Agents
eess.SP 2026-04 unverdicted novelty 6.0

AgentComm achieves nearly 50% bandwidth reduction in embodied agent communication via LLM semantic processing, importance-aware transmission, and a task knowledge base, with negligible impact on task completion.
Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning
cs.RO 2026-04 unverdicted novelty 6.0

Morphology-conditioned quadrupedal world model enables zero-shot generalization to new robot embodiments for locomotion tasks.
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
cs.CV 2026-03 unverdicted novelty 6.0

A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
cs.CV 2026-02 unverdicted novelty 6.0

GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
cs.CV 2026-02 unverdicted novelty 6.0

VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality
cs.HC 2026-01 unverdicted novelty 6.0

SpeechLess enables micro-utterance AR interactions by binding prior interactions to personal spatial context for intent extrapolation.
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
cs.AI 2026-06 unverdicted novelty 5.0

A three-stage training pipeline internalizes world-model simulation and success estimation in LLM agents for improved planning on search and math tasks.
MagicSim: A Unified Infrastructure for Executable Embodied Interaction
cs.RO 2026-06 unverdicted novelty 5.0

MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and...
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance
cs.AI 2026-04 unverdicted novelty 5.0

IndustryAssetEQA integrates episodic telemetry representations with an FMEA knowledge graph to support embodied question answering over industrial assets, showing large gains in validity and reduced overclaims versus ...
What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
cs.AI 2025-12 unverdicted novelty 5.0

An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
Critique of Agent Model
cs.AI 2026-06 unverdicted novelty 4.0

Distinguishes agentic (externally scaffolded) from agentive (internally structured) AI systems and proposes the Goal-Identity-Configurator architecture for endogenous autonomy.
Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration
cs.RO 2026-06 unverdicted novelty 4.0

The paper presents a proof-of-concept closed-loop system using consumer EEG to detect high cognitive engagement and defer multi-agent robotic communications until lower workload.
6G Communication Networks Enabling Embodied Agents: Architecture and Prototype
cs.RO 2026-05 unverdicted novelty 4.0

Proposes a four-layer hierarchical communication architecture for 6G-enabled human-robot interaction and shows feasibility via a 5G-based prototype with millisecond latency and stable operation.
Coding Agent Is Good As World Simulator
cs.AI 2026-05 unverdicted novelty 4.0

An agentic framework generates executable physics simulation code from text prompts via coordinated planning, coding, visual, and physics agents that iterate to satisfy both prompt fidelity and physical constraints.
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
cs.CY 2026-04 unverdicted novelty 4.0

Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
A Tutorial on World Models and Physical AI
cs.AI 2026-06 unverdicted novelty 2.0

A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.
Resource Consumption Threats in Large Language Models
cs.CR 2026-03 unverdicted novelty 2.0

A systematic review of resource consumption threats in LLMs that organizes the problem along the full pipeline from threat induction to mitigation.