MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
When to trust your model: model-based policy optimization
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
The paper proposes an L0-L7 evidential ladder for evaluating world models in embodied decision-making, prioritizing interventional action fidelity and policy optimization utility over visual plausibility.
Physical admissibility is defined as a prediction-control interface using kinematic, dynamic, and composed-horizon conditions to reject invalid dynamics proposals, with AUC 0.957 on LeRobot PushT and 87-89% prevention of invalid actions in interventions.
A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.
citing papers explorer
-
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
-
How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position
The paper proposes an L0-L7 evidential ladder for evaluating world models in embodied decision-making, prioritizing interventional action fidelity and policy optimization utility over visual plausibility.
-
Can Predicted Dynamics Exist in the Physical World?
Physical admissibility is defined as a prediction-control interface using kinematic, dynamic, and composed-horizon conditions to reject invalid dynamics proposals, with AUC 0.957 on LeRobot PushT and 87-89% prevention of invalid actions in interventions.
-
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.