ESPRIT: Explaining Solutions to Physical Reasoning Tasks

Aadit Vyas; Abhijit Gupta; Caiming Xiong; Dragomir Radev; Jeremy Weiss; Nazneen Fatema Rajani; Richard Socher; Rui Zhang; Stephan Zheng; Yi Chern Tan

arxiv: 2005.00730 · v2 · pith:QNBUIQ5Lnew · submitted 2020-05-02 · 💻 cs.CL · cs.LG

keywords: physical, esprit, descriptions, events, human, approach, framework, interpretable

Neural networks lack the ability to reason about qualitative physics and so cannot generalize to scenarios and tasks unseen during training. We propose ESPRIT, a framework for commonsense reasoning about qualitative physics in natural language that generates interpretable descriptions of physical events. We use a two-step approach of first identifying the pivotal physical events in an environment and then generating natural language descriptions of those events using a data-to-text approach. Our framework learns to generate explanations of how the physical simulation will causally evolve so that an agent or a human can easily reason about a solution using those interpretable descriptions. Human evaluations indicate that ESPRIT produces crucial fine-grained details and has high coverage of physical concepts compared to even human annotations. Dataset, code and documentation are available at https://github.com/salesforce/esprit.
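
The two-step approach described in the abstract (select the pivotal events, then turn them into text) can be illustrated with a toy sketch. The abstract does not say how either step is implemented, so everything here is an assumption made for illustration: the `Event` record, the "involves a goal object" heuristic standing in for pivotal-event identification, and the fixed templates standing in for data-to-text generation. It is not the authors' method or the API of the released code.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Event:
    """One row of a hypothetical simulation trace (illustrative schema)."""
    step: int
    kind: str      # e.g. "collision_start" or "collision_end"
    subject: str
    other: str


def pivotal_events(events: List[Event], goal_objects: Set[str]) -> List[Event]:
    """Step 1 (toy heuristic): keep only events that involve a goal object."""
    return [e for e in events
            if e.subject in goal_objects or e.other in goal_objects]


# Step 2 (toy stand-in for data-to-text): one fixed template per event kind.
TEMPLATES = {
    "collision_start": "At step {step}, the {subject} hits the {other}.",
    "collision_end": "At step {step}, the {subject} separates from the {other}.",
}


def describe(events: List[Event]) -> List[str]:
    """Render each structured event as a natural language sentence."""
    return [TEMPLATES[e.kind].format(step=e.step, subject=e.subject, other=e.other)
            for e in events]


trace = [
    Event(3, "collision_start", "red ball", "gray platform"),
    Event(7, "collision_start", "gray platform", "black wall"),
    Event(12, "collision_start", "red ball", "green ball"),
]
descriptions = describe(pivotal_events(trace, {"green ball"}))
print(descriptions)
```

With `{"green ball"}` as the goal set, only the last event in the trace is kept and described. Keeping selection and verbalization as separate functions mirrors the two-step split in the abstract and lets either stand-in be swapped for a learned model.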

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do generative video models understand physical principles?
cs.CV 2025-01 unverdicted novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
cs.CV 2026-05 unverdicted novelty 6.0

A vision-language framework generates text-based rigid-body scene configurations from videos using motion reasoning and optical flow, reporting 0.30 IoU on CLEVRER (7x over baselines) and transfer to 235 real videos.

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
cs.CV 2026-02 unverdicted novelty 6.0

VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.