Intrinsic Motivation Driven Intuitive Physics Learning using Deep Reinforcement Learning with Intrinsic Reward Normalization

Jaewon Choi; Sung-Eui Yoon

arxiv: 1907.03116 · v1 · pith:BUXOISRLnew · submitted 2019-07-06 · 💻 cs.LG · cs.RO · stat.ML

Pith reviewed 2026-05-25 01:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.RO · stat.ML

keywords intuitive physics · deep reinforcement learning · intrinsic motivation · reward normalization · graphical physics network · 3D physics engine · object interaction · unsupervised learning

The pith

A deep RL agent builds an intuitive physics model of objects by using intrinsic reward normalization to select the most informative interactions without external supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines a graphical physics network with deep reinforcement learning to let an agent learn object positions and velocities through self-directed play. An intrinsic reward normalization step guides the agent toward actions expected to improve the model most. In a 3D physics engine the approach produces more varied actions and higher prediction accuracy than baselines. The method is shown to work for both stationary and changing object states while relying solely on intrinsic signals.
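As a rough illustration of the self-directed play loop described above, the toy sketch below lets an agent pick whichever action its predictor currently models worst, observe the outcome, and update the predictor. Everything in it is hypothetical: the one-dimensional `env_step`, the `TRUE_SHIFT` table, the `Predictor` class, and the greedy selection rule stand in for the paper's 3D physics engine, graphical physics network, and learned policy, none of which are specified on this page.

```python
import numpy as np

# Hidden effect of each action in a toy 1D world (assumption, not from the paper).
TRUE_SHIFT = np.array([0.0, 1.0, -2.0])

def env_step(obs, action):
    """Toy environment: the observation moves by a fixed, hidden amount."""
    return obs + TRUE_SHIFT[action]

class Predictor:
    """Learns one shift estimate per action from its own prediction error."""
    def __init__(self, num_actions, lr=0.5):
        self.shift = np.zeros(num_actions)
        self.lr = lr

    def predict(self, obs, action):
        return obs + self.shift[action]

    def update(self, obs, action, next_obs):
        error = next_obs - self.predict(obs, action)
        self.shift[action] += self.lr * error
        return error ** 2  # squared prediction loss, reused as intrinsic reward

def run(steps=30):
    predictor = Predictor(len(TRUE_SHIFT))
    last_loss = np.ones(len(TRUE_SHIFT))  # optimistic start so each action gets tried
    obs = 0.0
    for _ in range(steps):
        action = int(np.argmax(last_loss))   # choose the worst-modelled action
        next_obs = env_step(obs, action)     # interact with the environment
        last_loss[action] = predictor.update(obs, action, next_obs)
        obs = next_obs
    return predictor.shift

learned_shift = run()
```

In this sketch no external reward appears anywhere: the only signal that steers action choice is the predictor's own loss.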

Core claim

The central claim is that a graphical physics network integrated with deep reinforcement learning, together with intrinsic reward normalization, enables an agent to improve its intuitive physics model by continuously interacting with objects using only intrinsic motivation, without external supervision or hand-designed rewards.
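To make "only intrinsic motivation" concrete, the following is a minimal tabular Q-learning update in which the reward argument is an intrinsic signal such as prediction loss. It is a stand-in only: the figure text on this page mentions a deep Q network, whereas the table, the function name `q_update`, and the values of `alpha` and `gamma` here are assumptions chosen for illustration.

```python
import numpy as np

def q_update(q, state, action, intrinsic_reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step driven purely by an intrinsic reward (no task reward)."""
    target = intrinsic_reward + gamma * q[next_state].max()
    q[state, action] += alpha * (target - q[state, action])
    return q

# Two states, two actions; the agent took action 1 in state 0 and was "surprised".
q_table = np.zeros((2, 2))
q_table = q_update(q_table, state=0, action=1, intrinsic_reward=1.0, next_state=1)
```

After this single update only the entry for the surprising state-action pair has moved, which is how a high-loss interaction becomes more attractive to repeat.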

What carries the argument

Intrinsic reward normalization, which re-scales rewards so the agent prioritizes actions that yield the largest expected gain in the accuracy of its physics model.
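The page does not give the normalization formula, so the sketch below shows one plausible reading rather than the paper's method: raw per-action prediction errors are min-max rescaled to [0, 1] and the agent greedily picks the largest. The function names and the choice of min-max scaling are assumptions.

```python
import numpy as np

def normalize_intrinsic_rewards(pred_errors, eps=1e-8):
    """Rescale raw per-action prediction errors to [0, 1] (assumed min-max scheme)."""
    errors = np.asarray(pred_errors, dtype=float)
    span = errors.max() - errors.min()
    if span < eps:
        # All actions look equally informative: return a flat signal.
        return np.zeros_like(errors)
    return (errors - errors.min()) / span

def pick_action(pred_errors):
    """Greedy choice: the action with the largest normalized intrinsic reward."""
    return int(np.argmax(normalize_intrinsic_rewards(pred_errors)))

raw_errors = [0.02, 0.50, 0.10]
rewards = normalize_intrinsic_rewards(raw_errors)
best_action = pick_action(raw_errors)
```

Rescaling keeps the reward range fixed as the predictor improves and raw errors shrink, which is one reason a normalization step might help; whether the paper argues this is not stated here.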

If this is right

The agent performs a greater variety of actions during learning.
Prediction accuracy for object positions and velocities rises in both stationary and non-stationary settings.
Learning proceeds without any external rewards or labeled data.
The same intrinsic mechanism supports model improvement across different object configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same normalization idea could be tested in real-robot settings where labeled physics data are scarce.
Removing hand-designed rewards may allow the approach to scale to environments with richer object dynamics.
Intrinsic motivation of this form might substitute for supervised pre-training when agents must acquire basic physical priors.

Load-bearing premise

The normalization step correctly ranks actions by how much they will actually improve the agent's model of object motion.

What would settle it

In the 3D physics engine experiments, the agent using intrinsic reward normalization fails to produce both higher model accuracy and a larger number of distinct actions than a random-interaction baseline.
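The test above can be written down as a small check. The sketch assumes a discrete action set and defines action coverage as the fraction of actions tried at least once; the helper names, the uniform random baseline, and the use of a single scalar prediction error are all hypothetical simplifications.

```python
import random

def action_coverage(actions_taken, num_actions):
    """Fraction of the discrete action set that was tried at least once."""
    return len(set(actions_taken)) / num_actions

def random_actions(num_actions, num_steps, seed=0):
    """Random-interaction baseline: uniformly sampled action indices."""
    rng = random.Random(seed)
    return [rng.randrange(num_actions) for _ in range(num_steps)]

def claim_survives(agent_cov, base_cov, agent_err, base_err):
    """The claim holds only if the agent is both more diverse and more accurate."""
    return agent_cov > base_cov and agent_err < base_err

baseline_cov = action_coverage(random_actions(num_actions=20, num_steps=10), 20)
```

With real logs, `agent_cov` and `agent_err` would come from the trained agent and `base_cov` and `base_err` from the random baseline under the same interaction budget.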

Figures

Figures reproduced from arXiv: 1907.03116 by Jaewon Choi, Sung-Eui Yoon.

**Figure 1.** This diagram illustrates how an agent chooses an action based on loss incurred from its predictor which tries to mimic the behavior of an environment (or a subset of an environment). (1) After making an observation at time t, the agent’s actor module chooses an action. (2) The predictor module takes the action and the observation at time t and makes a prediction; (3) the predictor module then compares its …

**Figure 2.** This shows a detailed diagram of our overall approach shown in …

**Figure 3.** Experiments in three different scenes: 3-object, 6-object and 8-object scenes. Colors represent different weights where red is 1kg, green is 0.75kg, blue is 0.5kg and white is 0.25kg. The radius of each object can be either 5cm or 7.5cm. We experiment in two scenarios: stationary and non-stationary states. object i: $dpos_{obj_i} = \sum_{i \neq j,\, j \in [1,N]} dpos_{obj_i, r_{i,j}}$, $dvel_{obj_i} = \sum_{i \neq j,\, j \in [1,N]} dvel_{obj_i, r_{i,j}}$. We trai…

**Figure 4.** Action coverage results after 3 or more runs with different random seeds. (From left to right) Action coverage of our agent in the 3-object, 6-object, and 8-object scenes. 5.2. Non-stationary State (Reinforcement Learning): We extend our prediction model with a deep Q network to non-stationary state problems where we do not reset the objects unless they go out of bounds. To increase the chance of collision, we pro…

**Figure 5.** Mean position and velocity prediction errors after 1 frame with different numbers of focus objects in the 3-object, 6-object, and 8-object scenes.

**Figure 6.** We test our intuition model in the non-stationary problem. Fewer prediction steps incur almost no error, while more prediction steps, i.e. more than 4 frames, incur a high loss due to cumulative errors. 6.1. Deep Reinforcement Learning: Recent advances in deep reinforcement learning have achieved super-human performance on various Atari games (Mnih et al., 2015) and robotic control problems (Schulman et…
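The cumulative-error effect noted in the Figure 6 caption can be illustrated with a deliberately simple open-loop rollout. The sketch assumes a one-dimensional object moving at constant velocity and a predictor whose velocity estimate is slightly off; the function name and numbers are hypothetical, and the paper's model may accumulate error in a different way.

```python
def rollout_error(true_velocity, predicted_velocity, steps, dt=1.0):
    """Absolute position error after each step when predictions are never corrected."""
    true_pos = 0.0
    pred_pos = 0.0
    errors = []
    for _ in range(steps):
        true_pos += true_velocity * dt
        pred_pos += predicted_velocity * dt  # rollout reuses its own output, no correction
        errors.append(abs(true_pos - pred_pos))
    return errors

errors = rollout_error(true_velocity=1.0, predicted_velocity=1.25, steps=4)
```

Here a per-step velocity bias of 0.25 grows into a position error of 1.0 after four frames, so a one-frame error that looks negligible can still dominate longer rollouts.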

read the original abstract

At an early age, human infants are able to learn and build a model of the world very quickly by constantly observing and interacting with objects around them. One of the most fundamental intuitions human infants acquire is intuitive physics. Human infants learn and develop these models, which later serve as prior knowledge for further learning. Inspired by such behaviors exhibited by human infants, we introduce a graphical physics network integrated with deep reinforcement learning. Specifically, we introduce an intrinsic reward normalization method that allows our agent to efficiently choose actions that can improve its intuitive physics model the most. Using a 3D physics engine, we show that our graphical physics network is able to infer objects' positions and velocities very effectively, and our deep reinforcement learning network encourages an agent to improve its model by making it continuously interact with objects using only intrinsic motivation. We evaluate our model on both stationary and non-stationary state problems and show the benefits of our approach in terms of the number of different actions the agent performs and the accuracy of the agent's intuition model. Videos are at https://www.youtube.com/watch?v=pDbByp91r3M&t=2s
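The text captured with Figure 3 describes each object's predicted position and velocity change as a sum of per-relation contributions over all other objects. A minimal NumPy sketch of that aggregation follows; the array layout `pairwise[i, j]` and the function name are assumptions, and the network that would produce the pairwise terms is not shown.

```python
import numpy as np

def aggregate_effects(pairwise):
    """Sum pairwise[i, j] over all j != i.

    pairwise has shape (N, N, D): entry [i, j] is the predicted change for
    object i due to its relation with object j (D = 3 for a 3D position or velocity).
    """
    pairwise = np.asarray(pairwise, dtype=float)
    n = pairwise.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)  # True where i != j
    return (pairwise * off_diagonal[:, :, None]).sum(axis=1)

# Three objects, each relation contributing a unit change along every axis.
effects = np.ones((3, 3, 3))
effects[0, 0] = 100.0  # self-relation entry, which the sum must ignore
total = aggregate_effects(effects)
```

The same function would serve for both the position and the velocity terms, since the two sums in the figure text share one index set.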

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines a graphical physics network with DRL and a normalized intrinsic reward to drive object interactions that improve the model, but the abstract supplies no equations, baselines, or numbers so the actual gains remain unverified.

read the letter

The main takeaway is that this work puts a graphical physics network inside a DRL loop and uses an intrinsic reward normalization step so the agent keeps picking actions that improve its own position and velocity predictions. They run the setup in a 3D engine on both stationary and non-stationary object problems and report higher action diversity plus better model accuracy than would be expected without the normalization.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces a graphical physics network integrated with deep reinforcement learning, along with an intrinsic reward normalization method, to enable an agent to learn intuitive physics models solely through self-motivated interactions with objects in a 3D physics engine. It reports that the approach improves the agent's model in both stationary and non-stationary settings, with benefits measured by action diversity and model accuracy.

Significance. If the central claims hold, the work would demonstrate a viable path for intrinsically motivated RL agents to acquire physical intuitions without external supervision or hand-designed rewards, potentially informing developmental models in AI.

major comments (1)

[Abstract] Abstract: the central claim that the intrinsic reward normalization 'allows the agent to efficiently choose actions that can improve its intuitive physics model the most' cannot be evaluated, as no equations, pseudocode, or quantitative results (e.g., accuracy deltas, baseline comparisons, or ablation on the normalization) are supplied to show that the reported gains in action diversity and model accuracy are attributable to the proposed component rather than other factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the intrinsic reward normalization 'allows the agent to efficiently choose actions that can improve its intuitive physics model the most' cannot be evaluated, as no equations, pseudocode, or quantitative results (e.g., accuracy deltas, baseline comparisons, or ablation on the normalization) are supplied to show that the reported gains in action diversity and model accuracy are attributable to the proposed component rather than other factors.

Authors: The abstract is a high-level summary. The graphical physics network, DRL integration, and intrinsic reward normalization (including equations) are described in the Methods. Experiments report gains in action diversity and model accuracy for the full approach versus baselines in stationary and non-stationary settings. We agree an explicit ablation isolating the normalization, plus quantitative deltas, is needed to strengthen attribution claims and will add this analysis and associated results in the revision.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a graphical physics network with DRL and an intrinsic reward normalization method for intuitive physics learning. No equations, fitted parameters presented as predictions, or self-citation chains are visible that reduce the central claim (agent improves model via intrinsic motivation alone) to a definition or input by construction. Experiments on stationary/non-stationary cases with action diversity and model accuracy metrics provide independent validation. The derivation remains self-contained against external benchmarks without load-bearing reductions to self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that intrinsic reward normalization produces efficient action selection.

pith-pipeline@v0.9.0 · 5731 in / 959 out tokens · 17480 ms · 2026-05-25T01:34:56.052690+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 10 canonical work pages · 4 internal anchors

[1] ISBN 9780470996652. doi: 10.1002/9780470996652.ch3. Barto, A. G., Singh, S., and Chentanez, N. Intrinsically motivated learning of hierarchical collections of skills. In Proceedings of International Conference on Developmental Learning (ICDL). MIT Press, Cambridge, MA,

[2] ISSN 0027-8424. doi: 10.1073/pnas.1306572110. Berlyne, D. E. Curiosity and exploration. Science, 153(3731):25–33,

[3] Chang, M. B., Ullman, T., Torralba, A., and Tenenbaum, J. B. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341,

[4] ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390187. Fragkiadaki, K., Agrawal, P., Levine, S., and Malik, J. Learning visual predictive models of physics for playing billiards. CoRR, abs/1511.07404,

[5] URL http://arxiv.org/abs/1511.07404. Hamrick, J. Internal physics models guide probabilistic judgments about object dynamics. 01

[6] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. CoRR, abs/1511.05952,

[7] ISSN 1943-0604. doi: 10.1109/TAMD.2010.2056368. Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 1889–1897. JMLR.org,

[8] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347,

[9] Stahl, A. E. and Feigenson, L. Observing the unexpected enhances infants’ learning and exploration. Science, 348(6230):91–94,

[10] ISSN 0036-8075. doi: 10.1126/science.aaa3799. URL http://science.sciencemag.org/content/348/6230/91. Stahl, A. E. and Feigenson, L. Expectancy violations promote learning in young children. Cognition, 163:1–14,

[11] ISSN 0010-0277. doi: https://doi.org/10.1016/j.cognition.2017.02.008. URL http://www.sciencedirect.com/science/article/pii/S0010027717300380. Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224. ...

[12] ISBN 0262193981. Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., and Tacchetti, A. Visual interaction networks: Learning a physics simulator from video. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 4539–4547. Curran A...
