Learning the Arrow of Time
Pith reviewed 2026-05-25 11:09 UTC · model grok-4.3
The pith
A model trained on Markov process trajectories learns an arrow of time that measures reachability and detects side-effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a model can be trained to learn an arrow of time in a Markov process, and this learned direction agrees reasonably well with the Jordan-Kinderlehrer-Otto result for a class of stochastic processes, while also enabling measurement of reachability, detection of side-effects, and provision of an intrinsic reward signal in discrete and continuous environments.
What carries the argument
The learned arrow of time, a model trained to classify whether a given sequence of states runs forward or backward in time.
Load-bearing premise
That a meaningful and learnable arrow of time exists in the observed Markov process and can be extracted to reliably capture environmental properties such as reachability.
What would settle it
Compute the Jordan-Kinderlehrer-Otto arrow independently on a new family of stochastic processes and test whether the learned model recovers the same direction within the error reported for the original class.
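As a rough illustration of what an independently computed reference direction could look like, the sketch below uses a one-dimensional Ornstein-Uhlenbeck process dX = -X dt + sqrt(2) dW (potential Ψ(x) = x²/2, unit temperature). A Gaussian initial density stays Gaussian under this process, so the free energy F[ρ] = ∫Ψρ dx + ∫ρ log ρ dx has a closed form and its decrease over time can be checked directly. The choice of process, the initial mean and variance, and the function names are assumptions made for this example, not details taken from the paper.

```python
import math

def gaussian_moments(t, m0, v0):
    """Mean and variance at time t for dX = -X dt + sqrt(2) dW, Gaussian start."""
    mean = m0 * math.exp(-t)
    var = 1.0 + (v0 - 1.0) * math.exp(-2.0 * t)
    return mean, var

def free_energy(mean, var):
    """F = E[x^2 / 2] + integral of rho * log(rho) for a Gaussian density."""
    potential_energy = 0.5 * (mean ** 2 + var)
    neg_entropy = -0.5 * math.log(2.0 * math.pi * math.e * var)
    return potential_energy + neg_entropy

# Illustrative initial condition (assumed): mean 2.0, variance 0.25.
times = [0.1 * k for k in range(51)]
F = [free_energy(*gaussian_moments(t, m0=2.0, v0=0.25)) for t in times]

# The free energy should never increase from one time step to the next.
is_decreasing = all(b <= a for a, b in zip(F, F[1:]))
print(is_decreasing)
```

A learned arrow could then be compared against the ordering of times implied by F on the same process, which is one possible reading of the test described above.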
Original abstract
We humans seem to have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we address the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture meaningful information about the environment, which in turn can be used to measure reachability, detect side-effects and to obtain an intrinsic reward signal. We show empirical results on a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a known notion of an arrow of time given by the celebrated Jordan-Kinderlehrer-Otto result.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a method to learn an 'arrow of time' from observed trajectories in Markov Decision Processes. The learned direction is used to quantify reachability, detect side effects of actions, and construct an intrinsic reward. Empirical demonstrations are provided on discrete gridworlds and continuous control tasks, and the authors report that for an unspecified class of stochastic processes the learned arrow aligns reasonably with the Jordan-Kinderlehrer-Otto (JKO) gradient-flow notion of time asymmetry.
Significance. If the empirical agreement with JKO holds under the stated conditions and the auxiliary tasks (reachability, side-effect detection) prove robust, the work supplies a concrete, data-driven proxy for temporal irreversibility that could be integrated into model-based RL pipelines. The explicit link to an established optimal-transport result is a positive feature when the comparison is made quantitative.
major comments (3)
- [§4] §4 (or wherever the loss for the arrow is defined): the optimization objective used to train the arrow predictor must be stated explicitly; without the precise functional form it is impossible to judge whether the reported agreement with the JKO result is a non-trivial empirical finding or follows by construction from the chosen loss.
- [Table 2 / Figure 3] Table 2 / Figure 3 (JKO comparison): the quantitative measure of agreement (e.g., Spearman rank correlation, Wasserstein distance between implied measures, or regression R²) is not reported; the phrase 'agrees reasonably well' is therefore not falsifiable from the given data.
- [§5.2] §5.2 (reachability experiments): the baseline against which the arrow-based reachability is compared is not described; if the baseline already encodes forward/backward asymmetry, the incremental value of the learned arrow cannot be assessed.
minor comments (2)
- [Abstract] The class of stochastic processes for which the JKO agreement is claimed should be stated in the abstract and introduction.
- [Throughout] Notation for the learned arrow (e.g., whether it is a scalar field, a vector field, or a ranking) is introduced inconsistently across sections.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. Below we address each major point and indicate the revisions we will incorporate.
Point-by-point responses
- Referee: [§4] §4 (or wherever the loss for the arrow is defined): the optimization objective used to train the arrow predictor must be stated explicitly; without the precise functional form it is impossible to judge whether the reported agreement with the JKO result is a non-trivial empirical finding or follows by construction from the chosen loss.
  Authors: We agree that the precise functional form of the loss must be stated explicitly. In the revised manuscript we will add the full mathematical definition of the optimization objective in Section 4, making clear that the reported JKO agreement is an empirical observation rather than an algebraic consequence of the loss.
  Revision: yes
- Referee: [Table 2 / Figure 3] Table 2 / Figure 3 (JKO comparison): the quantitative measure of agreement (e.g., Spearman rank correlation, Wasserstein distance between implied measures, or regression R²) is not reported; the phrase 'agrees reasonably well' is therefore not falsifiable from the given data.
  Authors: We accept that a quantitative metric is required. The revision will include a numerical measure of agreement (Spearman rank correlation between the learned arrow values and the JKO gradient-flow values) in the caption or a new column of Table 2 / Figure 3.
  Revision: yes
- Referee: [§5.2] §5.2 (reachability experiments): the baseline against which the arrow-based reachability is compared is not described; if the baseline already encodes forward/backward asymmetry, the incremental value of the learned arrow cannot be assessed.
  Authors: We will expand §5.2 to give the exact construction of the baseline, including whether it incorporates any forward/backward asymmetry, so that the incremental contribution of the learned arrow can be evaluated.
  Revision: yes
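The rebuttal names Spearman rank correlation as the agreement metric for the JKO comparison. A minimal sketch of that computation follows; the two value lists are made-up numbers for illustration, not results from the paper, and the helper names are hypothetical. The rank function assumes there are no tied values.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n - 1) / 2.0  # mean of the ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for both vectors without ties
    return cov / var

# Made-up numbers for illustration only; not results from the paper.
learned_arrow = [0.10, 0.35, 0.20, 0.90, 0.60]
reference = [-2.0, -1.6, -1.1, -0.2, -0.7]
rho = spearman(learned_arrow, reference)
print(rho)
```

Because the statistic depends only on ranks, it is unchanged by any monotone rescaling of either quantity, which suits a comparison where the learned values and the reference values live on different scales.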
Circularity Check
No significant circularity
The paper's central claim is an empirical demonstration that a learned arrow of time in Markov processes agrees reasonably well with the Jordan-Kinderlehrer-Otto result for a class of stochastic processes. No equations, fitting procedures, or derivation steps are visible in the provided text that would reduce any prediction to its inputs by construction. The approach treats the existence of a learnable arrow as a modeling choice and illustrates downstream uses (reachability, side-effects, intrinsic reward) without invoking self-citations, uniqueness theorems, or ansatzes that could create circularity. The result is presented as an illustration rather than a forced equivalence, making the derivation self-contained against external benchmarks.
Lean theorems connected to this paper
- Foundation/ArrowOfTime.lean: z_monotone_absolute, arrow_from_z, entropy_from_berry
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "the learned arrow of time agrees reasonably well with a known notion of an arrow of time given by the celebrated Jordan-Kinderlehrer-Otto result... Free-Energy functional F[ρ(·,t)] can only decrease with time"
- Foundation/ArrowOfTime.lean: forward_accumulates, z_monotone_absolute
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "h must remain constant (in expectation) along reversible trajectories... along trajectories with irreversible transitions, one may hope that h not only increases, but also quantifies the irreversibility"
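The quoted behaviour of h (constant along reversible trajectories, increasing across irreversible transitions) can be reproduced on a toy example. The sketch below fits one value of h per state on a hypothetical three-state chain by gradient ascent on an assumed objective: the weighted sum of d - lam * d², where d is the change in h across a transition. This objective is a stand-in chosen for illustration; the review notes that the paper's actual loss is not stated in the material shown here. The chain, the weights, and all names are assumptions.

```python
# Hypothetical 3-state chain: 0 <-> 1 is reversible, 1 -> 2 is irreversible.
# Each entry is (state, next_state, weight); the weights are assumptions.
transitions = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0)]

def fit_h(transitions, n_states, lam=0.5, lr=0.1, steps=2000):
    """Gradient ascent on sum_w [ d - lam * d^2 ], with d = h[s_next] - h[s].

    This objective is an assumed stand-in, not the loss from the paper.
    """
    h = [0.0] * n_states
    for _ in range(steps):
        grad = [0.0] * n_states
        for s, s_next, w in transitions:
            d = h[s_next] - h[s]
            g = w * (1.0 - 2.0 * lam * d)  # derivative with respect to h[s_next]
            grad[s_next] += g
            grad[s] -= g  # derivative with respect to h[s] has the opposite sign
        h = [value + lr * step for value, step in zip(h, grad)]
    return h

h = fit_h(transitions, n_states=3)
print([round(v, 3) for v in h])
```

Under this assumed objective the linear terms of the two reversible transitions cancel, so only the quadratic penalty acts on them and pulls h[0] and h[1] together, while the one-way transition leaves a positive gap of 1 / (2 * lam) between h[1] and h[2].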
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[22]
15 0 50000 100000 150000 200000 250000 Iterations 0.0 0.2 0.4 0.6 0.8 1.0Probability of Reaching the Goal Without h-Potential With h-Potential (a) Probability of reaching the goal. 0 50000 100000 150000 200000 250000 Iterations 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0Mean Number of Vases Broken Without h-Potential With h-Potential (b) Number of vases broken. F...
work page 2012
C.1.2 2D World with Drying Tomatoes
The environment considered comprises a 7 × 7 2D world where each cell is initially occupied by a watered tomato plant. The agent waters the cell it occupies, restoring the moisture level of the plant in that cell to 100%. However, at each step in which the agent does not water a plant, the plant loses some moisture (by 2% of maxim...
plotted against an engineered reward, which in this case is the amount of moisture gained by the tomato plant the agent just watered. Gist: the h-Potential captures useful information about the environment, which can then be utilized to define intrinsic rewards.
$$\hat{r}_t = -\{\eta(s_{t-1} \to s_t) - \mathrm{RunningAverage}_t[\eta]\} \qquad (27)$$
where we use a momentum of 0.95 to evaluate the...
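Equation (27) can be sketched in a few lines. The excerpt does not say how η is computed or how the running average is initialised, so the example treats η as a given sequence of numbers and assumes an exponential running average with the stated momentum of 0.95, started at the first value and updated before the reward is taken. The function name and the sample values are hypothetical.

```python
def intrinsic_rewards(etas, momentum=0.95):
    """r_t = -(eta_t - running_average_t), with an exponential running average.

    Assumptions: the average starts at the first eta and is updated before
    the reward is computed; the excerpt does not specify either detail.
    """
    rewards = []
    avg = etas[0]
    for eta in etas:
        avg = momentum * avg + (1.0 - momentum) * eta
        rewards.append(avg - eta)  # equal to -(eta - avg)
    return rewards

# Hypothetical eta values for three consecutive transitions.
print(intrinsic_rewards([0.0, 0.0, 1.0]))
```

With this sign convention a transition whose η is above its recent average receives a negative reward, and a constant η sequence yields rewards of zero.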
We indeed find that states in the vicinity of $\theta = 0$ have a larger h-potential, owing to the fact that all trajectories converge to $(\theta, \dot{\theta}) = 0$ for large $t$ due to the dissipative action of friction.
C.2.2 Continuous Mountain Car
The environment considered is a variation of Mountain Car (Sutton and Barto, 2011), where the state-space is a tuple $(x, \dot{x})$ of th...