SUNTA: Hierarchical Video Prediction with Surprise-based Chunking

Masahiro Suzuki; Tomoshi Iiyama; Yutaka Matsuo

arxiv: 2607.02087 · v1 · pith:72NFHQLAnew · submitted 2026-07-02 · 💻 cs.AI · cs.LG

Pith reviewed 2026-07-03 13:17 UTC · model grok-4.3

keywords: hierarchical state-space models · video prediction · surprise-based chunking · temporal abstraction · long-horizon prediction · prediction errors · state-space models

The pith

Prediction-error chunking in hierarchical state-space models sustains video forecasts over 250 timesteps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that chunk boundaries in hierarchical state-space models should be set by prediction errors instead of fixed lengths or similarity measures, because errors directly signal when longer context is required. It introduces SUNTA, which uses a decoupled training strategy to preserve surprise signals and measures top-down surprise through internal inconsistency within imagined rollouts. Experiments on 2D and 3D video prediction tasks show the method maintains accuracy far longer than baselines. A sympathetic reader would care because current hierarchical predictors lose coherence quickly, limiting their use for extended planning or simulation. The central claim is that surprise-driven boundaries align better with the data's intrinsic temporal structure.
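The basic idea of cutting a sequence where prediction error spikes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `surprise_chunks`, the fixed scalar `threshold`, and the toy error trace are all assumptions made for the example.

```python
from typing import List


def surprise_chunks(errors: List[float], threshold: float) -> List[List[int]]:
    """Split timestep indices into chunks.

    A new chunk starts at every timestep whose prediction error exceeds
    `threshold` (hypothetical rule; the paper's exact criterion is not
    given in the text reviewed here).
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    for t, err in enumerate(errors):
        # A surprising step closes the running chunk and opens a new one.
        if err > threshold and current:
            chunks.append(current)
            current = []
        current.append(t)
    if current:
        chunks.append(current)
    return chunks


# Toy error trace (synthetic): low error except at t=3 and t=7.
errors = [0.1, 0.1, 0.2, 0.9, 0.1, 0.1, 0.2, 0.8, 0.1]
print(surprise_chunks(errors, threshold=0.5))
# [[0, 1, 2], [3, 4, 5, 6], [7, 8]]
```

Unlike fixed-length chunking, the chunk lengths here (3, 4, 2) follow the error trace rather than a preset stride.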

Core claim

SUNTA determines chunk boundaries by treating internal inconsistency during imagined rollouts as a top-down surprise metric and applies a decoupled training strategy to keep surprise signals usable without triggering hierarchical collapse; this produces models that sustain accurate video predictions over 250 timesteps in 2D and 3D environments while all tested baselines degrade within the first 10 timesteps.

What carries the argument

Internal inconsistency within imagined rollouts functions as the top-down surprise metric that sets chunk boundaries, enabled by decoupled training that isolates surprise signal preservation from end-to-end optimization.
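The text reviewed here gives no formula for internal inconsistency, so the following is only one possible reading: compare two latent trajectories that the model itself produces during imagination (one rolled forward by the low-level dynamics, one expected from the high-level state) and flag the first step where they disagree. The names `inconsistency` and `first_boundary`, the squared-distance measure, and the toy trajectories are all assumptions.

```python
from typing import List, Optional, Sequence


def inconsistency(
    low_level: Sequence[Sequence[float]],
    top_down: Sequence[Sequence[float]],
) -> List[float]:
    """Per-step squared distance between two imagined latent trajectories.

    No ground-truth observation is used, which is what would make such a
    score available during open-loop prediction (assumed interpretation).
    """
    if len(low_level) != len(top_down):
        raise ValueError("trajectories must have the same length")
    return [
        sum((a - b) ** 2 for a, b in zip(lo, hi))
        for lo, hi in zip(low_level, top_down)
    ]


def first_boundary(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first step whose score exceeds threshold, else None."""
    for t, s in enumerate(scores):
        if s > threshold:
            return t
    return None


# Synthetic latents: the two trajectories agree until step 2, where the
# second coordinate flips sign in the top-down expectation only.
low_level = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [0.3, 1.0]]
top_down = [[0.0, 1.0], [0.1, 1.0], [0.2, -1.0], [0.3, -1.0]]

scores = inconsistency(low_level, top_down)
print(scores)                       # [0.0, 0.0, 4.0, 4.0]
print(first_boundary(scores, 1.0))  # 2
```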

If this is right

The model sustains accurate predictions over 250 timesteps in both 2D and 3D video tasks.
Baselines that rely on fixed-length or similarity-based chunking lose accuracy within the first 10 timesteps.
Decoupled training prevents hierarchical collapse while retaining surprise signals for boundary detection.
Surprise-based chunking aligns boundaries more closely with intrinsic temporal structure than prior methods.
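One way to read "decoupled training" is as level-wise optimization in which the high-level loss cannot reshape the low-level model that produces the surprise signal. The formulation below is a sketch under that assumption, not the paper's objective: the symbols $\theta_1$, $\theta_2$, $\mathcal{L}^{(1)}$, $\mathcal{L}^{(2)}$, the threshold $\tau$, the negative log-likelihood form of surprise $e_t$, and the stop-gradient operator $\operatorname{sg}[\cdot]$ are all introduced here for illustration.

```latex
\begin{aligned}
\theta_1^{\star} &= \arg\min_{\theta_1} \mathcal{L}^{(1)}\!\left(\theta_1;\, x_{1:T}\right), \\
e_t &= -\log p_{\theta_1^{\star}}\!\left(x_t \mid x_{<t}\right), \qquad
b_t = \mathbb{1}\!\left[e_t > \tau\right], \\
\theta_2^{\star} &= \arg\min_{\theta_2} \mathcal{L}^{(2)}\!\left(\theta_2;\, \operatorname{sg}\!\left[z^{(1)}_{1:T}\right],\, b_{1:T}\right).
\end{aligned}
```

Under this reading, the boundaries $b_t$ depend only on the frozen low-level model, so the high-level objective has no path by which to flatten the surprise signal, which is the kind of degenerate segmentation the paper associates with end-to-end training.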

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inconsistency metric could be applied to set boundaries in non-visual sequence domains such as audio or text streams.
Longer stable rollouts would support more reliable model-based planning loops that query the predictor hundreds of steps ahead.
One could test whether replacing internal inconsistency with other surprise proxies, such as reconstruction variance, produces comparable boundary quality.
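Comparing surprise proxies requires some measure of boundary quality. A common choice, assumed here rather than taken from the paper, is precision, recall, and F1 of detected boundaries against ground truth with a small timing tolerance. The function name `boundary_f1` and the greedy one-to-one matching are simplifications for illustration.

```python
from typing import Iterable, Tuple


def boundary_f1(
    predicted: Iterable[int],
    truth: Iterable[int],
    tolerance: int = 0,
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for boundary timesteps.

    Each true boundary can be matched by at most one prediction that lies
    within +/- `tolerance` steps. Matching is greedy in time order, which
    is a simplification of optimal assignment.
    """
    pred = sorted(set(predicted))
    unmatched = sorted(set(truth))
    n_truth = len(unmatched)
    hits = 0
    for p in pred:
        for g in unmatched:
            if abs(p - g) <= tolerance:
                hits += 1
                unmatched.remove(g)  # safe: we break right after removing
                break
    precision = hits / len(pred) if pred else 0.0
    recall = hits / n_truth if n_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Synthetic example: three true boundaries, four detections.
precision, recall, f1 = boundary_f1([10, 21, 25, 40], [10, 20, 30], tolerance=1)
print(precision)  # 0.5
```

Here two of the four detections land within one step of a true boundary, so precision is 2/4 and recall is 2/3.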

Load-bearing premise

Internal inconsistency observed during imagined rollouts reliably marks points where longer-range context is needed and the decoupled training leaves usable surprise signals intact.

What would settle it

A direct experiment showing that SUNTA's chunk boundaries do not reduce prediction error accumulation compared with fixed-length chunking on the same video datasets, with accuracy dropping at similar rates beyond 10 timesteps.
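The comparison described above reduces to asking at which rollout step each model's accuracy first falls below an acceptable level. A minimal sketch follows; the function name `degradation_step`, the floor of 0.8, and both accuracy curves are invented for illustration and are not results from the paper.

```python
from typing import Optional, Sequence


def degradation_step(scores: Sequence[float], floor: float) -> Optional[int]:
    """First rollout step (0-indexed) at which the score drops below `floor`.

    Returns None if the score never drops below the floor within the
    evaluated horizon.
    """
    for t, s in enumerate(scores):
        if s < floor:
            return t
    return None


# Synthetic per-step accuracy curves (e.g. SSIM-like values in [0, 1]).
fixed_curve = [0.95, 0.90, 0.70, 0.50]
surprise_curve = [0.95, 0.94, 0.93, 0.92]

print(degradation_step(fixed_curve, floor=0.8))     # 2
print(degradation_step(surprise_curve, floor=0.8))  # None
```

If both chunking schemes returned similar degradation steps on the same datasets, that would count against the central claim.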

Figures

Figures reproduced from arXiv: 2607.02087 by Masahiro Suzuki, Tomoshi Iiyama, Yutaka Matsuo.

**Figure 1.** Similarity-based chunking vs. surprise-based chunking. Similarity-based chunking monitors observational changes and may over-segment superficial appearance changes or miss nonsalient semantic transitions. Surprise-based chunking cuts at peaks of prediction error, highlighting shifts in the latent dynamics. We argue that temporal abstraction should be driven by prediction errors, or surprise, within the in…

**Figure 2.** Overview of SUNTA. During inference (left), observations are encoded into low-level states and segmented into chunks at points of high prediction error. A bottom-up encoder aggregates each chunk into a single latent state for high-level dynamics modeling. During generation (right), the high-level model produces high-level states that are decoded into low-level sequences. Chunk boundaries are detected via t…

**Figure 3.** Hierarchical structures detected on the Bouncing Ball dataset. Each colored block denotes a chunk identified by the corresponding model. Ground-truth boundaries occur when the ball bounces and changes color. SUNTA precisely identifies these transitions, while VPR and VTA produce fragmented over-segmentation, and LOVE fails to detect meaningful boundaries altogether. The threshold $\tau_{\mathrm{inf}}$ determines the sensit…

**Figure 4.** Qualitative results of open-loop video prediction on the Bouncing Ball dataset. The top row shows the ground-truth image frames and other rows show the image frames generated by each model up to 250 steps ahead, conditioning on the first 50 context frames.

**Figure 5.** Open-loop prediction accuracy over the rollout horizon. We evaluate pixel-level quality via SSIM (higher is better ↑) and high-level latent-space prediction accuracy via KL divergence (lower is better ↓). All models are conditioned on the first 50 frames.

**Figure 6.** Qualitative results of open-loop video prediction on the Serial Nine Rooms dataset. The top row shows the ground-truth image frames and other rows show the image frames generated by each model up to 244 steps ahead, conditioning on the first 100 frames.

**Figure 7.** Figure 7 illustrates the train–test mismatch addressed by temporal pattern completion (TPC) and our regularization scheme (Section 3). At training time, the high-level encoder $q_{\theta_2}$ aggregates the entire low-level latent sequence within each chunk (Fig. 7a, top) to produce the chunk representation $s^{(2)}_t$. At test time, however, observations are revealed online: the most recent chunk in the context is typically trun…

**Figure 8.** Effect of decoupled training on chunk discovery (Bouncing Ball). End-to-end training collapses into a degenerate segmentation, whereas decoupled level-wise training preserves the surprise signal and recovers the ground-truth boundaries (where the ball bounces and changes color).

**Figure 9.** Qualitative results of open-loop video prediction on the 3D Maze dataset. The top row shows the ground-truth image frames and other rows show the image frames generated by each model up to 250 steps ahead, conditioning on the first 50 context frames.

**Figure 10.** Figure 10 visualizes the level-2 surprise signal across Bouncing Ball trajectories: once the high-level model has observed sufficient color history (six bounces), its surprise stays consistently low and nearly flat, indicating that a third level would have little to model in this setting. We observe the same trend on the other datasets and therefore restrict SUNTA to two levels in this work, leaving the study of de…
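Figure 5 reports latent-space accuracy as a KL divergence. The caption does not say which distributions are compared, so the sketch below assumes the common case of diagonal Gaussian latents and uses the standard closed form. The function name `kl_diag_gaussians` and the choice of standard deviations (rather than log-variances) as inputs are assumptions.

```python
import math
from typing import Sequence


def kl_diag_gaussians(
    mu_p: Sequence[float],
    sigma_p: Sequence[float],
    mu_q: Sequence[float],
    sigma_q: Sequence[float],
) -> float:
    """KL(p || q) in nats for diagonal Gaussians, summed over dimensions.

    Per dimension: log(sq / sp) + (sp^2 + (mp - mq)^2) / (2 sq^2) - 1/2.
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0], [1.0], [0.0], [1.0]))  # 0.0
# Unit-variance Gaussians one unit apart: (1 - 0)^2 / 2 = 0.5 nats.
print(kl_diag_gaussians([0.0], [1.0], [1.0], [1.0]))  # 0.5
```

KL divergence is asymmetric, so which distribution plays the role of prediction and which plays the role of reference matters when reading such a curve.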

Abstract

Hierarchical state-space models (HSSMs) offer a promising approach to long-horizon prediction by segmenting sequences into temporal chunks. However, their performance hinges on how chunk boundaries are determined. While prior HSSMs typically rely on fixed-length chunking or similarity-based boundary detection, these methods often misalign with the intrinsic temporal structure of the data. We argue that chunking should instead be driven by prediction errors, which more directly indicate when longer-range context becomes necessary. Nevertheless, integrating surprise-based chunking into HSSMs introduces critical challenges, including hierarchical collapse during end-to-end training and the absence of surprise signals during open-loop prediction. To address these issues, we propose Surprise-based Nested Temporal Abstraction (SUNTA), a method that employs a decoupled training strategy to preserve surprise signals and uses internal inconsistency as a top-down surprise metric to determine chunk boundaries within imagined rollouts. Experiments on video prediction tasks in 2D and 3D environments demonstrate that SUNTA outperforms baselines, uniquely maintaining accurate predictions over 250 timesteps, whereas all baselines degrade within the first 10 timesteps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SUNTA's abstract names real problems with hierarchical collapse and lost surprise signals in open-loop HSSMs and proposes decoupled training plus an internal inconsistency metric, but supplies zero equations, ablations, or training details to show the fixes work.

The core idea is to drive chunk boundaries in hierarchical state-space models by prediction error rather than fixed lengths or feature similarity. The authors flag two specific failure modes—end-to-end training collapsing the hierarchy and open-loop rollouts losing any surprise signal—and claim SUNTA solves both with a decoupled training schedule and an internal inconsistency metric used as a top-down surprise signal inside imagined trajectories.

What stands out is that they have isolated the exact points where prior HSSM chunking methods break and offered targeted remedies instead of another similarity heuristic. The reported outcome, accurate video prediction out to 250 timesteps in 2D and 3D settings while every baseline collapses inside the first 10, would be useful for anyone needing long-horizon rollouts if the numbers hold.

The gaps are large and central. No equations appear for how internal inconsistency is computed, so it is impossible to check whether the metric is independent of quantities already optimized inside the model or whether it actually supplies new information during open-loop prediction. There are no ablations on the decoupling step, no description of the environments or baselines, and no variance or statistical detail on the 250-timestep claim. The stress-test worry about decoupled training failing to preserve usable surprise signals is therefore still live; nothing in the abstract demonstrates that the signals survive or that chunk boundaries align with intrinsic temporal structure rather than training artifacts.

This is for researchers already building or extending hierarchical predictors for video or robotics who need concrete mechanisms for temporal abstraction. A reader looking for a worked-out alternative to fixed or similarity-based chunking would find the framing useful even if the evidence is still missing.

I would send it to peer review. The problems it targets are genuine and the proposed fixes are specific enough to be testable, but any serious referee will require the missing derivations, training protocol, and controls before the 250-timestep result can be taken as evidence.

Referee Report

1 major / 0 minor

Summary. The paper proposes Surprise-based Nested Temporal Abstraction (SUNTA), a hierarchical state-space model for long-horizon video prediction. Chunk boundaries are set using internal inconsistency during imagined rollouts as a top-down surprise metric. A decoupled training strategy is introduced to avoid hierarchical collapse in end-to-end training and to preserve usable surprise signals in open-loop prediction. Experiments on 2D and 3D video prediction tasks are reported to show that SUNTA sustains accurate predictions for 250 timesteps while all baselines degrade within the first 10 timesteps.

Significance. If the reported long-horizon results hold under scrutiny, the work would offer a concrete mechanism for aligning temporal chunking with intrinsic prediction-error structure rather than fixed lengths or similarity, addressing a recurring limitation in hierarchical state-space models for video.

major comments (1)

[Abstract] The 250-timestep performance claim rests entirely on the decoupled training strategy successfully preserving surprise signals without new instabilities during open-loop rollouts. The abstract identifies hierarchical collapse and missing surprise signals as the two critical challenges yet provides no equations, pseudocode, or verification procedure for the decoupling step, so it is impossible to determine whether the reported advantage follows from the method or from an unstated implementation detail.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for clarity on how the decoupled training strategy is specified. The abstract is a high-level summary; the full technical description, including equations and procedure, appears in the manuscript body. We address the point below.

Referee: [Abstract] The 250-timestep performance claim rests entirely on the decoupled training strategy successfully preserving surprise signals without new instabilities during open-loop rollouts. The abstract identifies hierarchical collapse and missing surprise signals as the two critical challenges yet provides no equations, pseudocode, or verification procedure for the decoupling step, so it is impossible to determine whether the reported advantage follows from the method or from an unstated implementation detail.

Authors: The abstract follows standard conventions by summarizing the two challenges and the high-level solution without equations or pseudocode. The decoupled training strategy is fully specified in Section 3.2 (Equations 4–7) together with Algorithm 1, which details the alternating optimization that prevents hierarchical collapse while keeping surprise signals available during open-loop rollouts. The verification procedure is the open-loop video-prediction protocol reported in Sections 4.2–4.3, where SUNTA alone sustains accuracy to 250 timesteps. No implementation detail is left unstated; the performance difference is therefore attributable to the described method. If the referee prefers an explicit cross-reference in the abstract, we can add one sentence directing readers to Section 3.2.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes SUNTA to solve hierarchical collapse and missing surprise signals via decoupled training plus internal inconsistency as a top-down metric for chunk boundaries in imagined rollouts. No equations, self-definitions, or fitted-input-as-prediction reductions appear in the provided text. Performance claims rest on external experimental benchmarks (250-timestep accuracy vs. baselines collapsing at 10 steps), not on any quantity being renamed or forced by construction from the inputs. The derivation is therefore self-contained against the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that surprise signals remain informative after decoupling and that internal inconsistency correlates with external prediction error.

axioms (1)

Domain assumption: prediction error is a more direct indicator of when longer-range context is needed than fixed length or similarity.
Stated explicitly in the abstract as the motivating argument for surprise-based chunking.
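The unstated assumption noted in this ledger, that internal inconsistency tracks external prediction error, is directly testable wherever ground truth is available: compute both signals on held-out sequences and measure their correlation. The sketch below uses a plain Pearson coefficient; the function name `pearson` and both signal traces are synthetic illustrations, not data from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic traces: the external error is exactly twice the internal
# score, so the correlation should be (numerically) 1.
internal = [0.1, 0.2, 0.9, 0.1]
external = [0.2, 0.4, 1.8, 0.2]
r = pearson(internal, external)
print(round(r, 6))  # 1.0
```

A correlation near zero on real rollouts would suggest the internal signal carries little information about where actual prediction errors occur.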


[1] [1]

Aakur and Sudeep Sarkar

Sathyanarayanan N. Aakur and Sudeep Sarkar. A perceptual prediction framework for self supervised event segmentation. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

2019

[2] [2]

xLSTM: Extended long short-term memory

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. InAdvances in Neural Information Processing Systems, 2024

2024

[3] [3]

Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y . Galliker, and Sergey Levine. Real-time execution of action chunking flow policies, 2025. URLhttps://arxiv.org/abs/2506.07339

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Botvinick

Matthew M. Botvinick. Hierarchical models of behavior and prefrontal function.Trends in Cognitive Sciences, 12(5), 2008

2008

[5] [5]

Chase and Herbert A

William G. Chase and Herbert A. Simon. Perception in chess.Cognitive Psychology, 4(1),

[6] [6]

doi: https://doi.org/10.1016/0010-0285(73)90004-2

ISSN 0010-0285. doi: https://doi.org/10.1016/0010-0285(73)90004-2

work page doi:10.1016/0010-0285(73)90004-2

[7] [7]

Hierarchical deep generative models for multi-rate multivariate time series

Zhengping Che, Sanjay Purushotham, Guangyu Li, Bo Jiang, and Yan Liu. Hierarchical deep generative models for multi-rate multivariate time series. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR, 10–15 Jul 2018

2018

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo de Lazcano, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR, abs/2306.13831, 2023.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In International Conference on Learning Representations, 2017.

Karl Friston, Rosalyn J. Moran, Yukie Nagai, Tadahiro Taniguchi, Hiroaki Gomi, and Josh Tenenbaum. World model learning and inference. Neural Networks, 144, 2021. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2021.09.011.

Philip Gage. A new algorithm for data compression. C Users J., 12(2), February 1994. ISSN 0898-9788.

Christian Gumbsch, Noor Sajid, Georg Martius, and Martin V. Butz. Learning hierarchical world models with adaptive temporal abstractions from discrete latent dynamics. In International Conference on Learning Representations, 2024.

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, volume 31, 2018.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2019.

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021.

Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. In Advances in Neural Information Processing Systems, volume 35, 2022.

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. URL https://arxiv.org/abs/2509.24527.

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In International Conference on Learning Representations, 2024.

Nicklas A Hansen, Hao Su, and Xiaolong Wang. Temporal difference learning for model predictive control. In Proceedings of the 39th International Conference on Machine Learning, volume 162, 2022.

Jeff Hawkins and Sandra Blakeslee. On Intelligence. Times Books, 2004. ISBN 0805074562.

Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, volume 8, 1995.

Jakob Hohwy. The Predictive Mind. Oxford University Press, 2013.

Sukjun Hwang, Brandon Wang, and Albert Gu. Dynamic chunking for end-to-end hierarchical sequence modeling, 2025. URL https://arxiv.org/abs/2507.07955.

Yiding Jiang, Evan Liu, Benjamin Eysenbach, J. Zico Kolter, and Chelsea Finn. Learning options via compression. In Advances in Neural Information Processing Systems, volume 35, 2022.

Taesup Kim, Sungjin Ahn, and Yoshua Bengio. Variational temporal abstraction. In Advances in Neural Information Processing Systems, volume 32, 2019.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. In Proceedings of the 31st International Conference on Machine Learning, volume 32, 2014.

Yann LeCun. A path towards autonomous machine intelligence. OpenReview Archive Preprint, 2022.

Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking, 2025. URL https://arxiv.org/abs/2507.07969.

Paul Mattes, Rainer Schlosser, and Ralf Herbrich. Hieros: Hierarchical imagination on structured state space sequence world models. In Proceedings of the 41st International Conference on Machine Learning, volume 235, 2024.

Johan Medrano, Karl Friston, and Peter Zeidman. Linking fast and slow: The case for generative models. Network Neuroscience, 8, 2024.

George A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review, 63(2), March 1956.

Alireza Modirshanechi, Johanni Brea, and Wulfram Gerstner. A taxonomy of surprise definitions. Journal of Mathematical Psychology, 110, 2022. ISSN 0022-2496. doi: https://doi.org/10.1016/j.jmp.2022.102712.

Ramy Mounir, Sujal Vijayaraghavan, and Sudeep Sarkar. Streamer: Streaming representation learning and event segmentation in a hierarchical manner. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023.

Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017.

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Byte latent transformer: Patches scale better than tokens, 2024. URL https://arxiv.org/abs/2412.09871.

Karl Pertsch, Oleh Rybkin, Frederik Ebert, Chelsea Finn, Dinesh Jayaraman, and Sergey Levine. Long-horizon visual planning with goal-conditioned hierarchical predictors. arXiv preprint arXiv:2006.13205, 2020.

Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 1999. doi: 10.1038/4580.

Vaibhav Saxena, Jimmy Ba, and Danijar Hafner. Clockwork variational autoencoders. In Advances in Neural Information Processing Systems, volume 34, 2021.

Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2), 1992. doi: 10.1162/neco.1992.4.2.234.

Vaisakh Shaj, Saleh Gholam Zadeh, Ozan Demir, Luiz Ricardo Douat, and Gerhard Neumann. Multi time scale world models. In Advances in Neural Information Processing Systems, volume 36, 2023.

Richard S. Sutton. TD Models: Modeling the World at a Mixture of Time Scales. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 1995. ISBN 1-55860-377-8.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1), 1999.

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. Adopt: Modified adam can converge with any β2 with the optimal rate. In Advances in Neural Information Processing Systems, 2024.

Shuchen Wu, Noemi Elteto, Ishita Dasgupta, and Eric Schulz. Learning structure from the ground up—hierarchical representation learning by chunking. In Advances in Neural Information Processing Systems, volume 35, 2022.

Yuichi Yamashita and Jun Tani. Emergence of functional hierarchy in a multiple timescale neural network model: A humanoid robot experiment. PLOS Computational Biology, 4, 2008. doi: 10.1371/journal.pcbi.1000220.

Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. Megabyte: Predicting million-byte sequences with multiscale transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023.

Jeffrey M Zacks, Nicole K Speer, Khena M Swallow, Todd S Braver, and Jeremy R Reynolds. Event perception: a mind-brain perspective. Psychological bulletin, 133(2), March 2007. ISSN 0033-2909. doi: 10.1037/0033-2909.133.2.273.

Alexey Zakharov, Matthew Crosby, and Zafeirios Fountas. Episodic memory for subjective-timescale models. In ICML 2021 Workshop on Unsupervised Reinforcement Learning, 2021.

Alexey Zakharov, Qinghai Guo, and Zafeirios Fountas. Variational predictive routing with nested subjective timescales. In International Conference on Learning Representations, 2022.

A Hyperparameters

Table 3 summarizes the hyperparameters used in our main experiments. Unless otherwise stated, the same values are used across all datasets.

Table 3: Hyperpa...