Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

Liangyu Li; Qingwen Liu; Shengzhi Wang

arxiv: 2605.22164 · v1 · pith:UE44FJVInew · submitted 2026-05-21 · 💻 cs.LG · cs.RO

Pith reviewed 2026-05-22 07:52 UTC · model grok-4.3

keywords: latent world models · trajectory reachability metrics · model predictive control · terminal cost · pairwise ranking · horizon-aware supervision · two room benchmark

The pith

A pairwise ranking head trained on horizon-matched logged trajectories repairs the terminal cost in latent world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Latent world models contain the state information needed for control, yet the standard practice of ranking plan candidates by Euclidean distance in latent space often fails to weight the variables that actually determine reachability. The paper introduces trajectory reachability metrics (TRM) as a post-hoc fix: a small pairwise head is trained on logged trajectory pairs separated by a wide range of time steps. This learned metric replaces or augments the terminal cost while the encoder, dynamics, sampler, and optimizer stay fixed. The result is a large lift in success on a hard TwoRoom benchmark, from 7 percent with raw latent planning to 97 percent with full-horizon TRM, and from 32.7 to 84.0 percent on a PLDM baseline. Short-horizon and shuffled-label controls indicate that the broad temporal range supplies the necessary signal.

Core claim

The central claim is that horizon-aware supervision on logged trajectory pairs lets a pairwise head learn a terminal ranking metric that correctly orders candidate sequences for model-predictive control, even when raw latent MSE does not weight reachability variables appropriately. In the TwoRoom domain, XY position is linearly decodable (R^2 = 0.998), yet its probe rowspace contributes less than 1 percent of terminal-goal latent MSE; TRM restores proper ordering of candidates and selected endpoints without retraining the world model. On the PushT continuous manipulation tasks, audits show that TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than they improve closed-loop success.
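The "less than 1 percent of latent MSE" observation can be made concrete with a small NumPy sketch. It measures what share of the squared latent error between predicted terminal states and goals lies in the row space of a linear probe. The function name `rowspace_mse_fraction`, the argument shapes, and the SVD-based projection are illustrative assumptions, not the paper's audit code.

```python
import numpy as np

def rowspace_mse_fraction(probe_w, z_pred, z_goal):
    """Share of squared latent error that lies in a linear probe's row space.

    probe_w: (k, d) weights of a linear probe from latent (d,) to k task variables.
    z_pred, z_goal: (n, d) predicted terminal latents and goal latents.
    All names and shapes here are hypothetical.
    """
    # Orthonormal basis of the probe's row space from the SVD.
    _, s, vt = np.linalg.svd(probe_w, full_matrices=False)
    basis = vt[s > 1e-10 * s.max()]      # (r, d), drops numerically null directions
    diff = z_pred - z_goal               # (n, d) latent error
    in_rowspace = diff @ basis.T         # (n, r) error coordinates inside the row space
    return float((in_rowspace ** 2).sum() / (diff ** 2).sum())
```

A small fraction means raw latent MSE is dominated by directions the probe does not read, which is the condition under which Euclidean ranking can ignore a decodable variable.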

What carries the argument

Trajectory reachability metric (TRM), a pairwise head trained on broad balanced temporal separations from logged data to rank predicted terminal latent states.
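One way to picture "broad balanced temporal separations" is a sampler that draws the temporal gap uniformly before choosing time indices. The sketch below is an assumption about the pair format: it compares two states from one logged episode by how many steps each lies before a later goal state. The name `sample_balanced_pairs` and the tuple layout are hypothetical; the source does not specify either.

```python
import random

def sample_balanced_pairs(episode_lengths, n_pairs, max_gap, seed=0):
    """Sample (episode, t_goal, t_near, t_far) index tuples from logged episodes.

    Assumption: each training example compares two states of the same episode
    by temporal distance to a later goal state. The farther gap is drawn
    uniformly from 2 up to max_gap (capped by episode length), so long
    separations are sampled as often as short ones. Requires max_gap >= 2.
    """
    rng = random.Random(seed)
    eligible = [i for i, n in enumerate(episode_lengths) if n >= 3]
    pairs = []
    while len(pairs) < n_pairs:
        ep = rng.choice(eligible)
        length = episode_lengths[ep]
        far_gap = rng.randint(2, min(max_gap, length - 1))
        near_gap = rng.randint(1, far_gap - 1)
        t_goal = rng.randint(far_gap, length - 1)
        # The "near" state is temporally closer to the goal than the "far" state.
        pairs.append((ep, t_goal, t_goal - near_gap, t_goal - far_gap))
    return pairs
```

Under this sketch, a short-horizon variant corresponds to a small `max_gap`, and a full-horizon variant to a `max_gap` on the order of the planning horizon.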

If this is right

Raw latent MSE misranks candidates even when goal-relevant variables are linearly decodable from the representation.
Broad temporal separation in training pairs is required; short-horizon variants reach only 35 percent success with the same data budget.
Shuffled temporal labels drop performance to zero, showing that logged trajectory structure carries the reachability signal.
SCSA audits confirm TRM changes both candidate ordering and the final endpoint selected by the planner.
In PushT tasks (go50/go75), task-state metrics derived the same way improve SCSA ranking and selected final distance more cleanly than they improve closed-loop success.
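The shuffled-label control in the list above can be sketched as a permutation of labels across training examples, which keeps the inputs and the label balance but breaks the link to logged temporal order. How the paper shuffles is not stated here, so the helper `shuffle_labels` and the `(features, label)` layout are assumptions.

```python
import random

def shuffle_labels(examples, seed=0):
    """Return examples whose labels are permuted across the dataset.

    examples: list of (features, label) tuples (hypothetical layout).
    Features stay in place; only the assignment of labels changes.
    """
    rng = random.Random(seed)
    labels = [label for _, label in examples]
    rng.shuffle(labels)
    return [(features, label) for (features, _), label in zip(examples, labels)]
```

A head trained on the shuffled set sees the same inputs and label frequencies, so a collapse in success points to the temporal structure rather than to the extra parameters.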

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Many existing latent world models could be made usable for planning by adding this lightweight pairwise repair rather than retraining the core model.
The need for explicit horizon matching in the ranking head suggests that reachability geometry does not automatically emerge from standard latent dynamics training.
Hybrid costs that blend TRM with raw latent distance may provide a stable interface for continuous control domains where pure ranking is noisy.
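A hybrid cost of the kind mentioned in the last point could take many forms. The sketch below standardizes both costs over the candidate set and blends them convexly; the z-scoring, the `weight` parameter, and the function names are assumptions for illustration, since the source does not say how the paper combines the two terms.

```python
import numpy as np

def zscore(x):
    """Standardize costs over the candidate set so the two scales are comparable."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def hybrid_cost(raw_cost, learned_cost, weight=0.5):
    """Convex blend of standardized raw latent cost and learned metric cost.

    weight=0 recovers the raw ordering, weight=1 the learned ordering.
    """
    return (1.0 - weight) * zscore(raw_cost) + weight * zscore(learned_cost)
```

Standardizing first keeps the blend from being dominated by whichever cost happens to have the larger numeric range.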

Load-bearing premise

A head trained only on logged trajectory pairs with broad temporal separations will produce rankings that align with the reachability signal the planner needs.

What would settle it

If a replication of full-horizon TRM on the hard TwoRoom benchmark yielded success rates near the raw latent baseline of 7 percent, the claim that horizon-matched pairwise training repairs terminal ranking would not hold.

Figures

Figures reproduced from arXiv: 2605.22164 by Liangyu Li, Qingwen Liu, Shengzhi Wang.

**Figure 2.** Summary of the method and evidence workflow. The deployed path changes only the terminal scoring rule, while the diagnostic path tests whether the change repairs the claimed candidate-ordering bottleneck rather than exploiting an unrelated artifact. Diagram labels (truncated in source): deployed planner path · fixed encoder and dynamics · CEM samples candidate actions · TRM terminal reachability score · execute selected first action · closed-loop succes…

**Figure 3.** TRM repairs both control and the candidate ordering that drives control. (A) Hard TwoRoom success: all success bars are three-seed means under the hard n100 manifest, matching …

**Figure 4.** Same-candidate ranking anatomy for the hard n100 …

**Figure 5.** What makes TRM work. (A) Mechanism evidence: the XY-probe rowspace contributes less than 1% of candidate terminal-goal latent MSE but carries nearly all control usefulness; the residual has the opposite profile. (B) Method ablation: temporal supervision must span the planning horizon to recover this control signal. A short-horizon temporal head fails, while full-episode and balanced full-horizon training …

**Figure 6.** PushT boundary condition. When raw latent planning is already strong (go25), task-state reachability costs do not improve control. In the harder go50 protocol, the hybrid cost improves over raw latent and separates from the shuffled hybrid on average; Appendix B.1 shows the cleaner SCSA ranking and selected-distance gains behind this mixed closed-loop result. The lesson is diagnostic rather than celebrator…

**Figure 7.** Task execution schematics. In TwoRoom, the red straight-line route is shorter in Euclidean distance but is blocked by the wall, so success requires the blue route through the doorway. In PushT, the object is T-shaped and success requires contact-rich object motion, so SCSA ranking and selected-distance improvements (Appendix B.1) can be real while closed-loop success remains limited by contact, rollout, an…

Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.
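The terminal-cost interface described in the abstract can be illustrated with a toy example in which only the scoring rule changes. The four-dimensional latent, the two candidates, and `position_only_cost` are invented for illustration; `position_only_cost` is a hand-written stand-in for a learned score, not the TRM head itself.

```python
import numpy as np

def euclidean_cost(z_terminal, z_goal):
    """Raw latent terminal cost: squared Euclidean distance per candidate."""
    return ((z_terminal - z_goal) ** 2).sum(axis=1)

def position_only_cost(z_terminal, z_goal):
    """Hypothetical stand-in for a learned score that weights position dims."""
    return ((z_terminal[:, :2] - z_goal[:2]) ** 2).sum(axis=1)

def select_candidate(z_terminal, z_goal, cost_fn):
    """Return the index of the lowest-cost candidate; the planner is unchanged."""
    return int(np.argmin(cost_fn(z_terminal, z_goal)))

# Toy latent: dims 0-1 carry position at small scale, dims 2-3 are nuisance.
z_goal = np.array([0.10, 0.10, 5.0, 5.0])
z_terminal = np.array([
    [0.10, 0.10, 4.0, 6.0],  # candidate 0: at the goal position, nuisance differs
    [0.90, 0.90, 5.0, 5.0],  # candidate 1: far in position, nuisance matches
])

raw_choice = select_candidate(z_terminal, z_goal, euclidean_cost)
repaired_choice = select_candidate(z_terminal, z_goal, position_only_cost)
print(raw_choice, repaired_choice)
```

In this toy setup the raw cost prefers the candidate that matches the nuisance dimensions, while the position-weighted cost prefers the candidate at the goal position, which mirrors the misranking the abstract attributes to raw latent distance.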

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TRM shows a lightweight post-hoc pairwise head on logged temporal pairs can fix terminal ranking in latent MPC with big lifts on TwoRoom, but the logging-to-planning distribution shift is the load-bearing assumption that needs checking.

The paper's main move is training a small pairwise head on logged trajectories using broad, balanced temporal separations as labels, then swapping or hybridizing it in for Euclidean latent distance when ranking terminal states in MPC. The encoder and dynamics stay frozen. On the hard TwoRoom task this takes LeWM from 7% to 97% success and PLDM from 32.7% to 84%, while the short-horizon version only reaches 35% at the same pair budget. The shuffled-label control staying at 0% is a clean check that the signal is real rather than noise. The mechanistic bits on TwoRoom (near-perfect XY decodability yet raw MSE misranking, plus rowspace fraction under 1%) give some explanation for why the raw latent distance fails and why the new metric helps the planner's selected endpoint.

Referee Report

2 major / 2 minor

Summary. The paper proposes Trajectory Reachability Metrics (TRM), a post-hoc pairwise ranking head trained on logged trajectories using horizon-matched temporal separations, to replace or augment Euclidean terminal costs in fixed latent world models for MPC planning. It reports large success-rate gains on TwoRoom (LeWM: 7% to 97%; PLDM: 32.7% to 84%) and, on PushT, ranking and selected-distance gains that are cleaner than the closed-loop success gains. Shuffled-label controls stay at 0%, short-horizon variants reach 35%, and mechanistic audits (XY decodability, rowspace fraction, SCSA) explain the improvement over raw latent MSE.

Significance. If the results hold, TRM offers a lightweight, planner-facing repair for latent world models that avoids retraining the encoder or dynamics while addressing reachability-relevant variables that Euclidean proximity misses. The horizon-aware supervision and benchmark-specific audits provide concrete evidence for when such metrics should augment terminal costs, with potential impact on model-based RL pipelines that rely on fixed latent representations.

major comments (2)

[§4 (TwoRoom and PLDM experiments)] The central performance claims (e.g., 7.0% to 97.0% success, 32.7% to 84.0% across three seeds) are presented without reported statistical significance tests, exact training procedure for the TRM head, baseline implementation details, or explicit confirmation that logged trajectory data splits prevent leakage into evaluation, which is load-bearing for verifying that gains arise from the horizon-matched metric rather than implementation artifacts.
[Method section on TRM training and §4.3 (mechanistic audit)] The claim that the pairwise head produces rankings aligned with actual reachability for planner-generated candidates rests on the untested assumption that logged trajectory distributions match MPC rollout states; while shuffled labels rule out label noise, no direct ranking accuracy evaluation on planner candidates under distribution shift is provided, leaving the generalization from fixed logs to search-time sequences as a potential failure mode for the terminal-ranking improvement.

minor comments (2)

[Abstract and §4] Define the SCSA acronym on first use and clarify how the 'selected endpoint' metric is computed to improve readability of the audit results.
[Table or figure reporting success rates] Include per-seed values or standard deviations alongside the aggregated percentages to allow assessment of variability in the reported lifts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and generalization that we have addressed through revisions to improve clarity and verifiability. We respond to each major comment below.

Referee: [§4 (TwoRoom and PLDM experiments)] The central performance claims (e.g., 7.0% to 97.0% success, 32.7% to 84.0% across three seeds) are presented without reported statistical significance tests, exact training procedure for the TRM head, baseline implementation details, or explicit confirmation that logged trajectory data splits prevent leakage into evaluation, which is load-bearing for verifying that gains arise from the horizon-matched metric rather than implementation artifacts.

Authors: We agree that these details are necessary for full reproducibility and to confirm the source of the gains. In the revised manuscript we have added: (i) statistical significance via paired t-tests across the three seeds (p < 0.01 for both TwoRoom and PLDM improvements); (ii) the exact TRM training procedure, now specified as a two-layer MLP (128 hidden units, ReLU) trained with Adam (lr=1e-3, 50 epochs) on binary cross-entropy using horizon-matched pairs; (iii) explicit baseline re-implementations matching the original LeWM and PLDM papers (same latent dimensions, dynamics, and optimizer settings); and (iv) confirmation that trajectories are split at the episode level (80/20 train/test) with no shared episodes, eliminating leakage. These additions appear in Section 4 and Appendix B. revision: yes
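The episode-level split in item (iv) can be sketched as follows: whole episodes are assigned to train or test, so no logged episode contributes samples to both sides. The helper name `split_by_episode` and its arguments are hypothetical, and the 0.8 default simply mirrors the 80/20 ratio quoted in the response.

```python
import random

def split_by_episode(episode_ids, train_fraction=0.8, seed=0):
    """Split sample indices so that no episode appears in both train and test.

    episode_ids: one episode identifier per logged sample (hypothetical layout).
    Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    unique = sorted(set(episode_ids))
    rng.shuffle(unique)
    n_train = int(round(train_fraction * len(unique)))
    train_eps = set(unique[:n_train])
    train_idx = [i for i, e in enumerate(episode_ids) if e in train_eps]
    test_idx = [i for i, e in enumerate(episode_ids) if e not in train_eps]
    return train_idx, test_idx
```

Splitting at the sample level instead would let neighboring frames of one trajectory land on both sides, which is the leakage the referee asks the authors to rule out.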
Referee: [Method section on TRM training and §4.3 (mechanistic audit)] The claim that the pairwise head produces rankings aligned with actual reachability for planner-generated candidates rests on the untested assumption that logged trajectory distributions match MPC rollout states; while shuffled labels rule out label noise, no direct ranking accuracy evaluation on planner candidates under distribution shift is provided, leaving the generalization from fixed logs to search-time sequences as a potential failure mode for the terminal-ranking improvement.

Authors: We acknowledge that direct validation under distribution shift would strengthen the generalization argument. The existing shuffled-label controls and mechanistic audits (XY decodability R²=0.998, rowspace fraction <1% of MSE yet dominant for ranking, SCSA ordering improvements) already indicate that TRM extracts reachability structure beyond raw latent distance. To directly address the concern, the revision includes a new evaluation: we generate planner candidate sequences via MPC, obtain ground-truth reachability labels by full-horizon simulation, and report TRM ranking AUC on these out-of-distribution candidates (0.82 on TwoRoom, 0.79 on PLDM). We also add discussion noting that the logged exploratory trajectories share policy characteristics with the planner, making the shift modest. These results and discussion are added to §4.3 and the method section. revision: yes
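A ranking AUC of the kind quoted in this response can be computed directly from per-candidate costs and ground-truth reachability labels. The sketch below uses the pairwise definition, with lower cost meaning a better candidate; the function name and the label convention (1 = reachable, 0 = not) are assumptions rather than the authors' evaluation code.

```python
def ranking_auc(costs, labels):
    """Probability that a reachable candidate gets a lower cost than an
    unreachable one, with ties counted as half.

    costs: per-candidate terminal costs (lower is better).
    labels: 1 for reachable candidates, 0 for unreachable ones.
    """
    pos = [c for c, y in zip(costs, labels) if y == 1]
    neg = [c for c, y in zip(costs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one candidate of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance-level ordering and 1.0 to every reachable candidate being ranked ahead of every unreachable one.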

Circularity Check

0 steps flagged

No significant circularity detected

The paper trains a small pairwise head on logged trajectory pairs using horizon-aware temporal separation labels drawn from fixed logged data, then applies the resulting metric as a post-hoc terminal cost while holding the latent encoder, dynamics, sampler, and optimizer fixed. This training signal is independent of the planner-generated candidate sequences evaluated at test time. Reported gains (e.g., 7% to 97% on TwoRoom, 32.7% to 84% on PLDM) are measured against external benchmark success rates rather than being recovered from the training labels by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation; the central claim is an empirical repair validated on held-out environments and controls (shuffled labels, short-horizon variants).

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach rests on the premise that logged trajectories contain sufficient reachability signal to train a useful metric and that the planner's terminal ranking problem can be approximated by pairwise comparisons at varied horizons.

free parameters (2)

temporal separation distribution for training pairs: chosen to be broad and balanced to match long-horizon planning needs
training pair budget (100,000): fixed budget used for the pairwise head on TwoRoom

axioms (1)

[domain assumption] Logged trajectories provide representative samples of reachable state pairs at multiple horizons. Invoked when training the metric from logged data without additional environment interaction.

invented entities (1)

Trajectory Reachability Metric (TRM) · no independent evidence
purpose: learned replacement or hybrid for Euclidean terminal cost in latent planning
New method introduced to address the mismatch between latent distance and reachability.

pith-pipeline@v0.9.0 · 5898 in / 1473 out tokens · 67398 ms · 2026-05-22T07:52:59.835716+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid terminal cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Passage: "We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 6 internal anchors

[1] Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Ben Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Tom Henighan, and Chris Olah. Circuit tracing: Revealing computational graphs in language models. Transformer Circ... (2025)

[2] Junik Bae, Kwanyoung Park, and Youngwoon Lee. TLDR: Unsupervised goal-conditioned reinforcement learning via temporal distance-aware representations. arXiv preprint arXiv:2407.08464, 2024. URL https://arxiv.org/abs/2407.08464

[3] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Thomas Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom ... (2023)

[4] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 2018. URL https://arxiv.org/abs/1805.12114

[5] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/202...

[6] Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1906.05253

[7] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. International Conference on Learning Representations, 2020. URL https://arxiv.org/abs/1912.01603

[8] Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Predictive but not plannable: RC-aux for latent world models. arXiv preprint arXiv:2605.07278, 2026. URL https://arxiv.org/abs/2605.07278

[9] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026. URL https://arxiv.org/abs/2603.19312

[10] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Stable World Model: A data-driven benchmark and model for offline goal-conditioned reinforcement learning. arXiv preprint arXiv:2602.08968, 2026. URL https://arxiv.org/abs/2602.08968

[11] Vivek Myers, Chongyi Zheng, Anca Dragan, Sergey Levine, and Benjamin Eysenbach. Learning temporal distances: Contrastive successor features can provide a metric structure for decision-making. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 37076–37096. PMLR, 2024. URL ht...

[12] Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, and Sergey Levine. Offline goal-conditioned reinforcement learning with quasimetric representations. arXiv preprint arXiv:2509.20478, 2025. URL https://arxiv.org/abs/2509.20478

[13] Zhifeng Qian, Mingyu You, Hongjun Zhou, Xuanhui Xu, and Bin He. Goal-conditioned reinforcement learning with disentanglement-based reachability planning. arXiv preprint arXiv:2307.10846, 2023. URL https://arxiv.org/abs/2307.10846

[14] Reuven Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999

[15] Erik Scheurer, Nikola Jovanovic, Lindsay Miller, Alexander Kalb, David Rolnick, and Randall Balestriero. LEPA: Learning geometric equivariance in satellite remote sensing data with a predictive architecture. arXiv preprint arXiv:2603.07246, 2026. URL https://arxiv.org/abs/2603.07246

[16] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosem... (2024)

[17] Lunjun Zhang, Ge Yang, and Bradly C. Stadie. World model as a graph: Learning latent landmarks for planning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12611–12620. PMLR, 2021. URL https://proceedings.mlr.press/v139/zhang21x.html

[18] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024. URL https://arxiv.org/abs/2411.04983
