Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

Shiheng Zhang

arxiv: 2605.19185 · v1 · pith:ARGN3ZROnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

Shiheng Zhang This is my paper

Pith reviewed 2026-05-20 11:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords graph-PDEDirichlet extensionaction-gap certificateAMLEharmonic extensiongoal-conditioned planningplanner-admissiblesparse labels

0 comments

The pith

If the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse goal-conditioned planning with few cost-to-go labels is treated as a graph-PDE Dirichlet extension problem on a goal-dependent boundary. The paper proves a local action-gap certificate: under the operational argmin-Q planner, if the surrogate value error along the rollout stays below half the true action gap, the greedy rollout reaches the goal. AMLE, the p equals infinity endpoint of the graph p-Laplacian family, satisfies the certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path order. On 120 AntMaze layout-derived graphs, AMLE attains 0.970 aggregate success while harmonic extension reaches only 0.584, with high-p methods also performing well.

Core claim

Under the operational argmin-Q planner, graph value extensions are planner-admissible when they obey a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE) instantiates this certificate through a comparison-principle fill-distance bound that respects the graph geometry. Harmonic extension fails to guarantee the same because its values encode boundary hitting probabilities instead of the shortest-path greedy ordering required by the planner.

What carries the argument

The local action-gap certificate that links controlled surrogate value error to guaranteed rollout success, instantiated by AMLE through its comparison-principle fill-distance bound.

If this is right

AMLE achieves 0.970 aggregate rollout success on the 120 AntMaze layout-derived graph configurations.
Harmonic extension reaches only 0.584 success because many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry.
Finite high-p methods reach success rates of 0.903 for p=4 and 0.973 for p=8.
AMLE corrects most harmonic inversions on the rollout-weighted decision scope.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The certificate may extend to guide value function choice in other sparse-label planning domains that use similar argmin-style greedy policies.
Graph geometries that favor infinity-norm extensions over probability-based ones could appear in related shortest-path or navigation tasks.
High-p approximations provide a tunable middle ground that preserves much of the AMLE guarantee while remaining computationally feasible.

Load-bearing premise

The operational argmin-Q planner is assumed and the value extension must satisfy the local action-gap condition for the given graph geometry.

What would settle it

A single greedy rollout in one of the 120 AntMaze graphs where the surrogate value error stays below half the true action gap at every step yet the path fails to reach the goal.

Figures

Figures reproduced from arXiv: 2605.19185 by Shiheng Zhang.

**Figure 1.** Figure 1: Seven-node mechanism witness. At s = 4, the true shortest-path action and AMLE choose neighbour 3, while harmonic chooses neighbour 1. The right panel gives the local inequalities that determine the choices. This is a positive-margin mechanism witness, not a universal-dominance claim. Example 3.12 (Harmonic-measure mismatch on a 7-node graph). On the seven-node graph G7 defined in Appendix B (goal g = 0, s… view at source ↗

read the original abstract

Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result is a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE), the p=infinity endpoint of the graph p-Laplacian family, instantiates this certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order. On 120 AntMaze layout-derived graph configurations, harmonic extension achieves 0.584 aggregate rollout success, while AMLE reaches 0.970. Finite high-p methods also enter a high-success regime, with success 0.903 for p=4, 0.973 for p=8, and 0.982 for a fixed-budget p=16 solver, though the p=16 row is not used as a converged endpoint ranking due to incomplete solver certification. Mechanism audits show that many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry, and that AMLE corrects most harmonic inversions on the rollout-weighted decision scope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a local action-gap certificate for planner-admissible graph value extensions and shows AMLE beating harmonic extension on AntMaze graphs.

read the letter

This paper's main contribution is a local action-gap certificate showing that if the surrogate value error along a greedy rollout stays below half the true action gap, then the rollout reaches the goal under the argmin-Q planner. AMLE achieves this through a comparison-principle fill-distance bound, unlike harmonic extension which can produce action inversions. What stands out as new is the explicit tie between the certificate and planner admissibility for graph-PDE extensions, along with the direct comparison of AMLE and harmonic methods in this context. The work does well by testing on 120 AntMaze layout-derived graphs and reporting clear differences in rollout success: 0.970 for AMLE versus 0.584 for harmonic. The mechanism audits on how AMLE corrects harmonic inversions in rollout-weighted decisions provide concrete insight into where the methods diverge. Soft spots are present but not central. The experiments show high success for AMLE and improving results with higher p, but they do not include a direct audit confirming that the fill-distance bound keeps the error below half the action gap at visited states. This leaves open whether the certificate is the actual driver or if other factors like better Lipschitz behavior explain the gains. The note on incomplete solver certification for p=16 is a small detail. Overall the central argument holds up on the reported evidence. This paper suits researchers focused on goal-conditioned reinforcement learning and planning with sparse labels on graphs. Readers interested in value function extensions that support reliable greedy planning would get practical value from the certificate and the empirical results. It has a solid enough mix of theory and experiments to merit a serious referee. I recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper frames sparse goal-conditioned planning as a graph-PDE Dirichlet extension problem on graphs derived from AntMaze layouts. It proves a local action-gap certificate: if the surrogate value error along a greedy argmin-Q rollout stays below half the true action gap, then the rollout reaches the goal. AMLE (the p=∞ case) is shown to satisfy the certificate via a comparison-principle fill-distance bound, while harmonic extension can produce action inversions because its values encode hitting probabilities rather than shortest-path order. Experiments on 120 graph configurations report aggregate rollout success of 0.970 for AMLE versus 0.584 for harmonic, with additional high-p results (0.903 at p=4, 0.973 at p=8) and mechanism audits showing AMLE corrects most harmonic inversions on rollout-weighted decisions.

Significance. If the certificate holds and AMLE's bound is verified to enforce the error < gap/2 condition on the tested geometries, the work supplies a principled criterion for selecting planner-admissible value extensions under sparse labels. This could guide design of reliable goal-conditioned planners and explain performance gaps between Lipschitz-style and probabilistic extensions. The mechanism audits add diagnostic value beyond aggregate success rates.

major comments (2)

[Experimental results] Experimental results section: the reported 0.970 success rate and inversion-correction audits do not include a direct measurement confirming that the AMLE comparison-principle fill-distance bound keeps surrogate error below half the local action gap at states visited by the rollouts. Without this check the link between the stated certificate and the empirical advantage remains indirect; the advantage could arise from other properties such as global Lipschitz behavior.
[Theoretical results] Main theoretical result (local action-gap certificate): while the certificate is formulated in terms of the true action gap and surrogate error, the manuscript does not explicitly verify that the fill-distance bound applies to the specific metric and label-placement geometry of the AntMaze-derived graphs used in the 120 configurations.

minor comments (2)

[Abstract and experimental results] The aggregation method for the 0.970 and 0.584 success rates across the 120 configurations is not stated (e.g., mean per graph, fraction of successful graphs).
[Preliminaries] Notation for the operational argmin-Q planner and the precise definition of 'surrogate value error along the rollout' would benefit from an explicit equation reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and for recognizing the potential of the local action-gap certificate. We address each major comment below, clarifying the manuscript's contributions while proposing targeted revisions to make the theory-experiment link more direct.

read point-by-point responses

Referee: [Experimental results] Experimental results section: the reported 0.970 success rate and inversion-correction audits do not include a direct measurement confirming that the AMLE comparison-principle fill-distance bound keeps surrogate error below half the local action gap at states visited by the rollouts. Without this check the link between the stated certificate and the empirical advantage remains indirect; the advantage could arise from other properties such as global Lipschitz behavior.

Authors: We agree that an explicit empirical check of the error < gap/2 condition along the actual rollouts would strengthen the argument and help isolate the certificate's role from other properties such as global Lipschitz behavior. In the revised manuscript we will add a new audit subsection that, for the 120 configurations, reports the maximum surrogate error versus half the local action gap at every state visited by the AMLE rollouts, thereby directly confirming that the comparison-principle fill-distance bound enforces the certificate in the tested geometries. revision: yes
Referee: [Theoretical results] Main theoretical result (local action-gap certificate): while the certificate is formulated in terms of the true action gap and surrogate error, the manuscript does not explicitly verify that the fill-distance bound applies to the specific metric and label-placement geometry of the AntMaze-derived graphs used in the 120 configurations.

Authors: The comparison-principle fill-distance bound is stated for arbitrary finite graphs equipped with shortest-path metrics and arbitrary boundary label sets. The AntMaze-derived graphs are constructed exactly under these conditions (undirected graphs with geodesic distance and goal vertices as the Dirichlet boundary). To make the applicability explicit we will insert a short verification paragraph in the theoretical section that confirms the AntMaze graphs satisfy the metric and boundary hypotheses of the bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central result is a local action-gap certificate stated directly in terms of surrogate value error versus the true action gap under the argmin-Q planner; this is presented as a first-principles implication rather than a fitted or self-referential quantity. AMLE is shown to satisfy the certificate via an independent comparison-principle fill-distance bound on the graph geometry, which is a property of the extension operator and does not reduce to the certificate by the paper's own equations. Empirical success rates on AntMaze graphs and mechanism audits serve as validation, not as inputs that define the bound or certificate. No load-bearing step collapses to a self-definition, renamed known result, or self-citation chain; the derivation remains self-contained against external graph-PDE and planning benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper treats the planning task as a standard graph-PDE Dirichlet problem and introduces the admissibility certificate without introducing new free parameters or invented entities.

axioms (1)

domain assumption Sparse goal-conditioned planning can be viewed as a graph-PDE Dirichlet extension problem.
Explicitly stated as the modeling choice in the opening sentence of the abstract.

pith-pipeline@v0.9.0 · 5786 in / 1309 out tokens · 62911 ms · 2026-05-20T11:35:01.013397+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

[1]

Deeksha Adil, Richard Peng, and Sushant Sachdeva

doi: 10.1007/s00030-012-0158-1. Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent IRLS algorithm forp-norm linear regression. InAdvances in Neural Information Processing Systems 32, pp. 14166–14177,

work page doi:10.1007/s00030-012-0158-1
[2]

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba

doi: 10.1145/3686794. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. InAdvances in neural information processing systems,

work page doi:10.1145/3686794
[3]

Gunnar Aronsson

doi: 10.1561/2200000101. Gunnar Aronsson. Extension of functions satisfying Lipschitz conditions.Arkiv för Matematik, 6(6):551–561,

work page doi:10.1561/2200000101
[4]

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani

doi: 10.1162/089976603321780317. Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples.Journal of Machine Learning Research, 7:2399–2434,

work page doi:10.1162/089976603321780317
[5]

Jeff Calder

doi: 10.1093/imanum/drac048. Jeff Calder. The game-theoreticp-Laplacian and semi-supervised learning with few labels.Nonlinearity, 32 (1):301–330,

work page doi:10.1093/imanum/drac048
[6]

Jeff Calder and Dejan Slepčev

doi: 10.1137/18M1199241. Jeff Calder and Dejan Slepčev. Properly-weighted graph Laplacian for semi-supervised learning.Applied Mathematics & Optimization, 82(3):1111–1159,

work page doi:10.1137/18m1199241
[7]

Jeff Calder, Dejan Slepčev, and Matthew Thorpe

doi: 10.1007/s00245-019-09637-3. Jeff Calder, Dejan Slepčev, and Matthew Thorpe. Rates of convergence for Laplacian semi-supervised learning with low labeling rates.Research in the Mathematical Sciences, 10(1),

work page doi:10.1007/s00245-019-09637-3
[8]

Affine representations of fractional processes with applica- tions in mathematical finance.Stochastic Process

doi: 10.1016/j. jmaa.2013.05.015. Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat: a new approach to computing distance based on heat flow.ACM Transactions on Graphics, 32(5):1–11,

work page doi:10.1016/j 2013
[9]

doi: 10.1145/2516971. 2516977. Peter Dayan. Improving generalization for temporal difference learning: the successor representation.Neural Computation, 5(4):613–624,

work page doi:10.1145/2516971
[10]

doi: 10.1162/ neco.1993.5.4.613

doi: 10.1162/neco.1993.5.4.613. Ahmed El Alaoui, Xiuyuan Cheng, Aaditya Ramdas, Martin J. Wainwright, and Michael I. Jordan. Asymptotic behavior ofℓp-based Laplacian regularization in semi-supervised learning. InProceedings of the 29th Annual Conference on Learning Theory, volume 49 ofProceedings of Machine Learning Research, pp. 879–906,

work page doi:10.1162/neco.1993.5.4.613 1993
[11]

Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine

doi: 10.1017/S0956792517000122. Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: bridging planning and reinforcement learning. InAdvances in Neural Information Processing Systems 32,

work page doi:10.1017/s0956792517000122
[12]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. InarXiv preprint arXiv:2004.07219,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[13]

Robert Jensen

doi: 10.1137/19M1245372. Robert Jensen. Uniqueness of Lipschitz extensions: minimizing the sup norm of the gradient.Archive for Rational Mechanics and Analysis, 123(1):51–74,

work page doi:10.1137/19m1245372
[14]

Petri Juutinen and Nageswari Shanmugalingam

doi: 10.1007/BF00386368. Petri Juutinen and Nageswari Shanmugalingam. Equivalence of AMLE, strong AMLE, and comparison with cones in metric measure spaces.Mathematische Nachrichten, 279(9–10):1083–1098,

work page doi:10.1007/bf00386368
[15]

Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman

doi: 10.1002/mana.200510411. Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman. Algorithms for Lipschitz learning on graphs. InProceedings of the 28th Annual Conference on Learning Theory, volume 40 ofPMLR, pp. 1190–1223,

work page doi:10.1002/mana.200510411
[16]

Rémi Munos

doi: 10.1051/cocv/2010046. Rémi Munos. Error bounds for approximate policy iteration. InProceedings of the 20th International Conference on Machine Learning, pp. 560–567,

work page doi:10.1051/cocv/2010046
[17]

Boaz Nadler, Nathan Srebro, and Xueyuan Zhou

doi: 10.1137/040614384. Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data. InAdvances in Neural Information Processing Systems 22, pp. 1330–1338,

work page doi:10.1137/040614384
[18]

, date-added =

doi: 10.1090/S0894-0347-08-00606-1. Vitchyr H. Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: model-free deep RL for model-based control. InInternational Conference on Learning Representations,

work page doi:10.1090/s0894-0347-08-00606-1
[19]

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver

doi: 10.1007/s10208-022-09557-9. Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning,

work page doi:10.1007/s10208-022-09557-9
[20]

doi: 10.1007/s00526-012-0498-z. James A. Sethian. A fast marching level set method for monotonically advancing fronts.Proceedings of the National Academy of Sciences, 93(4):1591–1595,

work page doi:10.1007/s00526-012-0498-z
[21]

Scott Sheffield and Charles K Smart

doi: 10.1073/pnas.93.4.1591. Scott Sheffield and Charles K Smart. Vector-valued optimal Lipschitz extensions.Communications on Pure and Applied Mathematics, 65(1):128–154,

work page doi:10.1073/pnas.93.4.1591
[22]

Satinder P

doi: 10.1002/cpa.20391. Satinder P. Singh and Richard C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233,

work page doi:10.1002/cpa.20391
[23]

Dejan Slepčev and Matthew Thorpe

doi: 10.1023/A:1022693225949. Dejan Slepčev and Matthew Thorpe. Analysis ofp-Laplacian regularization in semi-supervised learning. SIAM Journal on Mathematical Analysis, 51(3):2085–2120,

work page doi:10.1023/a:1022693225949 2085
[24]

Alexander J

doi: 10.1137/17M115222X. Alexander J. Smola and Risi Kondor. Kernels and regularization on graphs. InLearning Theory and Kernel Machines, volume 2777 ofLecture Notes in Computer Science, pp. 144–158

work page doi:10.1137/17m115222x
[25]

The hippocampus as a predictive map.Nat Neurosci, 20(11):1643–1653, November 2017

doi: 10.1038/nn.4650. Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems 29,

work page doi:10.1038/nn.4650
[26]

Vladimir Vovk, Alex Gammerman, and Glenn Shafer.Algorithmic learning in a random world

doi: 10.1109/9.412624. Vladimir Vovk, Alex Gammerman, and Glenn Shafer.Algorithmic learning in a random world. Springer,

work page doi:10.1109/9.412624
[27]

Ronald J

doi: 10.1007/b106715. Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, College of Computer Science,

work page doi:10.1007/b106715
[28]

Nearest-label uses the same sparse boundary as AMLE; oracle Dijkstra uses full-graph shortest-path information

Table 9: Reference baselines aggregated by resolutionr. Nearest-label uses the same sparse boundary as AMLE; oracle Dijkstra uses full-graph shortest-path information. rAMLE(sparse, main rollout) nearest-label oracle Dijkstra 4 0.993 0.038±0.011 1.000±0.000 8 0.971 0.031±0.009 1.000±0.000 12 0.946 0.027±0.009 1.000±0.000 The sparse-labelAMLE-vs-oracle gap...

work page 2000
[29]

21 Case p = ∞(AMLE).Strict minimality gives miny∼xu(y) > u(x)and maxy∼xu(y) > u(x), so the midrange satisfies 1 2(minyu(y) + maxyu(y))>u (x), contradicting theAMLEidentity u(x) = 1 2(minyu(y) + maxyu(y)). Statement and proof of Proposition B.2 (combined local separation) Proposition B.2(CombinedAMLE-vs-harmonic local separation).Fix a decision state s wit...

work page 2015

[1] [1]

Deeksha Adil, Richard Peng, and Sushant Sachdeva

doi: 10.1007/s00030-012-0158-1. Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent IRLS algorithm forp-norm linear regression. InAdvances in Neural Information Processing Systems 32, pp. 14166–14177,

work page doi:10.1007/s00030-012-0158-1

[2] [2]

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba

doi: 10.1145/3686794. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. InAdvances in neural information processing systems,

work page doi:10.1145/3686794

[3] [3]

Gunnar Aronsson

doi: 10.1561/2200000101. Gunnar Aronsson. Extension of functions satisfying Lipschitz conditions.Arkiv för Matematik, 6(6):551–561,

work page doi:10.1561/2200000101

[4] [4]

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani

doi: 10.1162/089976603321780317. Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples.Journal of Machine Learning Research, 7:2399–2434,

work page doi:10.1162/089976603321780317

[5] [5]

Jeff Calder

doi: 10.1093/imanum/drac048. Jeff Calder. The game-theoreticp-Laplacian and semi-supervised learning with few labels.Nonlinearity, 32 (1):301–330,

work page doi:10.1093/imanum/drac048

[6] [6]

Jeff Calder and Dejan Slepčev

doi: 10.1137/18M1199241. Jeff Calder and Dejan Slepčev. Properly-weighted graph Laplacian for semi-supervised learning.Applied Mathematics & Optimization, 82(3):1111–1159,

work page doi:10.1137/18m1199241

[7] [7]

Jeff Calder, Dejan Slepčev, and Matthew Thorpe

doi: 10.1007/s00245-019-09637-3. Jeff Calder, Dejan Slepčev, and Matthew Thorpe. Rates of convergence for Laplacian semi-supervised learning with low labeling rates.Research in the Mathematical Sciences, 10(1),

work page doi:10.1007/s00245-019-09637-3

[8] [8]

Affine representations of fractional processes with applica- tions in mathematical finance.Stochastic Process

doi: 10.1016/j. jmaa.2013.05.015. Keenan Crane, Clarisse Weischedel, and Max Wardetzky. Geodesics in heat: a new approach to computing distance based on heat flow.ACM Transactions on Graphics, 32(5):1–11,

work page doi:10.1016/j 2013

[9] [9]

doi: 10.1145/2516971. 2516977. Peter Dayan. Improving generalization for temporal difference learning: the successor representation.Neural Computation, 5(4):613–624,

work page doi:10.1145/2516971

[10] [10]

doi: 10.1162/ neco.1993.5.4.613

doi: 10.1162/neco.1993.5.4.613. Ahmed El Alaoui, Xiuyuan Cheng, Aaditya Ramdas, Martin J. Wainwright, and Michael I. Jordan. Asymptotic behavior ofℓp-based Laplacian regularization in semi-supervised learning. InProceedings of the 29th Annual Conference on Learning Theory, volume 49 ofProceedings of Machine Learning Research, pp. 879–906,

work page doi:10.1162/neco.1993.5.4.613 1993

[11] [11]

Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine

doi: 10.1017/S0956792517000122. Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: bridging planning and reinforcement learning. InAdvances in Neural Information Processing Systems 32,

work page doi:10.1017/s0956792517000122

[12] [12]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. InarXiv preprint arXiv:2004.07219,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[13] [13]

Robert Jensen

doi: 10.1137/19M1245372. Robert Jensen. Uniqueness of Lipschitz extensions: minimizing the sup norm of the gradient.Archive for Rational Mechanics and Analysis, 123(1):51–74,

work page doi:10.1137/19m1245372

[14] [14]

Petri Juutinen and Nageswari Shanmugalingam

doi: 10.1007/BF00386368. Petri Juutinen and Nageswari Shanmugalingam. Equivalence of AMLE, strong AMLE, and comparison with cones in metric measure spaces.Mathematische Nachrichten, 279(9–10):1083–1098,

work page doi:10.1007/bf00386368

[15] [15]

Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman

doi: 10.1002/mana.200510411. Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman. Algorithms for Lipschitz learning on graphs. InProceedings of the 28th Annual Conference on Learning Theory, volume 40 ofPMLR, pp. 1190–1223,

work page doi:10.1002/mana.200510411

[16] [16]

Rémi Munos

doi: 10.1051/cocv/2010046. Rémi Munos. Error bounds for approximate policy iteration. InProceedings of the 20th International Conference on Machine Learning, pp. 560–567,

work page doi:10.1051/cocv/2010046

[17] [17]

Boaz Nadler, Nathan Srebro, and Xueyuan Zhou

doi: 10.1137/040614384. Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data. InAdvances in Neural Information Processing Systems 22, pp. 1330–1338,

work page doi:10.1137/040614384

[18] [18]

, date-added =

doi: 10.1090/S0894-0347-08-00606-1. Vitchyr H. Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: model-free deep RL for model-based control. InInternational Conference on Learning Representations,

work page doi:10.1090/s0894-0347-08-00606-1

[19] [19]

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver

doi: 10.1007/s10208-022-09557-9. Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning,

work page doi:10.1007/s10208-022-09557-9

[20] [20]

doi: 10.1007/s00526-012-0498-z. James A. Sethian. A fast marching level set method for monotonically advancing fronts.Proceedings of the National Academy of Sciences, 93(4):1591–1595,

work page doi:10.1007/s00526-012-0498-z

[21] [21]

Scott Sheffield and Charles K Smart

doi: 10.1073/pnas.93.4.1591. Scott Sheffield and Charles K Smart. Vector-valued optimal Lipschitz extensions.Communications on Pure and Applied Mathematics, 65(1):128–154,

work page doi:10.1073/pnas.93.4.1591

[22] [22]

Satinder P

doi: 10.1002/cpa.20391. Satinder P. Singh and Richard C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233,

work page doi:10.1002/cpa.20391

[23] [23]

Dejan Slepčev and Matthew Thorpe

doi: 10.1023/A:1022693225949. Dejan Slepčev and Matthew Thorpe. Analysis ofp-Laplacian regularization in semi-supervised learning. SIAM Journal on Mathematical Analysis, 51(3):2085–2120,

work page doi:10.1023/a:1022693225949 2085

[24] [24]

Alexander J

doi: 10.1137/17M115222X. Alexander J. Smola and Risi Kondor. Kernels and regularization on graphs. InLearning Theory and Kernel Machines, volume 2777 ofLecture Notes in Computer Science, pp. 144–158

work page doi:10.1137/17m115222x

[25] [25]

The hippocampus as a predictive map.Nat Neurosci, 20(11):1643–1653, November 2017

doi: 10.1038/nn.4650. Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems 29,

work page doi:10.1038/nn.4650

[26] [26]

Vladimir Vovk, Alex Gammerman, and Glenn Shafer.Algorithmic learning in a random world

doi: 10.1109/9.412624. Vladimir Vovk, Alex Gammerman, and Glenn Shafer.Algorithmic learning in a random world. Springer,

work page doi:10.1109/9.412624

[27] [27]

Ronald J

doi: 10.1007/b106715. Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, College of Computer Science,

work page doi:10.1007/b106715

[28] [28]

Nearest-label uses the same sparse boundary as AMLE; oracle Dijkstra uses full-graph shortest-path information

Table 9: Reference baselines aggregated by resolutionr. Nearest-label uses the same sparse boundary as AMLE; oracle Dijkstra uses full-graph shortest-path information. rAMLE(sparse, main rollout) nearest-label oracle Dijkstra 4 0.993 0.038±0.011 1.000±0.000 8 0.971 0.031±0.009 1.000±0.000 12 0.946 0.027±0.009 1.000±0.000 The sparse-labelAMLE-vs-oracle gap...

work page 2000

[29] [29]

21 Case p = ∞(AMLE).Strict minimality gives miny∼xu(y) > u(x)and maxy∼xu(y) > u(x), so the midrange satisfies 1 2(minyu(y) + maxyu(y))>u (x), contradicting theAMLEidentity u(x) = 1 2(minyu(y) + maxyu(y)). Statement and proof of Proposition B.2 (combined local separation) Proposition B.2(CombinedAMLE-vs-harmonic local separation).Fix a decision state s wit...

work page 2015