pith. sign in

arxiv: 2605.09335 · v1 · submitted 2026-05-10 · 💻 cs.LG

Functional Graphs for Predicting and Explaining Goal Failure in Sparse Goal-Conditioned RL

Pith reviewed 2026-05-12 04:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords goal-conditioned reinforcement learningsparse rewardsfunctional graphsgoal failure diagnosticslocal goal supportattractors and basinsgridworld environments
0
0 comments X

The pith

A one-step local support measure on greedy policy graphs predicts most goal failures in sparse goal-conditioned RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that goal-conditioned policies trained with sparse rewards can be turned into deterministic functional graphs by always taking the greedy action from each state. Inside these graphs, local goal support counts the fraction of a state's valid neighbors whose successor is the goal itself. This cheap statistic, when below 0.5, flags low-success goals with precision 0.921, recall 0.929, and F1 0.925 in the main 8x8 setting. When local support is present yet the goal still fails, the paper classifies the full graph into four types that reveal whether distant states are pulled into competing attractors or trapped in fragmented basins. The result is a post-training diagnostic that explains why aggregate success rates conceal per-goal problems.

Core claim

Trained goal-conditioned value policies induce deterministic functional graphs under greedy evaluation that decompose behavior into attractors and basins. Local goal support, defined as the one-step fraction of valid neighbors whose greedy successor is the goal, is exactly zero for unreachable goals and, when at or below 0.5, identifies low-success goals with precision 0.921, recall 0.929, and F1 0.925 in 8x8 TD learning. A compact taxonomy of goal-dominant, competitor-dominated, partial/contested, and fragmented graphs accounts for the residual failures where local support exists but global success does not.

What carries the argument

The deterministic functional graph induced by greedy policy evaluation, which maps every state to a single successor, together with the local goal support statistic that counts how many neighboring states point directly to the goal.

If this is right

  • Low-LGS goals can be identified after training without running many evaluation episodes per goal.
  • Residual failures can be attributed to specific global structures such as competitor attractors or basin fragmentation.
  • The same threshold and taxonomy apply across TD and other update rules, different curricula, larger grids, and bottleneck geometries.
  • Aggregate success rates can be decomposed into per-goal contributions using the local and global graph properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures could be modified to increase local goal support directly, for example by adding an auxiliary loss that encourages neighbors to map to the intended goal.
  • The graph taxonomy might be used to design automatic relabeling or goal-filtering steps inside curricula that drop or repair competitor-dominated or fragmented goals.
  • Similar successor-graph analysis could be applied to hierarchical or skill-based RL to detect when low-level skills fail to compose into higher-level goals.

Load-bearing premise

The environments are deterministic and sparse so that greedy evaluation produces clean functional graphs without stochastic transitions or continuous-state noise.

What would settle it

Applying the identical LGS <= 0.5 rule in a version of the gridworld where each action has a nonzero probability of moving in a random direction and checking whether the precision, recall, and F1 scores remain above 0.9.

Figures

Figures reproduced from arXiv: 2605.09335 by Shalley Dash.

Figure 1
Figure 1. Figure 1: Policy-induced functional graphs from one [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Policy maps in the 8×8 bottleneck environment under TD learning. The red star marks the goal, orange markers indicate non-goal attractors, blue arrows show greedy policy flow, and brown arrows indicate wall-hit transitions. Text beneath each panel reports S: finite-horizon success, Sup: local goal support count, C: largest competing-basin fraction, and F: fragmentation index [PITH_FULL_IMAGE:figures/full_… view at source ↗
Figure 3
Figure 3. Figure 3: Success rates as a function of goal distance (mean [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Additional policy maps under TD learning illustrating variability within the same learning [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Policy maps under Monte Carlo (MC) learning for three representative goals (high, medium, [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Policy maps under TD learning with edge-biased curriculum sampling. Despite preferential [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Policy maps for neighboring goals in a 12 × 12 environment under TD learning. The left goal exhibits a coherent basin of attraction with full local support and achieves perfect success. The middle goal shows partial success due to a competing attractor that captures a large portion of the state space. The right goal exhibits fragmented dynamics with multiple competing attractors and weak local support, lea… view at source ↗
Figure 8
Figure 8. Figure 8: Policy maps for neighboring goals under Monte Carlo (MC) learning in a [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional policy maps in the 8×8 bottleneck environment under TD learning. The red star marks the goal, orange markers indicate non-goal attractors, blue arrows show greedy policy flow, and brown arrows indicate wall-hit transitions. Text below each panel reports S: finite-horizon success, Sup: local goal support count, C: largest competing-basin fraction, and F: fragmentation index. The examples show tha… view at source ↗
Figure 10
Figure 10. Figure 10: LGS stratification curves across experimental settings. Each panel plots mean finite [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
read the original abstract

Sparse goal-conditioned reinforcement learning can produce policies whose failures are hidden by aggregate success rates. We analyze trained goal-conditioned value policies through the deterministic functional graphs induced by greedy evaluation: for each goal, every state maps to a single successor, decomposing behavior into attractors and basins. This reveals a local-to-global structure in learned policies. We define local goal support (LGS), a one-step statistic measuring the fraction of valid neighboring states whose greedy successor is the goal. In deterministic sparse GridWorlds, zero LGS exactly precludes goal entry from non-goal starts. Empirically, weak LGS is a strong diagnostic of goal-level failure across update rules, curricula, larger grids, and bottleneck geometries: the fixed rule LGS <= 0.5 identifies low-success goals with precision 0.921, recall 0.929, and F1 0.925 in the main 8x8 TD setting, with similar performance across variants. However, local support is not sufficient for global success: some supported goals still fail because distant states are captured by competing attractors or fragmented basin structure. We therefore introduce a compact post-hoc taxonomy of policy-induced graphs -- goal-dominant, competitor-dominated, partial/contested, and fragmented -- to characterize residual failure modes beyond local support. These results show that sparse GCRL failures can be understood as structured policy-induced dynamics, and that local one-step policy structure provides a cheap post-training diagnostic for goal-level failure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes trained goal-conditioned policies in sparse RL by constructing deterministic functional graphs from greedy action selection. It introduces Local Goal Support (LGS) as a one-step local statistic and shows that a threshold on LGS can predict goal-level success or failure with high precision and recall in deterministic GridWorld environments. Additionally, it proposes a taxonomy of policy graph structures to explain cases where local support does not guarantee global success.

Significance. If the empirical findings hold, the approach provides an efficient, post-training diagnostic tool for identifying problematic goals in goal-conditioned RL without requiring further environment interactions. This could be valuable for practitioners working with sparse rewards, as it turns policy inspection into a graph analysis problem and offers interpretable categories for failure modes. The cross-variant consistency of the LGS diagnostic is a notable strength.

major comments (2)
  1. [Abstract] Abstract: The fixed rule LGS <= 0.5 is reported to achieve precision 0.921, recall 0.929, F1 0.925 in the 8x8 TD setting. However, this performance is tied to the deterministic setting where each state has exactly one greedy successor; the manuscript should include a discussion on how the diagnostic would need to be adapted for stochastic environments, as the current definition of LGS and the graph taxonomy would not apply directly.
  2. [Abstract] Abstract: The statement that zero LGS exactly precludes goal entry from non-goal starts is presented as a key property supporting the diagnostic. A formal proof or derivation of this impossibility result should be provided in the main text (with explicit assumptions) to allow verification of its scope and to confirm it is not circular with the graph construction.
minor comments (2)
  1. The abstract mentions similar performance across variants but does not specify the exact variants or provide quantitative comparison; a table summarizing F1 scores for all settings would improve clarity.
  2. Ensure consistent use of terminology for the graph taxonomy (e.g., goal-dominant, competitor-dominated, partial/contested, fragmented) and provide at least one illustrative example per category in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the recommendation for minor revision. The comments highlight important points on scope and rigor that we will address directly in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The fixed rule LGS <= 0.5 is reported to achieve precision 0.921, recall 0.929, F1 0.925 in the 8x8 TD setting. However, this performance is tied to the deterministic setting where each state has exactly one greedy successor; the manuscript should include a discussion on how the diagnostic would need to be adapted for stochastic environments, as the current definition of LGS and the graph taxonomy would not apply directly.

    Authors: We agree that the LGS diagnostic and the functional-graph taxonomy are defined for deterministic greedy policies. We will add a new paragraph in the Discussion section that explicitly states this scope limitation and sketches two directions for stochastic extension: (i) replacing the deterministic successor with an expected-support measure over the policy distribution, and (ii) constructing sampled functional graphs via Monte-Carlo rollouts. We will also note that the exact zero-LGS impossibility result does not carry over without additional assumptions on the support of the stochastic policy. revision: yes

  2. Referee: [Abstract] Abstract: The statement that zero LGS exactly precludes goal entry from non-goal starts is presented as a key property supporting the diagnostic. A formal proof or derivation of this impossibility result should be provided in the main text (with explicit assumptions) to allow verification of its scope and to confirm it is not circular with the graph construction.

    Authors: We accept the request for a formal derivation. The claim follows immediately from the definition of LGS and the construction of the deterministic functional graph: if LGS(g) = 0 then no non-goal state s has g as its unique greedy successor, so the goal set has in-degree zero from outside itself. We will insert a short, self-contained lemma (with the explicit assumptions of determinism and greedy action selection) immediately after the definition of LGS in Section 3, making the argument non-circular. revision: yes

Circularity Check

0 steps flagged

No circularity: LGS is an independent one-step statistic tested against rollout success

full rationale

The paper defines local goal support (LGS) as the fraction of valid neighboring states whose greedy successor equals the goal, directly from the deterministic functional graph. It proves that LGS=0 precludes entry in deterministic sparse GridWorlds and reports empirical F1=0.925 for the fixed threshold LGS<=0.5 against independently measured goal success rates obtained via policy rollouts. Success rates are not used to fit any parameter or define LGS; the taxonomy of graph structures is a post-hoc descriptive classification. No self-citations, fitted inputs, or self-definitional reductions appear in the derivation. The central claim is an empirical correlation between a local graph property and global outcome, which does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on deterministic dynamics and greedy evaluation in sparse grids; LGS and taxonomy are introduced without external validation.

axioms (1)
  • domain assumption Environments are deterministic sparse GridWorlds allowing exact functional graph construction from greedy policy.
    Stated as the setting for all empirical results and the zero-LGS preclusion claim.
invented entities (2)
  • Local Goal Support (LGS) no independent evidence
    purpose: One-step statistic measuring direct goal entry from neighbors
    Newly defined in the paper as the core diagnostic.
  • Policy-induced graph taxonomy (goal-dominant, competitor-dominated, partial/contested, fragmented) no independent evidence
    purpose: Classify residual failure modes beyond local support
    Post-hoc categories introduced to explain cases where LGS is high but success is low.

pith-pipeline@v0.9.0 · 5557 in / 1355 out tokens · 37065 ms · 2026-05-12T04:28:42.765349+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We analyze trained goal-conditioned value policies through the deterministic functional graphs induced by greedy evaluation: for each goal, every state maps to a single successor, decomposing behavior into attractors and basins. ... We define local goal support (LGS), a one-step statistic measuring the fraction of valid neighboring states whose greedy successor is the goal.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Proposition 2 (Zero local support precludes goal entry). ... If fg(s) ≠ g ∀ s ∈ N(g), then no trajectory starting from any s0 ≠ g can reach g.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

  2. [2]

    TLDR: Unsupervised goal-conditioned RL via temporal distance-aware representations

    Junik Bae, Kwanyoung Park, and Youngwoon Lee. TLDR: Unsupervised goal-conditioned RL via temporal distance-aware representations. InConference on Robot Learning, 2024

  3. [3]

    Zico Kolter, and Vladlen Koltun

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. InAdvances in Neural Information Processing Systems, 2019

  4. [4]

    Hunt, Tom Schaul, Hado van Hasselt, and David Silver

    Andre Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

  5. [5]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. InAdvances in Neural Information Processing Systems, 2018

  6. [6]

    Improving generalization for temporal difference learning: The successor repre- sentation.Neural Computation, 5(4):613–624, 1993

    Peter Dayan. Improving generalization for temporal difference learning: The successor repre- sentation.Neural Computation, 5(4):613–624, 1993

  7. [7]

    Elaydi.An Introduction to Difference Equations

    Saber N. Elaydi.An Introduction to Difference Equations. Springer, 3 edition, 2005

  8. [8]

    Contrastive learning as goal-conditioned reinforcement learning

    Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive learning as goal-conditioned reinforcement learning. InAdvances in Neural Information Processing Systems, volume 35, 2022

  9. [9]

    Graph neural networks and reinforcement learning: A survey

    Fatemeh Fathinezhad, Peyman Adibi, Babak Shoushtarian, and Jocelyn Chanussot. Graph neural networks and reinforcement learning: A survey. InGraph Neural Networks. IntechOpen, 2023

  10. [10]

    Cambridge University Press, Cambridge, 2009.doi:10.1017/CBO9780511801655

    Philippe Flajolet and Robert Sedgewick.Analytic Combinatorics. Cambridge University Press, Cambridge, 2009. doi: 10.1017/CBO9780511801655

  11. [11]

    Automatic goal generation for reinforcement learning agents

    Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1515–1528. PMLR, 2018

  12. [12]

    Learning to reach goals via iterated supervised learning

    Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning. InInternational Conference on Learning Representations, 2021

  13. [13]

    Vittorio Giammarino, Ruiqi Ni, and Ahmed H. Qureshi. Physics-informed value learner for offline goal-conditioned reinforcement learning. InAdvances in Neural Information Processing Systems, 2025

  14. [14]

    Addison-Wesley, 1969

    Frank Harary.Graph Theory. Addison-Wesley, 1969

  15. [15]

    Offline goal- conditioned reinforcement learning with quasimetric representations

    Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, and Sergey Levine. Offline goal- conditioned reinforcement learning with quasimetric representations. InAdvances in Neural Information Processing Systems, 2025

  16. [16]

    Au- tomatic curriculum learning for deep rl: A short survey

    Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. Au- tomatic curriculum learning for deep rl: A short survey. InProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), pages 4819–4825, 2020. doi: 10.24963/ijcai.2020/671

  17. [17]

    Universal value function approxima- tors

    Tom Schaul, Dan Horgan, Karol Gregor, and David Silver. Universal value function approxima- tors. InInternational Conference on Machine Learning (ICML), 2015

  18. [18]

    NerveNet: Learning structured policy with graph neural networks

    Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. InInternational Conference on Learning Representations, 2018. 10

  19. [19]

    Louis-Pascal A. C. Xhonneux, Meng Qu, and Jian Tang. Continuous graph neural networks. InProceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10432–10441, 2020

  20. [20]

    Rethinking goal-conditioned supervised learning and its connection to offline RL

    Rui Yang, Yiming Lu, Wenzhe Li, Hao Sun, Meng Fang, Yali Du, Xiu Li, Lei Han, and Chongjie Zhang. Rethinking goal-conditioned supervised learning and its connection to offline RL. In International Conference on Learning Representations, 2022

  21. [21]

    Reichert, Timothy P

    Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Pe- ter W. Battaglia. Deep reinforcement learning with relational inductive biases. InInternational Co...

  22. [22]

    IfLGS(g) = 0, thenSucc H (g) = 0

  23. [23]

    IfLGS(g)>0, thenSucc H (g)>0forH≥1, but this does not imply high success

  24. [24]

    16 Proof.The first statement follows from Proposition 2: any trajectory reaching g from a non-goal start must enter g from some valid neighbor s∈N(g)

    EvenLGS(g) = 1does not guaranteeSucc H (g) = 1. 16 Proof.The first statement follows from Proposition 2: any trajectory reaching g from a non-goal start must enter g from some valid neighbor s∈N(g) . If no neighbor maps to g, such an entry transition is impossible. For the second statement, if LGS(g)>0 , then there exists at least one neighbor s∈N(g) such...