On mechanisms for transfer using landmark value functions in multi-task lifelong reinforcement learning

Nick Denis

arxiv: 1907.00884 · v1 · pith:46UPRXGMnew · submitted 2019-07-01 · 💻 cs.LG · cs.AI · stat.ML


Pith reviewed 2026-05-25 11:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords landmark covering · transfer learning · multi-task reinforcement learning · goal-based RL · topological covering · value functions · action pruning · lifelong learning


The pith

Landmark coverings built from traversibility metrics enable three transfer mechanisms in goal-based multi-task RL and bound Q-values at each state-action pair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines two metrics on the state space that capture how readily an agent can move from one state to another. These metrics are used to select a set of landmark states that form a topological covering of the space in a fully self-supervised way. The resulting covering supports three concrete transfer mechanisms when the agent faces a new goal: extending the Landmark Options Via Reflection framework, treating landmark value functions as features for a greedy policy that performs near-oracle zero-shot transfer, and training a learned reward function that supplies denser signals. The same covering also supplies theoretical bounds on Q-values and automatically prunes actions that cannot belong to any optimal policy for the current goal. A reader would care because the approach reduces the samples needed to solve sequences of related tasks without external supervision.
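To make the covering idea concrete, here is a minimal sketch. It is not the paper's construction. It assumes an unweighted, undirected state graph and uses breadth-first hop count as a stand-in for the traversibility metrics, which the abstract does not define. The names `hop_distances` and `greedy_landmark_cover` and the `radius` parameter are illustrative only.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first hop counts from `source` over an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        state = queue.popleft()
        for nxt in adjacency[state]:
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


def greedy_landmark_cover(adjacency, radius):
    """Add landmarks until every state lies within `radius` hops of one."""
    landmarks = []
    uncovered = set(adjacency)
    while uncovered:
        landmark = min(uncovered)  # deterministic pick, an arbitrary assumption
        landmarks.append(landmark)
        reach = hop_distances(adjacency, landmark)
        # A landmark always covers itself, so the loop terminates.
        uncovered -= {s for s, d in reach.items() if d <= radius}
    return landmarks


# Toy example: a corridor of 7 states, each linked to its neighbours.
corridor = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
landmarks = greedy_landmark_cover(corridor, radius=1)
print(landmarks)  # [0, 2, 4, 6]
```

No task reward or label enters the selection, which is the sense in which such a covering is self-supervised. A larger `radius` yields fewer landmarks at the cost of a coarser covering.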

Core claim

We show that these landmark coverings confer theoretical advantages for transfer learning within the goal-based multi-task RL setting. Specifically, we demonstrate three mechanisms by which landmark coverings can be used for successful transfer learning. First, we extend the Landmark Options Via Reflection (LOVR) framework to this new topological covering; second, we use the landmark-centric value functions themselves as features and define a greedy zombie policy that achieves near oracle performance on a sequence of zero-shot transfer tasks; finally, motivated by the second transfer mechanism, we introduce a learned reward function that provides a more dense reward signal for goal-based RL.
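The second mechanism can be illustrated with a toy sketch. The abstract does not say how the zombie policy uses the landmark features, so the rule below is an assumption: each state is described by its vector of landmark values, and the agent greedily steps to the neighbour whose vector is closest to the goal's. The corridor environment, the value definition (negative hop count to each landmark), and all function names are hypothetical.

```python
def landmark_features(state, landmarks):
    """Assumed features: V_l(s) = -(hops from s to landmark l) on a corridor."""
    return [-abs(state - l) for l in landmarks]


def zombie_step(state, goal, landmarks, n_states):
    """Greedy move to the neighbour whose features best match the goal's."""
    goal_phi = landmark_features(goal, landmarks)

    def gap(s):
        phi = landmark_features(s, landmarks)
        return sum(abs(a - b) for a, b in zip(phi, goal_phi))

    candidates = [s for s in (state - 1, state, state + 1) if 0 <= s < n_states]
    return min(candidates, key=gap)


def rollout(start, goal, landmarks, n_states, max_steps=20):
    """Follow the greedy rule with no learning on the new goal (zero-shot)."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        path.append(zombie_step(path[-1], goal, landmarks, n_states))
    return path


path = rollout(start=1, goal=5, landmarks=[0, 6], n_states=7)
print(path)  # [1, 2, 3, 4, 5]
```

In this toy setting the choice of landmarks matters: with a single landmark at state 3, states 1 and 5 share the same feature vector, and the greedy rule never leaves state 1.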

What carries the argument

The topological landmark covering: it is built from two traversibility metrics on the state space in a self-supervised manner, supplies Q-value bounds, and enables pruning of infeasible actions.

If this is right

Extending LOVR with the new covering transfers options across tasks while preserving the original guarantees.
Treating landmark value functions as features yields a greedy policy that reaches near-oracle performance on zero-shot transfer tasks.
The learned dense reward derived from the covering improves sample efficiency for goal-based RL.
The Q-value bounds at each state-action pair allow systematic pruning of actions that cannot be optimal for the current goal.
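The pruning consequence in the last item can be sketched generically. The paper's actual bounds are not reproduced here, so the snippet assumes that some lower and upper bound on Q* is already available for each action at a given state and goal. The function name `prune_actions` and the numbers are invented for illustration.

```python
def prune_actions(q_lower, q_upper):
    """Keep only actions whose upper bound reaches the best lower bound.

    An action whose upper bound is strictly below another action's lower
    bound cannot be optimal for the current goal, so it is dropped.
    """
    best_guaranteed = max(q_lower.values())
    return sorted(a for a in q_upper if q_upper[a] >= best_guaranteed)


# Hypothetical bounds on Q*(s, a, g) at one state for one goal.
q_lower = {"left": -9.0, "right": -4.0, "stay": -8.0}
q_upper = {"left": -6.0, "right": -2.0, "stay": -3.0}

kept = prune_actions(q_lower, q_upper)
print(kept)  # ['right', 'stay']
```

Here `left` is removed because its best case (-6.0) is worse than the worst case of `right` (-4.0). Tighter bounds prune more actions; loose bounds prune none, which is always safe.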

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the traversibility metrics remain computable from interaction alone in larger or continuous spaces, the same covering construction could reduce sample complexity in lifelong settings beyond discrete grids.
The action-pruning mechanism could be combined with standard planners to limit the branching factor during planning for new goals.
Because the covering is built without task-specific labels, it may serve as a reusable substrate for transfer even when the set of goals changes over time.

Load-bearing premise

The two metrics on the state space encode useful notions of traversibility and can be used to construct a topological covering by landmark states in a fully self-supervised manner that supports the claimed transfer mechanisms and Q-value bounds.

What would settle it

Run the greedy zombie policy on a sequence of held-out goals in a discrete gridworld and measure whether its performance stays near the oracle optimum when the landmark covering is present and drops sharply when the covering is removed or the metric-selected landmarks are replaced by random ones.

Figures

Figures reproduced from arXiv: 1907.00884 by Nick Denis.

**Figure 4.** Bandit Controller Arm Selection. Red: LOVR arm; Black: Baseline DQN arm; Green: …

Original abstract

Transfer learning across different reinforcement learning (RL) tasks is becoming an increasingly valuable area of research. We consider a goal-based multi-task RL framework and mechanisms by which previously solved tasks can reduce sample complexity and regret when the agent is faced with a new task. Specifically, we introduce two metrics on the state space that encode notions of traversibility of the state space for an agent. Using these metrics a topological covering is constructed by way of a set of landmark states in a fully self-supervised manner. We show that these landmark coverings confer theoretical advantages for transfer learning within the goal-based multi-task RL setting. Specifically, we demonstrate three mechanisms by which landmark coverings can be used for successful transfer learning. First, we extend the Landmark Options Via Reflection (LOVR) framework to this new topological covering; second, we use the landmark-centric value functions themselves as features and define a greedy zombie policy that achieves near oracle performance on a sequence of zero-shot transfer tasks; finally, motivated by the second transfer mechanism, we introduce a learned reward function that provides a more dense reward signal for goal-based RL. Our novel topological landmark covering confers beneficial theoretical results, bounding the Q values at each state-action pair. In doing so, we introduce a mechanism that performs action-pruning at infeasible actions which cannot possibly be part of an optimal policy for the current goal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes two traversibility metrics and a landmark covering for transfer in goal-based multi-task RL with three concrete mechanisms, but the claimed Q-value bounds and action pruning do not clearly follow from the metric descriptions.


The paper introduces two metrics on the state space to capture traversibility and builds a topological covering from landmark states in a self-supervised way. It then uses this for transfer in goal-based multi-task RL through three mechanisms: extending LOVR, a zombie policy from landmark features, and a learned reward function. The key theoretical claim is that the covering bounds Q-values and allows pruning infeasible actions.

What is new is the specific traversibility metrics and the landmark covering construction, along with those three enumerated transfer uses. The zombie policy idea and the action pruning are concrete proposals that could help with sample efficiency in lifelong settings. The paper does a reasonable job laying out the high-level approach and motivating why landmarks might help with transfer. The self-supervised construction is a plus if it avoids needing extra supervision.

The main soft spot is in the theoretical advantage. The Q-value bound at every state-action pair is central to all the transfer claims, but the abstract gives no details on the metric properties or how the covering radius relates to value differences. The stress-test concern is fair here: without a relation like the metric satisfying a triangle inequality with respect to the dynamics, the bound does not follow and the pruning is not justified. If the full paper does not provide that derivation or the necessary assumptions, the central result does not hold up. No experiments are described in the abstract, so we also lack evidence on whether the zombie policy gets close to oracle performance in practice.

This work is for researchers focused on transfer learning in goal-based RL and lifelong settings. Someone looking for new mechanisms might find the ideas worth exploring, but it is not a broad reorganization of the field. It deserves a serious referee if the full manuscript has the math worked out and some empirical results. Based on the abstract alone, the soundness is low.
Recommendation: Send it for peer review so that the derivations can be checked properly, but expect that the theory section will need strengthening.

Referee Report

1 major / 0 minor

Summary. The paper introduces two metrics on the state space that encode notions of traversibility, constructs a topological covering via landmark states in a fully self-supervised manner, and claims this confers theoretical advantages for transfer in goal-based multi-task RL. It demonstrates three mechanisms: extending the LOVR framework, using landmark-centric value functions as features for a greedy zombie policy that achieves near-oracle zero-shot performance, and introducing a learned reward function for denser signals. The novel covering is claimed to bound Q-values at each state-action pair, enabling a mechanism for pruning infeasible actions that cannot be part of an optimal policy for the current goal.

Significance. If the Q-value bounds are rigorously derived from the metrics and hold with respect to the underlying MDP dynamics, the work would provide a principled, self-supervised basis for action pruning and knowledge transfer across goal-based tasks in lifelong RL. The zombie policy and learned reward ideas could offer practical benefits for sample efficiency if empirically validated beyond the claimed near-oracle performance.

major comments (1)

[Abstract] Abstract (and the central theoretical claim): The manuscript asserts that the topological landmark covering 'bounds the Q values at each state-action pair' and thereby enables action pruning of infeasible actions. However, no explicit statement is given of the metric axioms satisfied by the two traversibility metrics, the covering radius, or the required relation (e.g., a Lipschitz condition or application of the triangle inequality) between metric distance and the difference in optimal Q-values induced by the transition kernel and rewards. Without this relation, the bound does not follow for a general goal-based MDP, undermining all three transfer mechanisms that rely on it.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and for identifying the need for greater explicitness in the theoretical development. The central concern is the lack of a clear statement of the metric properties and the precise relation linking the traversibility metric to optimal Q-value differences. We address this point directly below and will revise the manuscript accordingly.


Referee: [Abstract] Abstract (and the central theoretical claim): The manuscript asserts that the topological landmark covering 'bounds the Q values at each state-action pair' and thereby enables action pruning of infeasible actions. However, no explicit statement is given of the metric axioms satisfied by the two traversibility metrics, the covering radius, or the required relation (e.g., a Lipschitz condition or application of the triangle inequality) between metric distance and the difference in optimal Q-values induced by the transition kernel and rewards. Without this relation, the bound does not follow for a general goal-based MDP, undermining all three transfer mechanisms that rely on it.

Authors: We agree that the current manuscript does not state the metric axioms or the derivation of the Q-value bound with sufficient explicitness. The two traversibility metrics are constructed to satisfy the standard metric axioms (non-negativity, identity of indiscernibles, symmetry, triangle inequality). The covering radius is selected so that the metric distance d(s, s') upper-bounds |Q*(s, a, g) − Q*(s', a', g)| for goal-based rewards via repeated application of the triangle inequality to the optimal value functions, using the fact that the transition kernel respects traversibility. We will add a dedicated subsection (and appendix) that (i) lists the metric axioms, (ii) specifies the covering-radius condition, and (iii) provides the step-by-step derivation of the bound that justifies action pruning. This revision will make the theoretical support for the three transfer mechanisms fully rigorous and transparent.

revision: yes
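The four axioms named in the response can be checked mechanically on any finite distance table. The sketch below is a generic check and is not taken from the paper. The function name `is_metric` and both example tables (hop counts on a two-way chain and on a one-way cycle) are assumptions chosen for illustration.

```python
from itertools import product


def is_metric(d, tol=1e-9):
    """Check the four metric axioms on a square distance table `d`."""
    n = len(d)
    for i, j in product(range(n), repeat=2):
        if d[i][j] < -tol:                    # non-negativity
            return False
        if (d[i][j] <= tol) != (i == j):      # identity of indiscernibles
            return False
        if abs(d[i][j] - d[j][i]) > tol:      # symmetry
            return False
    for i, j, k in product(range(n), repeat=3):
        if d[i][k] > d[i][j] + d[j][k] + tol:  # triangle inequality
            return False
    return True


# Hop counts on a 3-state chain where every move can be reversed.
two_way_chain = [[0, 1, 2],
                 [1, 0, 1],
                 [2, 1, 0]]

# Hop counts on a one-way 3-cycle: 0 -> 1 -> 2 -> 0.
one_way_cycle = [[0, 1, 2],
                 [2, 0, 1],
                 [1, 2, 0]]

print(is_metric(two_way_chain))  # True
print(is_metric(one_way_cycle))  # False
```

The second table fails only on symmetry. Raw hop counts stop being a metric as soon as some transitions cannot be reversed, so the symmetry claim in the response depends on how the traversibility metrics are actually defined.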

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained


The provided abstract and description introduce two traversibility metrics, construct a topological landmark covering in a self-supervised manner, and claim that this covering yields Q-value bounds plus three transfer mechanisms. No equations, fitted parameters, or self-citations are exhibited that reduce the claimed bounds or performance to the inputs by construction. The bound is asserted as a theoretical consequence of the covering rather than defined into existence or obtained via a load-bearing self-citation chain. The central claims therefore retain independent content relative to the construction steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on standard MDP assumptions plus the new self-supervised landmark construction; no fitted parameters or invented physical entities are described.

axioms (1)

[standard math] The environment is modeled as a goal-based Markov Decision Process where value functions and optimal policies exist for each goal.
The entire transfer analysis and Q-value bounding presuppose the standard RL MDP setting invoked throughout the abstract.

invented entities (2)

Traversibility metrics on the state space (no independent evidence)
purpose: To define landmark states and construct the topological covering used for transfer.
Two new metrics are introduced in the abstract to enable the self-supervised covering; no independent evidence outside the paper is supplied.
Landmark states and topological covering (no independent evidence)
purpose: To provide structure for the three transfer mechanisms and Q-value bounds.
The covering is constructed from the metrics and is central to all claimed advantages.


