Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards

Benjamin Eysenbach; Catherine Ji; Faisal Mohamed; Glen Berseth

arxiv: 2603.02008 · v2 · submitted 2026-03-02 · 💻 cs.LG · cs.AI

Faisal Mohamed, Catherine Ji, Benjamin Eysenbach, Glen Berseth

Pith reviewed 2026-05-15 17:34 UTC · model grok-4.3

keywords: reinforcement learning, exploration, temporal contrastive representations, intrinsic motivation, locomotion, manipulation, embodied AI

The pith

Temporal contrastive representations let agents explore complex behaviors without any extrinsic rewards by seeking states whose futures are hard to predict.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method for exploration in reinforcement learning that relies on temporal contrastive representations. These representations identify states where future outcomes are unpredictable, directing the agent to visit them and thereby build useful knowledge. The approach produces complex exploratory behaviors across locomotion, manipulation, and embodied AI tasks that normally demand hand-designed rewards. It works by building directly on temporal similarities instead of learning explicit distances or maintaining episodic memory.

Core claim

We propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behavior in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms, our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.

What carries the argument

temporal contrastive representations that capture information for a wide range of potential tasks by contrasting temporal similarities while avoiding full state reconstruction, then used to prioritize states whose futures remain unpredictable under the representation
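
As a rough illustration of the machinery named above, the sketch below implements an InfoNCE-style temporal contrastive loss in plain Python. The abstract does not give the loss, so the InfoNCE form, the negative-Euclidean-distance critic, and the names `critic` and `info_nce_loss` are all assumptions made for illustration. The toy embeddings are hand-written rather than learned.

```python
import math

def critic(phi_sa, psi_sf):
    """Assumed critic: negative Euclidean distance between two embeddings."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_sa, psi_sf)))

def info_nce_loss(phi_batch, psi_batch):
    """InfoNCE over a batch (an assumption, not confirmed by the source).

    Row i of phi_batch is paired with row i of psi_batch as its positive
    future; every other row of psi_batch serves as a negative.
    """
    total = 0.0
    for i, phi in enumerate(phi_batch):
        logits = [critic(phi, psi) for psi in psi_batch]
        # Numerically stable log-sum-exp over all candidate futures.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the true future as the correct class.
        total += -(logits[i] - log_z)
    return total / len(phi_batch)

# Hypothetical toy embeddings: two state-action pairs and their futures.
phi = [[0.0, 0.0], [5.0, 5.0]]
psi_aligned = [[0.0, 0.0], [5.0, 5.0]]   # each future matches its own pair
psi_shuffled = [[5.0, 5.0], [0.0, 0.0]]  # futures swapped between pairs

loss_aligned = info_nce_loss(phi, psi_aligned)
loss_shuffled = info_nce_loss(phi, psi_shuffled)
```

On this toy batch the loss is close to zero when each state-action embedding sits on top of its own future embedding, and much larger when the pairing is shuffled, which is the pressure that pulls temporally related pairs together.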

Load-bearing premise

Prioritizing states whose futures are unpredictable under the learned temporal contrastive representation will produce complex and useful exploratory behavior across locomotion, manipulation, and embodied AI tasks without any extrinsic reward signal.

What would settle it

In a standard locomotion or manipulation task, running the method yields no increase in behavioral diversity or task-relevant skills compared with a random policy or a simple state-counting baseline.

Original abstract

Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory x in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean temporal-contrastive way to pick states for reward-free exploration, but the core link between embedding distance and actual future unpredictability is not directly checked.

The headline takeaway is that this work tries to make exploration simpler by training a temporal contrastive representation and then sending the agent toward states whose futures look dissimilar in that space. It positions the approach as lighter than quasimetric learning or episodic memory tricks, and it reports that the resulting behavior covers locomotion, manipulation, and embodied tasks without any extrinsic reward. That framing is the main novelty: a direct use of temporal similarity rather than an explicit distance model or stored trajectory buffer. If the experiments hold, the method could be useful for anyone who wants to avoid reward engineering in continuous control. The paper does a reasonable job keeping the representation focused on temporal adjacency instead of full reconstruction, which keeps compute down.

The soft spot is exactly the one the stress-test flags. The contrastive loss is defined on nearby states, but there is no separate check showing that low similarity in the embedding actually tracks high prediction error or stochasticity under a dynamics model. If the embedding mostly captures static appearance or low-variance features, the agent could end up chasing merely novel-looking states rather than genuinely unpredictable ones. That gap matters because the central claim rests on the representation doing the unpredictability work. Without that verification, the complex behaviors shown could be driven by something else in the implementation.

The paper is aimed at RL people who already work on intrinsic motivation and want a lighter alternative to current methods. It is coherent on its own terms and shows clear thinking about the representation choice, so it deserves a serious referee even if the experiments need tightening on the unpredictability measurement. I would send it out for review rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an exploration strategy in reinforcement learning that utilizes temporal contrastive representations to identify and prioritize states with unpredictable future outcomes. By building directly on temporal similarities rather than explicit distance metrics or memory mechanisms, the method aims to facilitate the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks without relying on extrinsic rewards.

Significance. If the central claim holds, this offers a simpler alternative to quasimetric or memory-based exploration methods in reward-free RL, with potential to enable sophisticated behaviors across locomotion, manipulation, and embodied AI domains through direct use of temporal contrastive similarities.

major comments (1)

Section 3: The contrastive loss is defined on temporally adjacent states, but the manuscript provides neither a derivation nor empirical validation showing that low similarity in the learned embedding space corresponds to high future unpredictability or model uncertainty. This link is load-bearing for the prioritization rule and the claim that complex exploratory behavior emerges without extrinsic rewards; if the representation collapses to static features, the method may prioritize mere dissimilarity instead.

minor comments (1)

Abstract: Typo in the phrase 'complex exploratory x in locomotion' — likely intended as 'complex exploratory behavior in locomotion'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and positive assessment of the work's potential significance. We address the single major comment below and will revise the manuscript to strengthen the relevant section.

Referee: Section 3: The contrastive loss is defined on temporally adjacent states, but the manuscript provides neither a derivation nor empirical validation showing that low similarity in the learned embedding space corresponds to high future unpredictability or model uncertainty. This link is load-bearing for the prioritization rule and the claim that complex exploratory behavior emerges without extrinsic rewards; if the representation collapses to static features, the method may prioritize mere dissimilarity instead.

Authors: We agree that the manuscript requires an explicit derivation and empirical validation to establish that low similarity under the temporal contrastive loss corresponds to high future unpredictability rather than static dissimilarity. In the revised manuscript we will add a short derivation in Section 3 showing how the contrastive objective on temporally adjacent pairs encourages embeddings in which dissimilarity reflects divergence in future trajectories (and hence model uncertainty). We will also insert new experiments that measure the correlation between embedding similarity and one-step prediction error on held-out future states across the locomotion, manipulation, and embodied-AI domains, confirming that the learned representations prioritize dynamic unpredictability rather than static features. revision: yes
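
The validation promised in this response could be set up along the following lines. This is only a sketch: the function name `pearson` and both data lists are hypothetical, and the numbers are made-up placeholders, not results from the paper. It computes a Pearson correlation between per-state embedding distance and one-step prediction error on held-out states.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder measurements for five held-out states (illustrative only):
# distance between the state-action embedding and the future-state embedding,
# and the one-step prediction error of a separately trained dynamics model.
embedding_distance = [0.1, 0.4, 0.9, 1.5, 2.2]
prediction_error = [0.05, 0.2, 0.5, 0.7, 1.3]

r = pearson(embedding_distance, prediction_error)
```

A correlation near 1 on real held-out data would support the claim that embedding dissimilarity tracks unpredictability. A value near 0 would suggest the representation is capturing something else, such as static appearance.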

Circularity Check

0 steps flagged

No load-bearing circularity; temporal contrastive prioritization is a direct construction from similarities

The paper defines exploration via temporal contrastive representations learned on adjacent states, then prioritizes states with low similarity (claimed as unpredictable futures). This is a direct application of the learned embedding distances rather than a fitted parameter renamed as a prediction, a load-bearing self-citation, or a self-definitional reduction. No equations or claims in the abstract reduce the target behavior to the inputs by construction; the central demonstration is empirical performance on locomotion and manipulation tasks. A minor self-citation risk exists in the representation learning literature, but it is not load-bearing here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that temporal representations suffice for exploration without full reconstruction; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction.
Source: stated directly in the abstract as the motivation for using temporal contrastive representations.

pith-pipeline@v0.9.0 · 5447 in / 1189 out tokens · 64574 ms · 2026-05-15T17:34:52.110493+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

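$\mathbb{E}[r_{\mathrm{intr}}(s_t, a_t)] = \mathbb{E}_{p_T(s_f \mid s_t, a_t)}\left[-C_\theta((s_t, a_t), s_f)\right] = \mathbb{E}_{p_T(s_f \mid s_t, a_t)}\left[\lVert \phi_\theta(s_t, a_t) - \psi_\theta(s_f) \rVert\right]$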
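
Read literally, the quoted expression says the expected intrinsic reward is the average distance between the state-action embedding and the embeddings of sampled future states. A minimal Monte Carlo sketch of that reading follows. The function name `intrinsic_reward` and the hand-written embeddings are hypothetical, and how futures are sampled in the paper is not stated here.

```python
import math

def intrinsic_reward(phi_sa, psi_futures):
    """Average Euclidean distance from phi(s_t, a_t) to sampled psi(s_f).

    phi_sa: embedding of the current state-action pair.
    psi_futures: embeddings of future states sampled after (s_t, a_t).
    """
    distances = [math.dist(phi_sa, psi_sf) for psi_sf in psi_futures]
    return sum(distances) / len(distances)

# Hypothetical embeddings (illustrative values, not learned).
phi_sa = [0.0, 0.0]
concentrated_futures = [[0.0, 0.0], [0.0, 0.0]]   # futures land on phi(s, a)
scattered_futures = [[3.0, 4.0], [-3.0, -4.0]]    # futures land far away

r_concentrated = intrinsic_reward(phi_sa, concentrated_futures)
r_scattered = intrinsic_reward(phi_sa, scattered_futures)
```

Under this reading, a state-action pair whose sampled futures embed far from it earns a higher reward than one whose futures embed nearby, which is the prioritization the review describes.
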
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

Relation between the paper passage and the cited Recognition theorem.

The intrinsic reward evaluates to the negative of the KL-divergence between the conditional future-state distribution $p_T(s_f \mid s_t, a_t)$ and the marginal $p_T(s_f)$.
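
One hedged way to see where a KL term could come from: assume the critic $C_\theta$ is trained with an InfoNCE-style objective and reaches its optimum. Neither assumption is confirmed by the quoted passage. Under that assumption the optimal critic is a log density ratio up to a term $c(s_t, a_t)$ that does not depend on $s_f$, and taking the expectation of its negative gives the KL divergence:

```latex
\begin{align}
C^{*}\big((s_t, a_t), s_f\big)
  &= \log \frac{p_T(s_f \mid s_t, a_t)}{p_T(s_f)} + c(s_t, a_t), \\
\mathbb{E}_{p_T(s_f \mid s_t, a_t)}\big[-C^{*}\big((s_t, a_t), s_f\big)\big]
  &= -D_{\mathrm{KL}}\big(p_T(s_f \mid s_t, a_t) \,\big\|\, p_T(s_f)\big) - c(s_t, a_t).
\end{align}
```

The quoted statement corresponds to the case where $c(s_t, a_t)$ is zero. Whether the paper's parameterization guarantees that is not something this page establishes.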

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.