Provably Efficient Maximum Entropy Exploration

Abby Van Soest; Elad Hazan; Karan Singh; Sham M. Kakade

arXiv 1812.02690 v2 pith:GCMJFQQY submitted 2018-12-06 cs.LG cs.AI stat.ML

classification cs.LG cs.AI stat.ML

keywords agent algorithm defined efficient access intrinsically learn objectives

Abstract

Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal: what might we hope that the agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies induced by the agent's behavior. For example, one natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives when given access to a black-box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting where we have sample-based access to the MDP, our proposed algorithm is provably efficient, both in terms of its sample and computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which uses an approximate MDP solver.
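The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that are not in the abstract: the transition tensor is known (the paper's tabular setting only assumes sample-based access), plain value iteration stands in for the black-box planning oracle, the objective is a smoothed entropy with a hypothetical smoothing constant `sigma`, and the Frank-Wolfe step size `eta` is fixed. All function names (`ring_mdp`, `plan`, `max_ent_mixture`, and so on) are invented for illustration.

```python
import numpy as np

def ring_mdp(num_states):
    """Toy deterministic MDP (an assumption): action 0 stays, action 1 moves right."""
    P = np.zeros((num_states, 2, num_states))
    for s in range(num_states):
        P[s, 0, s] = 1.0
        P[s, 1, (s + 1) % num_states] = 1.0
    return P

def state_distribution(P, policy, d0, gamma):
    """Discounted state-visitation distribution of a deterministic policy."""
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]  # (S, S) transitions under the policy
    # d^T = (1 - gamma) * d0^T (I - gamma * P_pi)^{-1}
    return (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, d0)

def plan(P, reward, gamma, iters=200):
    """Stand-in planning oracle: value iteration on a state-based reward."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = reward[:, None] + gamma * (P @ V)  # (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def smoothed_entropy(d, sigma):
    """Assumed objective: -sum_s d(s) * log(d(s) + sigma)."""
    return -np.sum(d * np.log(d + sigma))

def grad_smoothed_entropy(d, sigma):
    """Gradient of the smoothed entropy with respect to d."""
    return -(np.log(d + sigma) + d / (d + sigma))

def max_ent_mixture(P, d0, gamma=0.9, sigma=1e-3, eta=0.1, steps=200):
    """Frank-Wolfe over state distributions, yielding a mixture of policies."""
    S = P.shape[0]
    policies = [np.zeros(S, dtype=int)]  # arbitrary initial policy
    dists = [state_distribution(P, policies[0], d0, gamma)]
    weights = np.array([1.0])
    for _ in range(steps):
        d_mix = weights @ np.array(dists)
        # The gradient of the objective plays the role of a reward function.
        reward = grad_smoothed_entropy(d_mix, sigma)
        pi = plan(P, reward, gamma)
        policies.append(pi)
        dists.append(state_distribution(P, pi, d0, gamma))
        # Conditional-gradient step: shrink old weights, add the new policy.
        weights = np.append((1 - eta) * weights, eta)
    return policies, weights, weights @ np.array(dists)
```

A possible usage, under the same assumptions: build `P = ring_mdp(5)`, start from `d0 = np.eye(5)[0]`, and call `max_ent_mixture(P, d0)`. The returned weights define a mixture that samples one policy per episode, and the last return value is the state distribution induced by that mixture. The fixed step size keeps the sketch short; a decaying schedule is a common alternative for conditional gradient methods.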

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Compute- and Data-Optimal Pretraining
cs.LG 2026-07 conditional novelty 7.0

Pretraining loss obeys a single law in which repeated or paraphrased tokens count as η(N, data-per-parameter, expansion-ratio) fresh tokens, with total effective data saturating as derived tokens grow.