When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Tengyang Xie; Xuanfei Ren

arxiv: 2606.18531 · v1 · pith:APIXJFSOnew · submitted 2026-06-16 · 📊 stat.ML · cs.LG

Pith reviewed 2026-06-26 22:03 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords offline reinforcement learning · trajectory-level supervision · outcome-level feedback · sample complexity · pessimistic actor-critic · concentrability · preference-based RL

The pith

Trajectory-level labels incur an H squared statistical cost in offline RL but permit efficient learning under concentrability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical theory for offline policy optimization when datasets provide only scalar trajectory outcomes rather than per-step rewards. In the standard expected cumulative reward setting, the proposed OPAC algorithm achieves a high-probability guarantee scaling as order H squared times square root of concentrability over n samples, with a matching lower bound that shows the precise cost of the information loss. The analysis extends to preference feedback while preserving the leading terms and then identifies when generalized nonlinear outcome objectives remain tractable. A reader cares because many real datasets record only final outcomes, so knowing the exact price of that restriction determines whether offline control stays feasible.

Core claim

In the canonical setting where each trajectory supplies a scalar label whose conditional mean equals the cumulative return, OPAC learns a latent reward model and attains a high-probability performance guarantee of order $\widetilde O(H^2\sqrt{C_{sa}(\pi^\star)/n})$, matched by a lower bound. Preference-based feedback preserves the horizon and concentrability dependence up to preference-model constants. For nonlinear aggregations of latent rewards, the problem requires $\Omega(2^H)$ trajectories in general, yet becomes polynomially learnable under the structural coefficients $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, which control information loss in outcome aggregation and generalized Bellman updates.
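To make the stated rate concrete, the sketch below evaluates the leading term of the bound and inverts it for the number of trajectories needed to reach a target suboptimality. It is a back-of-the-envelope aid only: the constants and logarithmic factors hidden by the tilde are dropped, and the function names are illustrative rather than taken from the paper.

```python
import math


def leading_rate(horizon: int, concentrability: float, n: int) -> float:
    """Leading term H^2 * sqrt(C / n); constants and log factors are dropped."""
    return horizon**2 * math.sqrt(concentrability / n)


def trajectories_needed(horizon: int, concentrability: float, eps: float) -> int:
    """Smallest n with H^2 * sqrt(C / n) <= eps, i.e. n >= H^4 * C / eps^2."""
    return math.ceil(horizon**4 * concentrability / eps**2)


if __name__ == "__main__":
    # Hypothetical values chosen only to show how n grows with the horizon.
    for H in (5, 10, 20):
        print(H, trajectories_needed(H, concentrability=4.0, eps=0.5))
```

Under this reading, doubling the horizon multiplies the required number of trajectories by sixteen, since the sample size scales with the fourth power of H at fixed accuracy.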

What carries the argument

OPAC, the pessimistic actor-critic that learns a latent reward model from trajectory labels; the pair of structural coefficients kappa and chi that quantify information loss under outcome aggregation.
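The pessimism principle behind algorithms of this family can be illustrated with a generic lower-confidence-bound rule. The sketch below is not OPAC: the penalty form `beta / sqrt(count)` and all names are assumptions chosen for illustration, showing only how poorly covered actions get discounted.

```python
import math


def pessimistic_value(estimate: float, count: int, beta: float = 1.0) -> float:
    """Estimated value minus an uncertainty penalty that shrinks with data coverage."""
    return estimate - beta / math.sqrt(max(count, 1))


def pick_action(estimates: dict, counts: dict, beta: float = 1.0) -> str:
    """Choose the action whose pessimistic (lower-confidence) value is largest."""
    return max(
        estimates,
        key=lambda a: pessimistic_value(estimates[a], counts[a], beta),
    )


if __name__ == "__main__":
    # Hypothetical estimates: "a" looks better but was observed only once.
    estimates = {"a": 1.0, "b": 0.8}
    counts = {"a": 1, "b": 100}
    print(pick_action(estimates, counts))
```

With the penalty switched off (`beta=0`) the rule reverts to greedy selection on the raw estimates, which is the behavior pessimism is meant to guard against when coverage is thin.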

If this is right

The statistical price of one trajectory label versus process rewards is exactly quadratic in horizon length.
Preference-based supervision keeps the same leading dependence on H and concentrability.
Generalized all-success or nonlinear objectives are statistically intractable without further structure.
Polynomial sample complexity holds once kappa and chi remain bounded.
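A toy comparison of two aggregation rules gives some intuition for why nonlinear outcomes can hide information. The sketch assumes binary per-step rewards and treats "all-success" as the label that equals 1 only when every step succeeds; it is an illustration of information loss, not the paper's lower-bound construction.

```python
from itertools import product


def sum_label(step_rewards):
    """Canonical aggregation: the label is the cumulative reward."""
    return sum(step_rewards)


def all_success_label(step_rewards):
    """Nonlinear aggregation: 1 only if every step reward equals 1."""
    return int(all(r == 1 for r in step_rewards))


def count_successes(horizon):
    """Number of binary reward sequences of this length with all-success label 1."""
    return sum(all_success_label(seq) for seq in product((0, 1), repeat=horizon))


if __name__ == "__main__":
    H = 4
    print(2**H, count_successes(H))  # total sequences vs. sequences labeled 1
```

The sum label separates sequences with different numbers of successful steps, while the all-success label assigns 0 to every sequence except one out of the 2^H possibilities.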

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithms could adaptively select which trajectories to label to reduce effective concentrability for outcome supervision.
The same information-loss coefficients may bound complexity in related settings such as goal-conditioned RL.
Empirical checks of kappa and chi on a given aggregation function can decide whether outcome supervision is viable before data collection.

Load-bearing premise

The scalar label on each trajectory has conditional expectation exactly equal to the sum of the latent per-step rewards.
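The premise can be made tangible with a toy tabular simulation in which latent per-step rewards are recovered from trajectory labels alone. Everything here is an assumption for illustration: rewards are stationary across steps, each step visits a uniformly random state-action pair, noise is Gaussian, and ordinary least squares stands in for whatever reward-model fit the paper actually uses.

```python
import numpy as np


def simulate_labels(n, horizon, n_pairs, noise_std, rng):
    """Return (visit_counts, labels, true_rewards) for a toy tabular problem."""
    true_rewards = rng.uniform(0.0, 1.0, size=n_pairs)
    # Hypothetical behavior data: each step visits a uniformly random (s, a) pair.
    visits = rng.integers(0, n_pairs, size=(n, horizon))
    counts = np.zeros((n, n_pairs))
    for pair in range(n_pairs):
        counts[:, pair] = (visits == pair).sum(axis=1)
    # Label = cumulative latent reward + zero-mean noise,
    # so the conditional mean of the label equals the return.
    labels = counts @ true_rewards + rng.normal(0.0, noise_std, size=n)
    return counts, labels, true_rewards


def fit_latent_rewards(counts, labels):
    """Least-squares estimate of per-pair rewards from trajectory-level labels only."""
    estimate, *_ = np.linalg.lstsq(counts, labels, rcond=None)
    return estimate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts, labels, true_rewards = simulate_labels(5000, 5, 6, 0.1, rng)
    estimate = fit_latent_rewards(counts, labels)
    print(np.max(np.abs(estimate - true_rewards)))
```

The regression works only because the label mean is linear in the visit counts; if the label were a nonlinear function of the per-step rewards, this identification argument would no longer apply.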

What would settle it

A concrete MDP instance with deterministic transitions and constant concentrability in which every algorithm given n labeled trajectories still incurs suboptimality of order H squared times the square root of C over n on the expected-return objective.

Original abstract

Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde O(H^2\sqrt{C_{sa}(\pi^\star)/n})$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require $\Omega(2^H)$ trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
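For the preference-based extension, the abstract does not name the preference model. The sketch below assumes a Bradley-Terry (logistic) link between the return difference of two trajectories and the probability that the first is preferred; the function name and the choice of link are assumptions, not details confirmed by the abstract.

```python
import math


def preference_probability(return_a: float, return_b: float) -> float:
    """P(trajectory A is preferred over B) under an assumed Bradley-Terry link."""
    diff = return_a - return_b
    # Numerically stable logistic function for both signs of the difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Hypothetical cumulative returns of two trajectories.
    print(preference_probability(3.0, 2.0))
```

Under such a link, a preference label reveals only a noisy comparison of two returns, which is one way to see why constants depending on the preference model would enter the guarantee.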

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives matching upper and lower bounds showing that trajectory-level labels cost an extra H factor in offline RL sample complexity, plus conditions for when nonlinear outcome supervision stays tractable.

The main point is that this work pins down the statistical cost of moving from per-step rewards to one scalar label per trajectory. In the canonical case they get a high-probability bound of order H squared times square root of concentrability over n, with a matching lower bound. That extra horizon factor is the price of the missing per-step information, and the lower bound makes the claim sharp.

They introduce OPAC, which first fits a latent reward model from the trajectory labels then runs pessimistic actor-critic. The analysis follows the usual offline RL template once the latent rewards are in hand. They also extend the same style of bound to preference feedback, keeping the leading terms. For the generalized nonlinear aggregation setting they show an exponential lower bound for all-success objectives even under deterministic dynamics and constant concentrability, then isolate two structural coefficients that restore polynomial rates.

The matching lower bound and the explicit tractability conditions are the clearest contributions. The paper does a clean job separating the cases where outcome supervision is fine from the cases where it creates a hard barrier. The citation pattern looks standard for the offline RL literature.

A minor soft spot is that the latent reward estimation step could hide extra concentrability or coverage requirements that only appear in the full proofs; the abstract does not make those explicit. Otherwise the argument structure holds together.

This is for people who work on sample complexity in offline RL and want to understand supervision models beyond the usual per-step rewards. It is worth a serious referee because the bounds are tight and the tractability conditions are concrete enough to be useful.

Referee Report

0 major / 2 minor

Summary. The paper develops a statistical theory for offline RL under trajectory-level supervision. In the canonical setting (expected cumulative reward objective with scalar trajectory labels whose conditional mean is the return), it proposes the OPAC pessimistic actor-critic algorithm that learns a latent reward model and proves a high-probability guarantee of order $\widetilde O(H^2\sqrt{C_{sa}(\pi^\star)/n})$ together with a matching lower bound. The analysis is extended to preference-based feedback (preserving the leading H and concentrability dependence up to model constants) and to generalized nonlinear outcome aggregation, where polynomial sample complexity holds only under the explicit structural coefficients $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, while some regimes (e.g., all-success objectives) require $\Omega(2^H)$ trajectories even under deterministic transitions and constant concentrability.

Significance. If the derivations hold, the work supplies a sharp characterization of when trajectory-level labels permit sample-efficient offline control versus when they induce fundamental barriers. The matching upper/lower bounds in the canonical regime, the explicit algorithm, and the identification of tractable versus intractable regimes via structural coefficients constitute a substantive contribution to the theory of supervision granularity in offline RL.

minor comments (2)

[Abstract] The notation $C_{sa}(\pi^\star)$ is introduced without an explicit definition or reference to its precise dependence on the state-action occupancy; a one-sentence clarification in the abstract or early introduction would improve readability.
[Abstract] The abstract states that generalized OPAC achieves polynomial sample complexity under $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, but does not indicate whether these coefficients appear in the leading term of the sample-complexity bound or only in lower-order factors; adding this detail would help readers assess the practical scope of the positive result.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not list any specific major comments requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central claims rest on an explicit algorithm (OPAC) that learns a latent reward model from trajectory labels and then applies pessimistic offline RL analysis, together with a matching lower bound that is presented as an independent characterization of the statistical cost. No step reduces by construction to a fitted parameter renamed as a prediction, a self-definitional equivalence, or a load-bearing self-citation chain; the structural coefficients in the generalized setting are introduced as explicit assumptions rather than derived from the target result. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields minimal ledger; the work relies on standard offline-RL concentrability and concentration tools plus the modeling assumption that trajectory labels have conditional mean equal to cumulative return.

axioms (2)

domain assumption: Scalar trajectory label has conditional expectation equal to cumulative return.
Explicitly stated as the canonical setting in the abstract.
standard math: Standard concentration inequalities and concentrability coefficients apply.
Invoked to obtain the stated high-probability bounds.

pith-pipeline@v0.9.1-grok · 5834 in / 1315 out tokens · 49703 ms · 2026-06-26T22:03:46.859992+00:00 · methodology


1972

[27] [27]

2022 , url =

Reinforcement Learning: Theory and Algorithms , author =. 2022 , url =

2022

[28] [28]

arXiv preprint arXiv:2312.16730 , year =

Foundations of reinforcement learning and interactive decision making , author =. arXiv preprint arXiv:2312.16730 , year =

work page arXiv

[29] [29]

IEEE Transactions on Machine Learning in Communications and Networking , volume =

Reinforcement learning with non-cumulative objective , author =. IEEE Transactions on Machine Learning in Communications and Networking , volume =. 2023 , publisher =

2023

[30] [30]

Advances in Neural Information Processing Systems , volume =

Planning with general objective functions: Going beyond total rewards , author =. Advances in Neural Information Processing Systems , volume =

[31] [31]

URL https://arxiv.org/abs/2402.01361 , year=

To the max: Reinventing reward in reinforcement learning , author=. URL https://arxiv.org/abs/2402.01361 , year=

work page arXiv

[32] [32]

arXiv preprint arXiv:2010.11863 , year=

Planning with submodular objective functions , author=. arXiv preprint arXiv:2010.11863 , year=

work page arXiv 2010

[33] [33]

Advances in Neural Information Processing Systems , volume=

Variational policy gradient method for reinforcement learning with general utilities , author=. Advances in Neural Information Processing Systems , volume=

[34] [34]

International Conference on Machine Learning , pages=

Reinforcement learning with general utilities: Simpler variance reduction and large state-action space , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[35] [35]

arXiv preprint arXiv:2403.06323 , year=

A reductions approach to risk-sensitive reinforcement learning with optimized certainty equivalents , author=. arXiv preprint arXiv:2403.06323 , year=

work page arXiv

[36] [36]

International Conference on Machine Learning , pages=

Near-minimax-optimal risk-sensitive reinforcement learning with cvar , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[37] [37]

Advances in Neural Information Processing Systems , volume=

Regret bounds for risk-sensitive reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

[38] [38]

arXiv preprint arXiv:2505.04553 , year=

Risk-sensitive Reinforcement Learning Based on Convex Scoring Functions , author=. arXiv preprint arXiv:2505.04553 , year=

work page arXiv

[39] [39]

2020 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Control synthesis from linear temporal logic specifications using model-free reinforcement learning , author=. 2020 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2020 , organization=

2020

[40] [40]

arXiv preprint arXiv:2408.09495 , year=

Directed exploration in reinforcement learning from linear temporal logic , author=. arXiv preprint arXiv:2408.09495 , year=

work page arXiv

[41] [41]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Computably continuous reinforcement-learning objectives are PAC-learnable , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[42] [42]

International conference on machine learning , pages=

Is pessimism provably efficient for offline rl? , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[43] [43]

Advances in Neural Information Processing Systems , volume=

Bridging offline reinforcement learning and imitation learning: A tale of pessimism , author=. Advances in Neural Information Processing Systems , volume=

[44] [44]

Advances in Neural Information Processing Systems , volume=

Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage , author=. Advances in Neural Information Processing Systems , volume=

[45] [45]

Advances in neural information processing systems , volume=

Towards instance-optimal offline reinforcement learning with pessimism , author=. Advances in neural information processing systems , volume=

[46] [46]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

On instance-dependent bounds for offline reinforcement learning with linear function approximation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[47] [47]

The Annals of Statistics , volume=

Settling the sample complexity of model-based offline reinforcement learning , author=. The Annals of Statistics , volume=. 2024 , publisher=

2024

[48] [48]

arXiv preprint arXiv:2111.10919 , year=

Offline reinforcement learning: Fundamental barriers for value function approximation , author=. arXiv preprint arXiv:2111.10919 , year=

work page arXiv

[49] [49]

International Conference on Machine Learning , pages=

Batch value-function approximation with only realizability , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[50] [50]

arXiv preprint arXiv:2406.11686 , year=

The role of inherent bellman error in offline reinforcement learning with linear function approximation , author=. arXiv preprint arXiv:2406.11686 , year=

work page arXiv

[51] [51]

Conference on Learning Theory , pages=

Offline reinforcement learning with realizability and single-policy concentrability , author=. Conference on Learning Theory , pages=. 2022 , organization=

2022

[52] [52]

International conference on machine learning , pages=

Pessimistic q-learning for offline reinforcement learning: Towards optimal sample complexity , author=. International conference on machine learning , pages=. 2022 , organization=

2022

[53] [53]

Mathematics of Operations Research , volume=

Provably efficient reinforcement learning with linear function approximation , author=. Mathematics of Operations Research , volume=. 2023 , publisher=

2023

[54] [54]

Advances in neural information processing systems , volume=

Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms , author=. Advances in neural information processing systems , volume=

[55] [55]

Advances in neural information processing systems , volume=

Provable benefits of actor-critic methods for offline reinforcement learning , author=. Advances in neural information processing systems , volume=

[56] [56]

Advances in neural information processing systems , volume=

Conservative q-learning for offline reinforcement learning , author=. Advances in neural information processing systems , volume=

[57] [57]

International conference on machine learning , pages=

Off-policy deep reinforcement learning without exploration , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[58] [58]

Advances in neural information processing systems , volume=

A minimalist approach to offline reinforcement learning , author=. Advances in neural information processing systems , volume=

[59] [59]

Offline Reinforcement Learning with Implicit Q-Learning

Offline reinforcement learning with implicit q-learning , author=. arXiv preprint arXiv:2110.06169 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

International Conference on Machine Learning , pages=

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint , author=. International Conference on Machine Learning , pages=. 2024 , organization=

2024

[61] [61]

Statistical Science , volume=

Offline reinforcement learning in large state spaces: Algorithms and guarantees , author=. Statistical Science , volume=. 2025 , publisher=

2025

[62] [62]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[63] [63]

International Conference on Machine Learning , pages=

Minimax-optimal off-policy evaluation with linear function approximation , author=. International Conference on Machine Learning , pages=. 2020 , organization=

2020

[64] [64]

arXiv preprint arXiv:2210.06718 , year=

Hybrid rl: Using both offline and online data can make rl efficient , author=. arXiv preprint arXiv:2210.06718 , year=

work page arXiv

[65] [65]

SIAM journal on control and optimization , volume=

Performance bounds in l\_p-norm for approximate value iteration , author=. SIAM journal on control and optimization , volume=. 2007 , publisher=

2007

[66] [66]

Advances in neural information processing systems , volume=

Error propagation for approximate policy and value iteration , author=. Advances in neural information processing systems , volume=

[67] [67]

Expert Systems with Applications , volume=

Maximum reward reinforcement learning: A non-cumulative reward criterion , author=. Expert Systems with Applications , volume=. 2006 , publisher=

2006

[68] [68]

arXiv preprint arXiv:2010.03744 , year=

Maximum reward formulation in reinforcement learning , author=. arXiv preprint arXiv:2010.03744 , year=

work page arXiv 2010

[69] [69]

International Conference on Algorithmic Learning Theory , pages=

An efficient algorithm for learning with semi-bandit feedback , author=. International Conference on Algorithmic Learning Theory , pages=. 2013 , organization=

2013

[70] [70]

Proceedings of the AAAI conference on artificial intelligence , volume=

Reinforcement learning with trajectory feedback , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[71] [71]

Advances in Neural Information Processing Systems , volume=

On the theory of reinforcement learning with once-per-episode feedback , author=. Advances in Neural Information Processing Systems , volume=

[72] [72]

International Conference on Machine Learning , pages=

Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022

[73] [73]

arXiv preprint arXiv:2405.07637 , year=

Near-optimal regret in linear mdps with aggregate bandit feedback , author=. arXiv preprint arXiv:2405.07637 , year=

work page arXiv

[74] [74]

arXiv preprint arXiv:2502.04004 , year=

Near-optimal regret using policy optimization in online mdps with aggregate bandit feedback , author=. arXiv preprint arXiv:2502.04004 , year=

work page arXiv

[75] [75]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[76] [76]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[77] [77]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[78] [78]

Nature , volume=

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author=. Nature , volume=. 2025 , publisher=

2025

[79] [79]

Advances in neural information processing systems , volume=

Learning to summarize with human feedback , author=. Advances in neural information processing systems , volume=

[80] [80]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv