
arxiv: 2606.01123 · v1 · pith:NC2XHRIXnew · submitted 2026-05-31 · 💻 cs.LG

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

Pith reviewed 2026-06-28 17:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords: preference-based reinforcement learning · reward-free representation learning · offline reinforcement learning · successor measures · contrastive search · human feedback · preference efficiency

The pith

Learning successor-measure representations from reward-free offline data improves preference efficiency in offline PbRL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new framework for offline preference-based reinforcement learning that first learns latent successor-measure representations from reward-free offline data. It then performs contrastive search and fine-tuning steps that incorporate pairwise human preference feedback. Experiments and ablations demonstrate that this yields superior preference efficiency compared to existing two-stage offline PbRL methods that first train a reward or preference model. A sympathetic reader would care because the approach suggests a route to reduce the volume of human preference labels required for effective policy learning.

Core claim

The central claim is that a training framework connecting reward-free representation learning with preference-based RL, by first acquiring latent successor-measure representations from unlabeled offline data and then applying contrastive search plus fine-tuning on preference data, delivers higher preference efficiency than standard offline PbRL pipelines.

What carries the argument

Latent successor-measure representations learned from reward-free offline data, which act as the foundation enabling subsequent contrastive search and fine-tuning with preference labels.
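
To make the role of these representations concrete, here is a minimal sketch. It assumes the forward-backward (FB) parametrisation named in the Figure 4 caption, in its standard form: a task vector z = E[r(s) B(s)] and a value estimate Q(s, a) = F(s, a, z) · z. The tables `B` and `F`, the two-state toy problem, and every function name are hypothetical. `F` is treated as already evaluated at the inferred z, and the sketch infers z from ground-truth rewards, whereas the paper specifies the task from preference data.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def infer_task_vector(B, rewards):
    """Standard FB task inference: z = mean over dataset states of r(s) * B(s)."""
    dim = len(next(iter(B.values())))
    z = [0.0] * dim
    for state, embedding in B.items():
        for i in range(dim):
            z[i] += rewards[state] * embedding[i] / len(B)
    return z

def greedy_action(F, state, z):
    """Pick argmax_a F(s, a) . z, the FB estimate of the optimal Q-value."""
    return max(F[state], key=lambda action: dot(F[state][action], z))

# Hypothetical toy tables (assumption: F is already conditioned on z).
B = {"s0": [1.0, 0.0], "s1": [0.0, 1.0]}
F = {"s0": {"stay": [1.0, 0.0], "move": [0.0, 1.0]}}
rewards = {"s0": 0.0, "s1": 1.0}

z = infer_task_vector(B, rewards)
action = greedy_action(F, "s0", z)
```

In this toy case only state `s1` is rewarded, so the inferred vector points along the second latent dimension and the greedy choice in `s0` is the action whose forward embedding aligns with it.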

If this is right

  • The method achieves superior preference efficiency over offline PbRL baselines.
  • It establishes the first explicit connection between RFRL and PbRL.
  • It positions reward-free representation learning as a route to feedback-efficient solutions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same representation-first pattern could be tested in online PbRL settings where preferences arrive incrementally.
  • The approach might reduce human labeling costs in domains such as robotic control or language model alignment that rely on preference feedback.
  • Similar integrations of reward-free representations could be explored for other costly feedback modalities beyond pairwise preferences.

Load-bearing premise

Successor-measure representations learned from reward-free offline data supply a sufficiently rich basis for the contrastive search and fine-tuning steps that use preference data.
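
One way to picture the contrastive search step is as choosing, from a set of candidate task vectors, the one that best separates preferred from non-preferred segment embeddings. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the two-way softmax loss, the temperature `tau`, the cosine scoring, and the candidate set are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0 else 0.0

def pair_loss(z, z_pref, z_rej, tau=0.5):
    """Negative log-softmax of the preferred segment within one labeled pair."""
    s_pos = cosine(z, z_pref) / tau
    s_neg = cosine(z, z_rej) / tau
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    log_denom = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denom - s_pos

def contrastive_search(candidates, pairs, tau=0.5):
    """Return the candidate with the lowest total loss over all preference pairs."""
    return min(candidates,
               key=lambda z: sum(pair_loss(z, p, r, tau) for p, r in pairs))

# Hypothetical segment embeddings: (preferred, rejected) for each labeled pair.
pairs = [([1.0, 0.1], [0.1, 1.0]), ([0.9, 0.0], [0.0, 0.8])]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
best = contrastive_search(candidates, pairs)
```

The premise above matters here: such a search can only succeed if the reward-free embeddings already place preferred and rejected segments in distinguishable regions of the latent space.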

What would settle it

The efficiency claim would be falsified by controlled experiments on standard offline PbRL benchmarks in which the proposed method needs at least as many preference labels as existing baselines to reach equivalent performance.
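
The criterion can be written as a small check. This is a sketch only: the function names, the curve format (number of preference pairs mapped to average return), and the numbers are placeholders for illustration, not results from the paper.

```python
def labels_to_reach(curve, target):
    """Smallest preference budget whose return meets the target, or None."""
    reached = [n for n, ret in curve.items() if ret >= target]
    return min(reached) if reached else None

def efficiency_verdict(method_curve, baseline_curve, target):
    """Apply the falsification criterion at a single performance target."""
    m = labels_to_reach(method_curve, target)
    b = labels_to_reach(baseline_curve, target)
    if m is None and b is None:
        return "undetermined"  # neither method reaches the target
    if m is None:
        return "falsified"     # only the baseline reaches the target
    if b is None:
        return "supported"     # only the proposed method reaches the target
    return "falsified" if m >= b else "supported"

# Placeholder curves (made-up values, NOT taken from the paper).
method_curve = {100: 0.4, 200: 0.7, 500: 0.8}
baseline_curve = {100: 0.2, 200: 0.5, 500: 0.7}
verdict = efficiency_verdict(method_curve, baseline_curve, target=0.7)
```

A real test would repeat this over several targets, tasks, and seeds, since a single target can flatter either method.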

Figures

Figures reproduced from arXiv: 2606.01123 by Chia-Heng Hsu, Jun-Jie Yang, Kui-Yuan Chen, Ping-Chun Hsieh.

Figure 1: An illustration of the commonly used PbRL training pipelines and our proposed FB-PbRL framework. model from preferences (e.g., under the Bradley–Terry formulation) and then use this reward for RFRL-style task specification. However, this approach inherits the same reward over-optimization and generalization issues in conventional PbRL, particularly when preference data is scarce. This motivates a fundame…
Figure 2: Reward over-optimization under the proxy rewards on Walker-run and Walker-flip. “FB-BT-T” performs test-time task inference using a proxy BT-based reward function, whereas “FB” uses ground-truth rewards in task specification. (a) The bar chart reports average return with error bars indicating standard deviation over 5 seeds. (b) The scatter plot visualizes predicted rewards under BT modeling against the gr…
Figure 3: Reward-aligned latent geometry. t-SNE visualization of the learned latent $z_\sigma \equiv B_\omega(\sigma)$ on the Walker-walk task. (a) Test-time preference-based task specification without fine-tuning. (b) Preference-guided fine-tuning. The $z_\sigma$ of the offline trajectory segments are colored by ground-truth returns, with brighter hues indicating higher returns. The final task specifications $z^*_{\mathrm{CPTS}}$ and $z^*_{\mathrm{PGFT}}$ are marked by a…
Figure 4: The overall training pipeline of FB-PbRL. The architecture comprises two integral parts: (Top) Forward-Backward Framework: Adopt FB decomposition to learn latent state-action representations and a latent-conditioned policy from offline data. (Bottom) Preference-Guided Search and Fine-tuning: Leverage segment embeddings and a contrastive objective over preference data to refine FB representations and opt…
Figure 5a shows that FB-PbRL degrades gracefully as the number of preference pairs decreases. Reducing the budget from 2,000 to 200 pairs results in only about a 10% performance drop. Across all budgets, FB-PbRL consistently outperforms PSM, the strongest RFRL baseline in
Figure 6: Training efficiency over wall-clock time. Training curves on the four Walker tasks evaluated under equal wall-clock time budgets. FB-PbRL demonstrates highly efficient task-specific fine-tuning, rapidly converging and surpassing the strongest offline PbRL baseline in approximately one hour of training time. Shaded regions denote the standard deviation across 3 seeds. 5. Related Work Offline PbRL with rewar…
Figure 7: Robustness analysis of FB-PbRL in the Walker domain. (a) Performance scaling with respect to the number of preference pairs. (b) Robustness against noisy preference annotations. Results in (a) and (b) show performance averaged over four tasks in the Walker domain; the dashed lines indicate the best performance achieved by offline PbRL and zero-shot RFRL baselines. In all plots, shaded regi…
Figure 8: Dataset coverage and in the PointMass domain. The layout of the maze is centered around a cross-shaped obstacle (white/empty region), with the dots representing the four target goal states. (a) The state visitation heatmap of the RND PointMass dataset reveals a heavy bias toward the top-left region where the agent is initialized. (b) Analysis of sparsely visited states shows that the bottom-right goal area…
Figure 9: Robustness analysis against preference noise in the Quadruped domain. The plot illustrates the performance of FB-PbRL and the DPPO baseline as the preference noise ratio increases. Results are averaged over four Quadruped tasks. In all plots, shaded regions indicate the standard deviation across 5 seeds. F.10. Extension to Online PbRL Settings While the primary focus of this work is on the offline regime, ou…
Figure 10: FB-PbRL compared with LiRE, OPPO, and DPPO. In all plots, shaded regions indicate the standard deviation across 5 seeds. Contrastive objective. We further compare our SimCLR-style contrastive learning objective with a margin-based alternative. Across nearly all tasks, the margin-based objective results in inferior and less stable performance, highlighting the advantage of the SimCLR formulation adopted in…
Figure 11: FB-PbRL compared with OPRL and CLARIFY. In all plots, shaded regions indicate the standard deviation across 5 seeds. via the backward encoder, and minimize the cosine distance $L_{\mathrm{recon}} = 1 - \cos(z, \hat{z})$. However, this approach incurs substantial computational overhead, as it requires repeated environment rollouts during training. In practice, incorporating this loss significantly increases training time and …
Figure 12: Comparison between the PbRL and zero-shot RFRL protocols. In all plots, shaded regions indicate the standard deviation across 5 seeds.
Figure 13: Comparison of FB-PbRL under different numbers of preference pairs. In all plots, shaded regions indicate the standard deviation across 5 seeds.
Figure 14: Comparison between different noise settings. In all plots, shaded regions indicate the standard deviation across 5 seeds.
Figure 15: Comparison between different contrastive coefficients. In all plots, shaded regions indicate the standard deviation across 5 seeds.
Figure 16: Comparison between different segment sizes. In all plots, shaded regions indicate the standard deviation across 3 seeds.
Figure 17: Comparison between contrastive objectives. In all plots, shaded regions indicate the standard deviation across 5 seeds.
Original abstract

Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution. Our code is publicly available at https://github.com/rl-bandits-lab/FB-PbRL.
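
For readers unfamiliar with the first stage of the two-stage pipeline, the text reproduced beside Figure 1 names the Bradley–Terry formulation as a typical choice for the reward model. Below is a minimal sketch of that formulation, in which the probability that segment A is preferred over segment B is the logistic function of the difference in summed predicted rewards. The function names and the toy reward lists are hypothetical, and this is not the paper's code.

```python
import math

def bt_prob(rewards_a, rewards_b):
    """P(A preferred over B) = sigmoid(sum of A's rewards - sum of B's rewards)."""
    diff = sum(rewards_a) - sum(rewards_b)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)  # stable branch for large negative differences
    return e / (1.0 + e)

def bt_loss(rewards_a, rewards_b, a_preferred):
    """Negative log-likelihood of one preference label under Bradley-Terry."""
    p = bt_prob(rewards_a, rewards_b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -math.log(p) if a_preferred else -math.log(1.0 - p)

# Hypothetical per-step predicted rewards for two trajectory segments.
seg_a = [1.0, 1.0, 1.0]
seg_b = [0.0, 0.0, 0.0]
loss_consistent = bt_loss(seg_a, seg_b, a_preferred=True)
loss_inconsistent = bt_loss(seg_a, seg_b, a_preferred=False)
```

Training a reward model then amounts to minimising this loss over labeled pairs, after which standard offline RL runs on the relabeled data. The abstract's proposal replaces that reward-fitting stage with reward-free representations plus preference-guided search and fine-tuning.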

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper proposes rethinking offline preference-based RL (PbRL) by integrating reward-free representation learning (RFRL). It first learns latent successor-measure representations from reward-free offline data, then applies contrastive search and fine-tuning steps that leverage limited preference data. The central claim is that this yields superior preference efficiency over standard offline PbRL baselines, with the work positioned as the first explicit connection between RFRL and PbRL; code is released publicly.

Significance. If the experimental claims hold, the approach could improve feedback efficiency in PbRL by reusing reward-free data for richer representations, reducing reliance on expensive human preferences. Public code release supports reproducibility, a positive factor in the assessment.

minor comments (1)
  1. The abstract references 'extensive experiments and ablations' but provides no details on datasets, baselines, or metrics; a results section or table would be needed to evaluate the superiority claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript, recognition of its potential impact on feedback efficiency in PbRL, and positive note on the public code release. The recommendation is listed as uncertain, but the MAJOR COMMENTS section contains no specific points or questions. We therefore have no individual comments to address point-by-point and remain available to provide further clarification or additional experiments if requested.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper proposes applying existing reward-free representation learning (RFRL) techniques—specifically successor-measure representations learned from unlabeled offline data—to the offline preference-based RL setting, followed by contrastive search and fine-tuning on preference labels. No equations, derivations, or load-bearing steps are shown that reduce a claimed prediction or uniqueness result to a fitted quantity defined by the same data, a self-citation chain, or an ansatz smuggled via prior work by the same authors. The abstract and described pipeline treat RFRL as an external starting point and present the connection plus empirical results as the contribution, without any of the six enumerated circular patterns. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; specific free parameters, axioms, or invented entities are not detailed. The approach rests on standard RL assumptions about successor measures and contrastive learning.

axioms (1)
  • domain assumption Latent successor-measure representations can be learned from reward-free offline data and transferred to preference-based fine-tuning
    This premise underpins the first stage of the proposed pipeline.

pith-pipeline@v0.9.1-grok · 5683 in / 1213 out tokens · 32964 ms · 2026-06-28T17:26:48.782182+00:00 · methodology

