pith. machine review for the scientific record.

arxiv: 2605.12288 · v2 · submitted 2026-05-12 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords direct preference optimization · token-level alignment · Bregman divergence · density ratio matching · language model alignment · RL-free alignment · Bradley-Terry model

The pith

Token-level Bregman Preference Optimization matches density ratios at each token to recover per-prefix optimality from sequence-level preference data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Direct Preference Optimization aligns models on full sequences even though generation proceeds token by token. The paper posits a token-level Bradley-Terry model in which humans compare next-token actions conditioned on the current prefix. From this assumption it derives a Bregman-divergence density-ratio matching objective that generalizes the logistic DPO loss. The objective preserves the optimal policy induced by the token-level model while retaining the same training simplicity as DPO. Experiments on instruction following, helpfulness, harmlessness, and summarization show gains in alignment quality, stability, and output diversity over both sequence-level and prior token-level baselines.

Core claim

TBPO posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix and derives a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity.
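
To fix ideas, here is a minimal sketch of the posited model in notation of our own (Q, V, s_t, and β are assumed symbols, not the paper's): for a prefix s_t and candidate next tokens a and a',

    % A sketch, not the paper's notation: Q, V, s_t, and beta are assumed.
    \[
      P(a \succ a' \mid s_t) = \sigma\!\big(Q(s_t,a) - Q(s_t,a')\big),
      \qquad \sigma(x) = \frac{1}{1+e^{-x}} .
    \]
    % Under KL regularization toward a reference policy, the token-optimal
    % policy would satisfy
    \[
      \beta \log \frac{\pi^{*}(a \mid s_t)}{\pi_{\mathrm{ref}}(a \mid s_t)}
      = Q(s_t,a) - V(s_t),
    \]
    % so matching these per-prefix density ratios enforces the posited model.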

What carries the argument

Bregman-divergence density-ratio matching objective that enforces the token-level Bradley-Terry model from ordinary sequence-level pairwise comparisons.
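
For reference, the generic Bregman density-ratio matching loss in the framework the keywords point to (density-ratio matching under the Bregman divergence) reads as follows, in our notation and with the generator choice left open as an assumption:

    % f is a convex generator; r* is the true ratio, r_theta the model ratio,
    % and the expectation is under a base distribution q.
    \[
      \mathrm{BR}_f(r^{*} \,\|\, r_\theta)
      = \mathbb{E}_{q}\Big[ f(r^{*}) - f(r_\theta)
        - f'(r_\theta)\,(r^{*} - r_\theta) \Big].
    \]
    % Dropping the theta-independent f(r*) term yields a tractable objective;
    % one choice of f presumably recovers the logistic/DPO loss as claimed.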

Load-bearing premise

Human preferences over full sequences arise from independent next-token comparisons at each prefix and are exactly captured by the posited token-level Bradley-Terry model.
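
In symbols, the premise amounts to a decomposition of roughly this shape (our notation: y^w and y^l are the preferred and dispreferred responses, s_t the prefix at step t, and A a per-token advantage; an illustrative sketch, not the paper's statement):

    \[
      P\big(y^{w} \succ y^{l} \mid x\big)
      = \sigma\!\Big( \textstyle\sum_{t} A\big(s^{w}_{t}, y^{w}_{t}\big)
                    - \sum_{t} A\big(s^{l}_{t}, y^{l}_{t}\big) \Big),
    \]
    % i.e., the sequence comparison is carried entirely by independent
    % per-prefix token comparisons, with no residual cross-token terms.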

What would settle it

An experiment that fits a token-level reward model to the same preference data and then checks whether a TBPO-trained policy achieves strictly higher expected token-level reward than a DPO-trained policy on held-out prefixes.
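
A minimal sketch of that check in Python, assuming the fitted token-level reward model and the two trained policies are available as callables; every name below is a hypothetical stand-in, not the paper's code:

    import random
    from statistics import mean
    from typing import Callable, Sequence

    def expected_token_reward(
        sample_next: Callable[[str], str],          # a ~ pi(. | prefix); stand-in for a trained policy
        token_reward: Callable[[str, str], float],  # fitted token-level reward r(prefix, token)
        prefixes: Sequence[str],
        n_samples: int = 64,
    ) -> float:
        """Monte-Carlo estimate of E_{a~pi}[r(s, a)], averaged over held-out prefixes."""
        return mean(
            mean(token_reward(s, sample_next(s)) for _ in range(n_samples))
            for s in prefixes
        )

    # Toy usage with dummy stand-ins; the real check would plug in the fitted
    # reward model plus the TBPO- and DPO-trained policies and compare
    # expected_token_reward(tbpo_sample, reward, held_out) against
    # expected_token_reward(dpo_sample, reward, held_out) with a paired test.
    policy = lambda s: random.choice(["yes", "no", "maybe"])
    reward = lambda s, a: float(a == "yes")
    print(expected_token_reward(policy, reward, prefixes=["Q: ok?"], n_samples=100))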

Figures

Figures reproduced from arXiv: 2605.12288 by Duy Minh Ho Nguyen, Khoa Doan, Linh Ngo Van, Tien-Phat Nguyen, Trung Le, Truong Nguyen.

Figure 1
Figure 1. MT-Bench pairwise win/tie/lose rates for TBPO-Q (top) and TBPO-A (bottom) against prior preference-optimization baselines, evaluated by two LLM judges. TBPO achieves higher win rates with low loss rates across both judges, and the advantage persists even against the strongest baseline. view at source ↗
Figure 2
Figure 2. LC win rate vs. average response length against the dataset-preferred completion for Llama 3 8B, evaluated by two LLM judges (Llama 3 70B, DeepSeek-V3); error bars are ±1 s.e. over 200 prompts. TBPO leads with shorter outputs, indicating gains beyond verbosity and consistent across judges. view at source ↗
Figure 3
Figure 3. Generation diversity trade-offs: Distinct-1 vs. predictive entropy (higher is better), colored by self-BLEU (lower is better). TBPO achieves the best three-way trade-off across all three metrics. view at source ↗
Figure 4
Figure 4. LC win rate vs. average response length on TL;DR for Llama 3 8B, judged by Llama 3 70B and DeepSeek-V3. TBPO achieves the highest win rate with shorter outputs, suggesting strong OOD generalization despite training only on the UltraFeedback dataset. view at source ↗
read the original abstract

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
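
One plausible rendering of the two instantiations named in the abstract, reusing the sketch notation from above (the exact forms are an assumption; the paper's definitions may differ): with implicit per-token score r_θ(s,a) = β log(π_θ(a|s) / π_ref(a|s)),

    % TBPO-Q: model Q with an explicitly learned lightweight state baseline V_phi:
    \[
      \hat{Q}_\theta(s,a) = r_\theta(s,a) + V_\phi(s).
    \]
    % TBPO-A: remove the baseline by normalizing to an advantage:
    \[
      \hat{A}_\theta(s,a) = r_\theta(s,a)
        - \mathbb{E}_{a' \sim \pi_{\mathrm{ref}}(\cdot \mid s)}\!\big[ r_\theta(s,a') \big],
    \]
    % which needs no learned baseline since the subtracted term depends only on s.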

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix. It derives a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss, with the goal of preserving the optimal policy induced by the token-level model while using only standard sequence-level pairwise comparisons. Two variants are presented: TBPO-Q (with explicit state baseline) and TBPO-A (via advantage normalization). Experiments across instruction following, helpfulness/harmlessness, and summarization benchmarks report gains in alignment quality, training stability, and output diversity relative to sequence-level and token-level baselines.

Significance. If the derivation is shown to hold exactly, TBPO would supply a principled, DPO-simple route to token-level optimality that could improve stability and diversity in aligned models. The Bregman-ratio-matching framing is a clean technical contribution that might extend to other divergences or settings.

major comments (1)
  1. [Derivation section (around the TBPO loss definition)] Derivation of the TBPO objective: the claim that Bregman-divergence density-ratio matching exactly recovers the optimal policy of the posited token-level BT model from sequence-level data needs an explicit argument on two points: (a) that sequence preferences decompose into per-token conditionals with no residual cross-token terms, and (b) that neither marginalization over future tokens nor shifts in the prefix distribution moves the fixed point. The current presentation leaves this step implicit; a concrete fixed-point proof or a counter-example check is needed to substantiate the preservation claim.
minor comments (2)
  1. [Model definition] The token-level BT model P(a ≻ a′ | prefix) should be written as an explicit equation early in the methods to clarify the conditioning and the transition from sequence-level data.
  2. [Experiments] Experimental tables would benefit from reporting standard deviations or statistical significance tests alongside the benchmark improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our work. The major comment on the derivation is well-taken; we agree that an explicit fixed-point argument will strengthen the manuscript and will incorporate it in the revision.

read point-by-point responses
  1. Referee: [Derivation section (around the TBPO loss definition)] Derivation of the TBPO objective: the claim that Bregman-divergence density-ratio matching exactly recovers the optimal policy of the posited token-level BT model from sequence-level data needs an explicit argument on two points: (a) that sequence preferences decompose into per-token conditionals with no residual cross-token terms, and (b) that neither marginalization over future tokens nor shifts in the prefix distribution moves the fixed point. The current presentation leaves this step implicit; a concrete fixed-point proof or a counter-example check is needed to substantiate the preservation claim.

    Authors: We thank the referee for this observation. The TBPO derivation begins from a token-level Bradley-Terry model over next-token actions conditioned on the prefix and shows that the Bregman density-ratio matching objective shares the same optimum as this model when trained on sequence-level pairs. To make the argument fully explicit, we will add a new subsection (Section 3.3 in the revision) containing a fixed-point proof. The proof proceeds by (i) writing the sequence preference probability as the product of per-token conditionals under the token-level BT assumption, (ii) showing that the gradient of the ratio-matching loss with respect to the policy logits telescopes exactly to the per-token log-ratio term without residual cross-token contributions, and (iii) verifying that the fixed point remains invariant under marginalization over future tokens because the advantage normalization (TBPO-A) or explicit baseline (TBPO-Q) cancels any prefix-distribution shift. We will also include a short synthetic MDP counter-example check confirming that the recovered policy matches the token-level optimum. These additions will be placed immediately after the loss definition and will not alter any experimental results. revision: yes
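
For step (ii), the telescoping presumably rests on the standard autoregressive identity (our notation; a sketch of the identity itself, not the paper's proof):

    \[
      \sum_{t} \beta \log
        \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
      = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
    \]
    % per-token log-ratios sum exactly to the sequence log-ratio, so no
    % residual cross-token term survives in the gradient.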

Circularity Check

0 steps flagged

No significant circularity in derivation from posited token-level BT model

full rationale

The paper posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix and derives the Bregman-divergence density-ratio matching objective (TBPO) directly from it, generalizing the logistic/DPO loss while preserving the induced optimal policy. This is a standard forward derivation rather than a reduction of the claimed result to fitted inputs, self-citations, or definitional equivalence. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the provided text; the central claim remains conditional on the modeling choice, but the derivation is self-contained and its empirical support rests on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of a token-level Bradley-Terry model; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Token-level Bradley-Terry preference model over next-token actions conditioned on the prefix
    Explicitly posited to enable the token-level objective and its Bregman derivation.

pith-pipeline@v0.9.0 · 5505 in / 1237 out tokens · 45448 ms · 2026-05-15T05:37:51.027432+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
