Recognition: 2 Lean theorem links
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Pith reviewed 2026-05-15 05:37 UTC · model grok-4.3
The pith
Token-level Bregman Preference Optimization matches density ratios at each token to recover per-prefix optimality from sequence-level preference data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TBPO posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix and derives a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity.
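The posited model and the loss it generalizes can be sketched explicitly. Notation here is assumed for illustration (the paper's exact symbols may differ): s is a prefix, a and a′ are candidate next tokens, Q is the token-level value, and the DPO loss is written in its standard form.

```latex
% Token-level Bradley--Terry model over next-token actions at prefix s
% (notation assumed, not the paper's exact symbols):
P(a \succ a' \mid s) = \sigma\bigl(Q(s,a) - Q(s,a')\bigr)
% Induced per-prefix optimal policy, an exponential tilting of the reference:
\pi^{*}(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\,\exp\bigl(Q(s,a)/\beta\bigr)
% Standard sequence-level logistic/DPO loss, which TBPO recovers as one
% member of the Bregman ratio-matching family:
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\Bigl(\beta \log \tfrac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \tfrac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Bigr)
```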
What carries the argument
Bregman-divergence density-ratio matching objective that enforces the token-level Bradley-Terry model from ordinary sequence-level pairwise comparisons.
Load-bearing premise
Human preferences over full sequences arise from independent next-token comparisons at each prefix and are exactly captured by the posited token-level Bradley-Terry model.
What would settle it
An experiment that fits a token-level reward model to the same preference data and then checks whether a TBPO-trained policy achieves strictly higher expected token-level reward than a DPO-trained policy on held-out prefixes.
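That settling experiment can be sketched as a small evaluation harness: score each trained policy by its expected token-level reward on held-out prefixes. Everything below is a toy stand-in (the reward table, the policy logits, and the prefix names are illustrative, not the paper's artifacts).

```python
# Hedged sketch of the settling experiment: given a fitted token-level
# reward r(s, a), compare E_{a ~ pi(.|s)} r(s, a), averaged over
# held-out prefixes, for a TBPO-trained vs a DPO-trained policy.
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy token-level reward r(s, a) over 3 actions for 2 held-out prefixes.
token_reward = {
    "prefix_0": [1.0, 0.0, -1.0],
    "prefix_1": [0.5, 1.5, 0.0],
}

def expected_token_reward(policy_logits, rewards):
    """Average expected token-level reward over held-out prefixes."""
    total = 0.0
    for s, r in rewards.items():
        probs = softmax(policy_logits[s])
        total += sum(p * ri for p, ri in zip(probs, r))
    return total / len(rewards)

# Stand-ins for the two trained policies (per-prefix next-token logits).
tbpo_policy = {"prefix_0": [2.0, 0.0, -2.0], "prefix_1": [0.0, 2.0, -1.0]}
dpo_policy = {"prefix_0": [1.0, 0.5, 0.0], "prefix_1": [0.5, 0.5, 0.5]}

tbpo_score = expected_token_reward(tbpo_policy, token_reward)
dpo_score = expected_token_reward(dpo_policy, token_reward)
print(f"TBPO {tbpo_score:.3f} vs DPO {dpo_score:.3f}")
```

In the real experiment the reward table would be a fitted token-level reward model and the logits would come from the trained policies; the comparison logic is the same.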
Original abstract
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix. It derives a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss, with the goal of preserving the optimal policy induced by the token-level model while using only standard sequence-level pairwise comparisons. Two variants are presented: TBPO-Q (with explicit state baseline) and TBPO-A (via advantage normalization). Experiments across instruction following, helpfulness/harmlessness, and summarization benchmarks report gains in alignment quality, training stability, and output diversity relative to sequence-level and token-level baselines.
Significance. If the derivation is shown to hold exactly, TBPO would supply a principled, DPO-simple route to token-level optimality that could improve stability and diversity in aligned models. The Bregman-ratio-matching framing is a clean technical contribution that might extend to other divergences or settings.
major comments (1)
- [Derivation section (around the TBPO loss definition)] Derivation of the TBPO objective: the claim that Bregman-divergence density-ratio matching recovers exactly the optimal policy of the posited token-level BT model from sequence-level data requires an explicit argument on two points: that sequence preferences decompose into per-token conditionals without residual cross-token terms, and that neither marginalization over future tokens nor shifts in the prefix distribution move the fixed point. The current presentation leaves this step implicit; a concrete fixed-point proof or counter-example check is needed to substantiate the preservation claim.
minor comments (2)
- [Model definition] The token-level BT model P(a ≻ a′ | prefix) should be written as an explicit equation early in the methods to clarify the conditioning and the transition from sequence-level data.
- [Experiments] Experimental tables would benefit from reporting standard deviations or statistical significance tests alongside the benchmark improvements.
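The statistical check the second minor comment asks for is typically a paired bootstrap on per-example scores. A minimal sketch, with synthetic scores standing in for the per-example outputs of the paper's evaluation harness:

```python
# Paired bootstrap significance test on per-example benchmark scores.
# The score arrays below are synthetic placeholders, not results from
# the paper.
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=2000, seed=0):
    """One-sided p-value for the hypothesis mean(a) > mean(b)."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    fails = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            fails += 1
    return fails / n_resamples

# Synthetic per-example scores for two systems on 100 examples.
gen = random.Random(42)
sys_a = [gen.gauss(0.62, 0.1) for _ in range(100)]
sys_b = [gen.gauss(0.55, 0.1) for _ in range(100)]
p = paired_bootstrap_pvalue(sys_a, sys_b)
print(f"one-sided bootstrap p = {p:.3f}")
```

Reporting this p-value (or a standard deviation over seeds) alongside each benchmark delta would address the comment directly.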
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our work. The major comment on the derivation is well-taken; we agree that an explicit fixed-point argument will strengthen the manuscript and will incorporate it in the revision.
Point-by-point responses
Referee: [Derivation section (around the TBPO loss definition)] Derivation of the TBPO objective: the claim that Bregman-divergence density-ratio matching recovers exactly the optimal policy of the posited token-level BT model from sequence-level data requires an explicit argument on two points: that sequence preferences decompose into per-token conditionals without residual cross-token terms, and that neither marginalization over future tokens nor shifts in the prefix distribution move the fixed point. The current presentation leaves this step implicit; a concrete fixed-point proof or counter-example check is needed to substantiate the preservation claim.
Authors: We thank the referee for this observation. The TBPO derivation begins from a token-level Bradley-Terry model over next-token actions conditioned on the prefix and shows that the Bregman density-ratio matching objective shares the same optimum as this model when trained on sequence-level pairs. To make the argument fully explicit, we will add a new subsection (Section 3.3 in the revision) containing a fixed-point proof. The proof proceeds by (i) writing the sequence preference probability as the product of per-token conditionals under the token-level BT assumption, (ii) showing that the gradient of the ratio-matching loss with respect to the policy logits telescopes exactly to the per-token log-ratio term without residual cross-token contributions, and (iii) verifying that the fixed point remains invariant under marginalization over future tokens because the advantage normalization (TBPO-A) or explicit baseline (TBPO-Q) cancels any prefix-distribution shift. We will also include a short synthetic MDP counter-example check confirming that the recovered policy matches the token-level optimum. These additions will be placed immediately after the loss definition and will not alter any experimental results.
revision: yes
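The telescoping step the rebuttal appeals to rests on the chain-rule factorization of autoregressive policies, which is worth writing out (notation assumed: y = (a_1, ..., a_T) with prefixes s_t = (x, a_{<t})):

```latex
% Chain-rule factorization of the sequence-level log-ratio into
% per-token log-ratios, with no residual cross-token terms:
\log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  = \sum_{t=1}^{T} \log \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
```

This identity alone gives the per-token decomposition; the remaining burden of the promised proof is combining it with the token-level BT assumption so that the fixed point survives marginalization over future tokens.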
Circularity Check
No significant circularity in derivation from posited token-level BT model
Full rationale
The paper posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix and derives the Bregman-divergence density-ratio matching objective (TBPO) directly from it, generalizing the logistic/DPO loss while preserving the induced optimal policy. This is a standard forward derivation rather than a reduction of the claimed result to fitted inputs, self-citations, or definitional equivalence. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the provided text; the central claim remains conditional on the modeling choice but is mathematically self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: token-level Bradley-Terry preference model over next-token actions conditioned on the prefix
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model"
- IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "framing preference learning as likelihood-ratio estimation under Bregman divergences, connecting DPO to classical density-ratio estimators"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- echoes: The paper passage shares the mathematical shape or conceptual pattern of the theorem, without being a direct formal dependency.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.