Mixed citations
Wiley Series in Probability and Statistics, Wiley (1994)
Mixed citation behavior. Most common role is background (60%).
citing papers explorer
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
- Fast Computation of Conditional Probabilities in MDPs and Markov Chain Families
A new efficient algorithm computes optimal conditional reachability probabilities in MDPs without creating hard cyclic reductions, achieving linear time on acyclic cases and substantial speedups on benchmarks from Bayesian networks, probabilistic programs, and runtime monitoring.
- Multi-Environment POMDPs with Finite-Horizon Objectives
The optimal value and policy computation problem for finite-horizon objectives in multi-environment POMDPs is PSPACE-complete, and a new algorithm solves it more efficiently than previous methods on classical benchmarks.
- Probabilistic Hazard Analysis Framework with Stochastic Optimal Control for Deteriorating Civil Infrastructure Systems
A life-cycle optimization framework for deteriorating infrastructure under hazards is formulated as an MDP with a Kronecker-factored tensor method that reduces computational complexity from exponential to linear while preserving exact dynamic programming solutions.
- Scaling Observation-aware Planning in Uncertain Domains
A POMDP decomposition method scales solving of the Sensor Selection Problem and Positional Observability Problem by 3 and 5 orders of magnitude in instance size and runtime.
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
- The hidden risks of temporal resampling in clinical reinforcement learning
Resampling clinical time series into uniform bins for offline RL reduces performance by up to 60% and causes retrospective evaluations to overestimate returns by 1.5-3x versus unprocessed data.
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.
- Benchmark Data Contamination of Large Language Models: A Survey
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
- Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers
A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.
- Optimal strategies in the all-heads coin game