John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov
7 Pith papers cite this work.
citing papers explorer
- Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
  PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
- A Differentiable Bayesian Relaxation for Latent Partial-Order Inference
  The authors replace discontinuous precedence and frontier constraints in a partial-order model with smooth surrogates, producing a continuous posterior that supports gradient MCMC and variational inference while recovering the hard model in the limit.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
- Score-Driven Rating System for Sports
  A generalization of Elo ratings updates player strengths via the score (log-likelihood gradient) for varied game outcomes, with derived properties of zero expected value, summation to zero, and reversion to unobserved true skills.
- Analysis of Search Heuristics in the Multi-Armed Bandit Setting
  In the dueling bandit setting, the (1+1) EA selects the Condorcet winner with only constant probability when its advantage is Ω(1/n), while a Max-Min Ant System EDA selects it with probability 1-Θ(p), and repeated duels improve the EA's performance.
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
  Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
  POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.