John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

1975 · DOI 10.2307/2346567

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · method 1

citation-polarity summary

background 1 · use method 1

representative citing papers

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.

A Differentiable Bayesian Relaxation for Latent Partial-Order Inference

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

The authors replace discontinuous precedence and frontier constraints in a partial-order model with smooth surrogates, producing a continuous posterior that supports gradient MCMC and variational inference while recovering the hard model in the limit.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

cs.LG · 2023-05-29 · accept · novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

Score-Driven Rating System for Sports

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

A generalization of Elo ratings updates player strengths via the score (log-likelihood gradient) for varied game outcomes, with derived properties of zero expected value, summation to zero, and reversion to unobserved true skills.

Analysis of Search Heuristics in the Multi-Armed Bandit Setting

cs.NE · 2026-04-09 · unverdicted · novelty 6.0

In the dueling bandit setting, the (1+1) EA selects the Condorcet winner with only constant probability when its advantage is Ω(1/n), while a Max-Min Ant System EDA selects it with probability 1-Θ(p), and repeated duels improve the EA's performance.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

citing papers explorer

Showing 7 of 7 citing papers.

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

cs.LG · 2026-05-15 · unverdicted · none · ref 22

A Differentiable Bayesian Relaxation for Latent Partial-Order Inference

stat.ML · 2026-05-07 · unverdicted · none · ref 32

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

cs.LG · 2023-05-29 · accept · none · ref 34

Score-Driven Rating System for Sports

cs.LG · 2026-04-10 · unverdicted · none · ref 17

Analysis of Search Heuristics in the Multi-Armed Bandit Setting

cs.NE · 2026-04-09 · unverdicted · none · ref 24

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · none · ref 113

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · none · ref 31
