John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · method 1

representative citing papers

A Differentiable Bayesian Relaxation for Latent Partial-Order Inference

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

The authors replace discontinuous precedence and frontier constraints in a partial-order model with smooth surrogates, producing a continuous posterior that supports gradient MCMC and variational inference while recovering the hard model in the limit.

Score-Driven Rating System for Sports

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

A generalization of Elo ratings updates player strengths via the score (log-likelihood gradient) for varied game outcomes, with derived properties of zero expected value, summation to zero, and reversion to unobserved true skills.

Analysis of Search Heuristics in the Multi-Armed Bandit Setting

cs.NE · 2026-04-09 · unverdicted · novelty 6.0

In the dueling bandit setting, the (1+1) EA selects the Condorcet winner with only constant probability when its advantage is Ω(1/n), while a Max-Min Ant System EDA selects it with probability 1-Θ(p), and repeated duels improve the EA's performance.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

citing papers explorer

  • Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking · cs.LG · 2026-05-15 · unverdicted · none · ref 22

    PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.

  • A Differentiable Bayesian Relaxation for Latent Partial-Order Inference · stat.ML · 2026-05-07 · unverdicted · none · ref 32

    The authors replace discontinuous precedence and frontier constraints in a partial-order model with smooth surrogates, producing a continuous posterior that supports gradient MCMC and variational inference while recovering the hard model in the limit.

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model · cs.LG · 2023-05-29 · accept · none · ref 34

    DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

  • Score-Driven Rating System for Sports · cs.LG · 2026-04-10 · unverdicted · none · ref 17

    A generalization of Elo ratings updates player strengths via the score (log-likelihood gradient) for varied game outcomes, with derived properties of zero expected value, summation to zero, and reversion to unobserved true skills.

  • Analysis of Search Heuristics in the Multi-Armed Bandit Setting · cs.NE · 2026-04-09 · unverdicted · none · ref 24

    In the dueling bandit setting, the (1+1) EA selects the Condorcet winner with only constant probability when its advantage is Ω(1/n), while a Max-Min Ant System EDA selects it with probability 1-Θ(p), and repeated duels improve the EA's performance.

  • Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents · cs.AI · 2024-08-13 · unverdicted · none · ref 113

    Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

  • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning · cs.LG · 2024-02-18 · unverdicted · none · ref 31

    POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
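
Several of the citing papers above (the DPO paper, Agent Q, POVID) rely on the DPO objective, which the summary describes as a binary classification loss over preference pairs. As a rough illustration only, the per-pair loss can be sketched as below. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example and are not taken from any of the listed papers' code; the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference policy.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    Hypothetical helper for illustration; all names and the default beta
    are assumptions, not an API from any of the cited works.
    """
    # Implicit reward margins: how much more the policy likes each
    # response than the reference policy does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)

    # Numerically stable evaluation of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy shifts probability toward the chosen response: the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this sketch the loss only depends on log-probabilities that are already available for a fixed preference dataset, which is why no sampling or separate reward model appears in it.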