pith. sign in

mega hub Mixed citations

Proximal Policy Optimization Algorithms

Mixed citation behavior. Most common role is background (52%).

1657 Pith papers citing it
Background 52% of classified citations
abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

hub tools

citation-role summary

background 155 method 113 baseline 15 dataset 4

citation-polarity summary

claims ledger

  • abstract We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more ge

authors

mega hub controls

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

clear filters

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

citing papers explorer

Showing 26 of 26 citing papers after filters.

  • Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation cs.CL · 2023-10-10 · conditional · none · ref 17 · internal anchor

    Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

  • Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 123 · internal anchor

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  • Variational Sequential Optimal Experimental Design using Reinforcement Learning stat.ML · 2023-06-17 · unverdicted · none · ref 66 · internal anchor

    vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.

  • Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 48 · internal anchor

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能

  • Measurement-Induced Phase Transitions in Informational Active Matter cond-mat.soft · 2023-02-14 · unverdicted · none · ref 72 · internal anchor

    A many-body Maxwell demon model of adaptive particles shows measurement-induced phase transitions to informational flocking, with order bounded by measured information and an informational activity that compresses phase space without work.

  • Mastering Diverse Domains through World Models cs.AI · 2023-01-10 · unverdicted · none · ref 5 · internal anchor

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations cs.AI · 2023-12-14 · conditional · none · ref 49 · internal anchor

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  • Scalable and Efficient Continual Learning from Demonstration via a Hypernetwork-generated Stable Dynamics Model cs.RO · 2023-11-06 · unverdicted · none · ref 48 · internal anchor

    A hypernetwork generates clock-augmented stable neural ODEs (sNODEs) for scalable continual learning from demonstration, achieving O(N) training time via stochastic regularization while outperforming baselines on LfD tasks up to 26 skills and 32 dimensions.

  • Zephyr: Direct Distillation of LM Alignment cs.LG · 2023-10-25 · accept · none · ref 1 · internal anchor

    Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.

  • Policy Gradient with Kernel Quadrature cs.LG · 2023-10-23 · unverdicted · none · ref 7 · internal anchor

    Episodic kernel quadrature compresses batches of episodes via GP-modeled returns to enable efficient policy gradient updates without evaluating rewards on every sample.

  • Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection cs.CL · 2023-10-17 · unverdicted · none · ref 159 · internal anchor

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  • Aligning Large Multimodal Models with Factually Augmented RLHF cs.CV · 2023-09-25 · conditional · none · ref 28 · internal anchor

    Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

  • MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 62 · internal anchor

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  • RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback cs.CL · 2023-09-01 · conditional · none · ref 97 · internal anchor

    RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

  • A Survey on Large Language Model based Autonomous Agents cs.AI · 2023-08-22 · accept · none · ref 3 · internal anchor

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.

  • Reinforced Self-Training (ReST) for Language Modeling cs.CL · 2023-08-17 · unverdicted · none · ref 21 · internal anchor

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  • MiniLLM: On-Policy Distillation of Large Language Models cs.CL · 2023-06-14 · conditional · none · ref 20 · internal anchor

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations cs.CL · 2023-05-23 · conditional · none · ref 254 · internal anchor

    UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

  • Training Diffusion Models with Reinforcement Learning cs.LG · 2023-05-22 · unverdicted · none · ref 26 · internal anchor

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  • Aligning Text-to-Image Models using Human Feedback cs.LG · 2023-02-23 · unverdicted · none · ref 20 · internal anchor

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  • A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 283 · internal anchor

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  • Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment cs.AI · 2023-08-10 · accept · none · ref 38 · internal anchor

    Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment cs.LG · 2023-04-13 · unverdicted · none · ref 95 · internal anchor

    RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

  • A Multi-Agent Reinforcement Learning Framework for Public Health Decision Analysis cs.AI · 2023-11-01 · unverdicted · none · ref 7 · internal anchor

    MARL framework for jurisdiction-specific HIV intervention allocation accounting for cross-jurisdictional interactions outperforms single-agent RL in CA/FL simulations under fixed budgets.

  • Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 58 · internal anchor

    Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 130 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.