mega hub Mixed citations

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. Most common role is background (52%).

1493 Pith papers citing it

Background 52% of classified citations

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
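The abstract does not spell out the "surrogate" objective. As a minimal sketch, the clipped variant described in the PPO paper (recalled here, not taken from this page) can be written in plain Python. The function name `clipped_surrogate`, the toy batch, and the clip range `eps=0.2` are illustrative assumptions.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # keep the ratio near 1
        total += min(ratio * adv, clipped * adv)         # pessimistic (lower) bound
    return total / len(advantages)

# Hypothetical toy batch: log-probabilities and advantage estimates are made up.
old_logp = [math.log(0.5), math.log(0.5)]
new_logp = [math.log(0.75), math.log(0.25)]
advantages = [1.0, -1.0]
objective = clipped_surrogate(new_logp, old_logp, advantages)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to push the probability ratio far from 1, which is what makes several epochs of minibatch updates on the same sampled data reasonable.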

citation-role summary

background 155 · method 113 · baseline 15 · dataset 4

citation-polarity summary

background 150 · use method 109 · baseline 15 · unclear 7 · use dataset 4 · support 2
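The 52% background share quoted near the top of the page appears consistent with these polarity counts (150 of 287), although the page does not say which tally it is computed from; the role counts above would give roughly 54%. A small sketch of the arithmetic, using only the counts listed here:

```python
# Counts copied from the citation-polarity summary on this page.
counts = {
    "background": 150,
    "use method": 109,
    "baseline": 15,
    "unclear": 7,
    "use dataset": 4,
    "support": 2,
}

total = sum(counts.values())

# Percentage share of each label, rounded to the nearest whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}
```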

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

cs.RO · 2024-03-14 · accept · novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

STEMGym: Benchmarking Sequential Decision-Making under Dose Budgets in Autonomous Electron Microscopy

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

STEMGym benchmark demonstrates that perception pipelines dominate dose efficiency in autonomous STEM over navigation methods across 33 agent setups.

citing papers explorer

Showing 50 of 1493 citing papers.

MMSearch-R1: Incentivizing LMMs to Search · cs.CV · 2025-06-25 · unverdicted · none · ref 53 · internal anchor
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning · cs.RO · 2025-06-18 · unverdicted · none · ref 60 · internal anchor
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
Towards AI-assisted Neutrino Flavor Theory Design · hep-ph · 2025-06-09 · unverdicted · none · ref 35 · internal anchor
AMBer applies reinforcement learning with physics feedback to automate construction of neutrino flavor models that minimize free parameters, validated on known cases and extended to a new symmetry group.
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation · cs.LG · 2025-06-01 · unverdicted · none · ref 83 · internal anchor
Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning · cs.CV · 2025-05-30 · conditional · none · ref 35 · internal anchor
Reason-SVG adds a Drawing-with-Thought reasoning stage and GRPO-based reinforcement learning with a hybrid reward to improve LLM and VLM performance on accurate SVG generation.
CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward · cs.GR · 2025-05-26 · unverdicted · none · ref 23 · internal anchor
CAD-Coder generates valid CadQuery scripts from text via supervised fine-tuning followed by reinforcement learning with geometric Chamfer Distance rewards and chain-of-thought planning.
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs · cs.AI · 2025-05-25 · unverdicted · none · ref 33 · internal anchor
UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment · cs.CV · 2025-05-24 · unverdicted · none · ref 33 · internal anchor
Chain-of-Zoom factorizes extreme super-resolution into an autoregressive sequence of intermediate scales using a reused backbone model plus GRPO-tuned multi-scale VLM prompts.
Group-in-Group Policy Optimization for LLM Agent Training · cs.LG · 2025-05-16 · unverdicted · none · ref 42 · internal anchor
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example · cs.LG · 2025-04-29 · accept · none · ref 7 · internal anchor
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
An Information-Geometric Approach to Artificial Curiosity · cs.LG · 2025-04-08 · unverdicted · none · ref 13 · internal anchor
Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization · cs.AI · 2025-03-17 · conditional · none · ref 33 · internal anchor
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning · cs.CV · 2025-03-10 · unverdicted · none · ref 32 · internal anchor
AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.
Simultaneous Multi-die Floorplanning and Technology Assignment · eess.SY · 2025-02-15 · unverdicted · none · ref 15 · internal anchor
A joint optimization framework for multi-die floorplanning and technology assignment that uses ML-based PPA estimation to optimize area, wirelength, performance, power, and cost, outperforming greedy baselines in 2.5D and 3D ICs.
EARL-BO: Reinforcement Learning for Multi-Step Lookahead, High-Dimensional Bayesian Optimization · cs.LG · 2024-10-31 · unverdicted · none · ref 24 · internal anchor
EARL-BO uses RL with an Attention-DeepSets encoder and end-to-end on-policy multi-task fine-tuning to approximate near-optimal multi-step lookahead policies for high-dimensional black-box optimization.
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey · cs.IR · 2024-09-16 · unverdicted · none · ref 16 · internal anchor
Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gaps between LLMs.
Diffusion Models Are Real-Time Game Engines · cs.LG · 2024-08-27 · conditional · none · ref 87 · internal anchor
A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models · cs.AI · 2024-06-14 · conditional · none · ref 31 · internal anchor
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
KTO: Model Alignment as Prospect Theoretic Optimization · cs.LG · 2024-02-02 · conditional · none · ref 16 · internal anchor
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Self-Rewarding Language Models · cs.CL · 2024-01-18 · conditional · none · ref 114 · internal anchor
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation · cs.CL · 2023-10-10 · conditional · none · ref 17 · internal anchor
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
Learning Interactive Real-World Simulators · cs.AI · 2023-10-09 · conditional · none · ref 123 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Variational Sequential Optimal Experimental Design using Reinforcement Learning · stat.ML · 2023-06-17 · unverdicted · none · ref 66 · internal anchor
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
Voyager: An Open-Ended Embodied Agent with Large Language Models · cs.AI · 2023-05-25 · unverdicted · none · ref 48 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills.
Measurement-Induced Phase Transitions in Informational Active Matter · cond-mat.soft · 2023-02-14 · unverdicted · none · ref 72 · internal anchor
A many-body Maxwell demon model of adaptive particles shows measurement-induced phase transitions to informational flocking, with order bounded by measured information and an informational activity that compresses phase space without work.
Mastering Diverse Domains through World Models · cs.AI · 2023-01-10 · unverdicted · none · ref 5 · internal anchor
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Mastering Atari with Discrete World Models · cs.LG · 2020-10-05 · accept · none · ref 43 · internal anchor
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
Learning to summarize from human feedback · cs.CL · 2020-09-02 · conditional · none · ref 58 · internal anchor
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
Dota 2 with Large Scale Deep Reinforcement Learning · cs.LG · 2019-12-13 · accept · none · ref 15 · internal anchor
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Dream to Control: Learning Behaviors by Latent Imagination · cs.LG · 2019-12-03 · accept · none · ref 43 · internal anchor
Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Solving Rubik's Cube with a Robot Hand · cs.LG · 2019-10-16 · accept · none · ref 98 · internal anchor
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Fine-Tuning Language Models from Human Preferences · cs.CL · 2019-09-18 · unverdicted · none · ref 24 · internal anchor
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Benchmarking Model-Based Reinforcement Learning · cs.LG · 2019-07-03 · accept · none · ref 44 · internal anchor
Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termination dilemma.
Exploring Model-based Planning with Policy Networks · cs.LG · 2019-06-20 · unverdicted · none · ref 34 · internal anchor
POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor · cs.LG · 2018-01-04 · accept · none · ref 27 · internal anchor
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
Searching for Activation Functions · cs.NE · 2017-10-16 · conditional · none · ref 16 · internal anchor
Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning · cs.CL · 2026-06-29 · unverdicted · none · ref 40 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding · cs.CV · 2026-06-29 · unverdicted · none · ref 141 · internal anchor
InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.
Neural Subspace Reallocation: Continual Learning as Retrieval-Based Subspace Memory Management · cs.LG · 2026-06-29 · unverdicted · none · ref 17 · internal anchor
NSR reframes continual learning as retrieval-based subspace memory management with SVD compression and similarity retrieval from a TaskKnowledgeBank, showing that the memory mechanism itself drives performance gains over learned allocation policies on cyclic and heterogeneous benchmarks.
RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning · cs.LG · 2026-06-29 · conditional · none · ref 26 · internal anchor
RoAd-RL is a new benchmarking library for adversarial reinforcement learning that evaluates DQN, PPO, and SAC agents across 192 attack-defense configurations and finds substantial robustness variations plus cases where defenses harm performance more than attacks.
PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF · cs.LG · 2026-06-29 · unverdicted · none · ref 32 · internal anchor
PS-PPO samples prefixes of trajectories in critic-free RLHF and uses importance-weighted updates to reduce compute and memory while claiming to preserve the full-trajectory objective.
AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance · cs.RO · 2026-06-28 · unverdicted · none · ref 38 · internal anchor
AnyBody distills a privileged teacher tracker into a latent unit-sphere representation and uses a masked transformer to drive humanoid control from arbitrary keypoint subsets.
Fairness Attacks on Recommender Systems · cs.IR · 2026-06-27 · unverdicted · none · ref 43 · internal anchor
A structure-aware RL fairness attack with joint item and gender selection policies is introduced and shown effective on four recommender models across two datasets.
TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL · cs.CV · 2026-06-26 · unverdicted · none · ref 18 · internal anchor
TempAct applies hierarchical planner-executor RL with group exploration and multi-level rewards to improve temporal consistency in autoregressive video models.
ToxiREX: A Dataset on Toxic REasoning in ConteXt · cs.CL · 2026-06-26 · unverdicted · none · ref 229 · internal anchor
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents · cs.AI · 2026-06-26 · unverdicted · none · ref 6 · internal anchor
ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.
NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning · cs.LG · 2026-06-26 · unverdicted · none · ref 23 · internal anchor
NormGuard adds a training-time hinge penalty on velocity norm inflation in flow-matching RL to improve MLLM-judged image quality and forensic realism while preserving reward across multiple setups.
SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing · cs.CV · 2026-06-25 · unverdicted · none · ref 37 · internal anchor
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
Finding the Time to Think: Learning Planning Budgets in Real-Time RL · cs.LG · 2026-06-24 · unverdicted · none · ref 23 · internal anchor
Trains a gating policy to select state-dependent planning budgets in variable-delay real-time RL, outperforming fixed-budget and heuristic baselines across Pac-Man, Tetris, Snake, Speed Hex, and Speed Go.
Multi-Agent Goal Recognition with Team- and Goal-Conditioned Reinforcement Learning and Factorized Branch-and-Bound · cs.MA · 2026-06-24 · unverdicted · none · ref 19 · internal anchor
MAGR-BB matches exhaustive search accuracy on multi-agent Blocksworld while reducing hypothesis evaluations by orders of magnitude via RL scoring inside factorized branch-and-bound.