Mixed citations

Title resolution pending

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

Mixed citation behavior. Most common role is background (40%).

44 Pith papers citing it

Background 40% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries the DOI and OpenAlex lookups on its next pass.

citation-role summary

background: 2 · method: 2 · baseline: 1

citation-polarity summary

background: 2 · use method: 2 · baseline: 1

representative citing papers

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency

quant-ph · 2026-05-12 · unverdicted · novelty 7.0

TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

AdaTTA is an actor-critic RL framework that selects sequence-specific test-time augmentations and improves recommendation metrics by up to 26% over fixed augmentation strategies on four datasets.

S-GRPO: Unified Post-Training for Large Vision-Language Models

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation

cs.NI · 2026-04-09 · unverdicted · novelty 7.0

MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.

An Iterative Test-and-Repair Framework for Competitive Code Generation

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

cs.SE · 2025-08-22 · unverdicted · novelty 7.0

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

cs.LG · 2026-05-18 · conditional · novelty 6.0

A maximum entropy reinforcement learning framework generates realistic customer trajectories in retail spaces that match real data better than TSP or PNN heuristics and support more accurate layout optimization decisions.

Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.

Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning

cs.SE · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Reinforcement learning on MIR features combined with cargo-fuzz validation reduces false positives in Rust static memory safety analysis, raising precision from 25.6% to 59.0% and accuracy to 65.2%.

SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.

State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.

Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields

cs.AI · 2026-04-28 · unverdicted · novelty 6.0

Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.

GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

GeoMind applies an agentic workflow with tool-augmented modules and process supervision to outperform static models on lithology classification from well logs while producing traceable decisions.

Mitigating Multimodal Hallucination via Phase-wise Self-reward

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

TOPCELL: Topology Optimization of Standard Cell via LLMs

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.

Adaptive Bounded-Rationality Modeling of Early-Stage Takeover in Shared-Control Driving

cs.HC · 2026-04-12 · unverdicted · novelty 6.0

The adaptive bounded-rationality model anticipates hazardous takeovers with better coverage and lead time than baselines while aligning inferred parameters with eye-tracking metrics.

AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing

cs.GT · 2026-04-07 · unverdicted · novelty 6.0

JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.

citing papers explorer

Showing 44 of 44 citing papers.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees cs.CV · 2026-04-17 · unverdicted · none · ref 39
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency quant-ph · 2026-05-12 · unverdicted · none · ref 63
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation cs.CV · 2026-04-21 · unverdicted · none · ref 35
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation cs.IR · 2026-04-17 · unverdicted · none · ref 25
AdaTTA is an actor-critic RL framework that selects sequence-specific test-time augmentations and improves recommendation metrics by up to 26% over fixed augmentation strategies on four datasets.
S-GRPO: Unified Post-Training for Large Vision-Language Models cs.LG · 2026-04-17 · unverdicted · none · ref 38
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees cs.LG · 2026-04-17 · unverdicted · none · ref 20
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning cs.LG · 2026-04-15 · unverdicted · none · ref 25
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 32
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation cs.NI · 2026-04-09 · unverdicted · none · ref 25
MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.
An Iterative Test-and-Repair Framework for Competitive Code Generation cs.SE · 2026-04-07 · unverdicted · none · ref 43
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention cs.SE · 2025-08-22 · unverdicted · none · ref 45
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models cs.AI · 2026-05-19 · unverdicted · none · ref 29
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights cs.LG · 2026-05-18 · conditional · none · ref 40
A maximum entropy reinforcement learning framework generates realistic customer trajectories in retail spaces that match real data better than TSP or PNN heuristics and support more accurate layout optimization decisions.
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models cs.LG · 2026-05-06 · unverdicted · none · ref 27
Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning cs.SE · 2026-05-05 · unverdicted · none · ref 52 · 2 links
Reinforcement learning on MIR features combined with cargo-fuzz validation reduces false positives in Rust static memory safety analysis, raising precision from 25.6% to 59.0% and accuracy to 65.2%.
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems cs.AI · 2026-05-05 · unverdicted · none · ref 26
SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading cs.CV · 2026-04-29 · unverdicted · none · ref 33
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields cs.AI · 2026-04-28 · unverdicted · none · ref 65
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation cs.AI · 2026-04-23 · unverdicted · none · ref 33
GeoMind applies an agentic workflow with tool-augmented modules and process supervision to outperform static models on lithology classification from well logs while producing traceable decisions.
Mitigating Multimodal Hallucination via Phase-wise Self-reward cs.CV · 2026-04-20 · unverdicted · none · ref 40
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
TOPCELL: Topology Optimization of Standard Cell via LLMs cs.LG · 2026-04-15 · unverdicted · none · ref 31
TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.
Adaptive Bounded-Rationality Modeling of Early-Stage Takeover in Shared-Control Driving cs.HC · 2026-04-12 · unverdicted · none · ref 43
The adaptive bounded-rationality model anticipates hazardous takeovers with better coverage and lead time than baselines while aligning inferred parameters with eye-tracking metrics.
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning cs.CV · 2026-04-09 · unverdicted · none · ref 23
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing cs.GT · 2026-04-07 · unverdicted · none · ref 34
JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.
Optimizing Chlorination in Water Distribution Systems via Surrogate-assisted Neuroevolution cs.NE · 2026-02-07 · unverdicted · none · ref 29
Surrogate-assisted neuroevolution produces Pareto-optimal chlorine dosing policies for water distribution systems that outperform PPO on four practical objectives.
Adaptive Prompt Elicitation for Text-to-Image Generation cs.HC · 2026-02-04 · unverdicted · none · ref 77
Adaptive Prompt Elicitation (APE) uses an information-theoretic framework to generate visual queries that elicit and compile user intent into better prompts for text-to-image models, showing improved alignment in benchmarks and a user study.
Learning to Trust: Dynamic Utilization of Retrieval-Augmented Generation for E-commerce Search Relevance cs.IR · 2025-10-13 · unverdicted · none · ref 16
DyKnow-RAG uses Group Relative Policy Optimization with dual-group rollouts and posterior-driven advantage scaling to optimize context utilization in RAG for e-commerce relevance, showing offline gains and production lifts when deployed at Taobao.
Joint Optimization of Handoff and Video Rate in LEO Satellite Networks cs.NI · 2025-04-06 · unverdicted · none · ref 25
The paper introduces MPC and RL algorithms for joint satellite handoff and video bitrate selection to optimize QoE in LEO networks, validated in trace-driven simulations and testbed experiments.
Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL cs.IR · 2026-05-14 · unverdicted · none · ref 15
CQ-SID semantic IDs and EG-GRPO RL improve generative retrieval hit rates up to 26.76% over RQ-VAE baselines and deliver +1.15% GMV in live e-commerce A/B tests.
Rethinking Priority Scheduling for Sequential Multi-Agent Decision Making in Stackelberg Games cs.MA · 2026-05-08 · unverdicted · none · ref 9
HPA dynamically selects agent decision orders in Stackelberg games to improve equilibria and performance in multi-agent MuJoCo control tasks.
CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning cs.LG · 2026-04-26 · unverdicted · none · ref 12
CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement learning.
ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement cs.RO · 2026-04-22 · unverdicted · none · ref 25
ALAS disentangles environment and self-state streams via bio-inspired modules to deliver 23% higher subtask success and 29% better execution efficiency on long-horizon HSI tasks.
RAMP: Hybrid DRL for Online Learning of Numeric Action Models cs.AI · 2026-04-09 · unverdicted · none · ref 35
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor cs.DB · 2026-03-10 · unverdicted · none · ref 58
LLMs can outperform DTA on index recommendations for some workloads but remain less reliable with practical adoption challenges.
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation cs.IR · 2025-11-24 · unverdicted · none · ref 44
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance cs.AI · 2025-10-09 · unverdicted · none · ref 15
SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
Search-R3: Unifying Reasoning and Embedding in Large Language Models cs.CL · 2025-10-08 · unverdicted · none · ref 64
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning cs.CL · 2025-05-20 · unverdicted · none · ref 63
Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.
Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution q-fin.TR · 2024-10-19 · conditional · none · ref 21
HRT is a bi-level RL framework with a sparse high-level controller for asset direction selection from signals and a risk-aware low-level controller for weight adjustments, reporting Sharpe 1.24 and turnover 0.090 on 2020-2023 Nasdaq data.
Trust the AI, Doubt Yourself: The Effect of Urgency on Self-Confidence in Human-AI Interaction cs.AI · 2026-04-08 · unverdicted · none · ref 50
Urgency in human-AI interactions leaves trust in AI unchanged but reduces self-confidence and self-efficacy, per a 30-participant experiment.
When & How to Write for Personalized Demand-aware Query Rewriting in Video Search cs.IR · 2025-12-17 · unverdicted · none · ref 19
WeWrite mines user logs to decide when personalization is needed and trains LLMs with SFT and GRPO to rewrite video search queries, delivering 1.07% more long-view clicks and 2.97% fewer reformulations in live A/B tests.
Responsible Federated LLMs via Safety Filtering and Constitutional AI cs.CL · 2025-02-23 · unverdicted · none · ref 26
Integrates safety filtering and constitutional AI into FedLLM, reporting over 20% safety improvement on AdvBench.
A Survey of Scaling in Large Language Model Reasoning cs.AI · 2025-04-02 · unverdicted · none · ref 173
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting cs.CE · 2025-02-26 · unreviewed · ref 59
