Mixed citation behavior. Most common role is background (40%).
citing papers explorer
-
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
-
Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation
AdaTTA is an actor-critic RL framework that selects sequence-specific test-time augmentations and improves recommendation metrics by up to 26% over fixed augmentation strategies on four datasets.
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
-
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
-
Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation
MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.
-
An Iterative Test-and-Repair Framework for Competitive Code Generation
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
-
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
-
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.
-
Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights
A maximum entropy reinforcement learning framework generates realistic customer trajectories in retail spaces that match real data better than TSP or PNN heuristics and support more accurate layout optimization decisions.
-
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
-
Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning
Reinforcement learning on MIR features combined with cargo-fuzz validation reduces false positives in Rust static memory safety analysis, raising precision from 25.6% to 59.0% and accuracy to 65.2%.
-
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems
SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
-
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
-
GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation
GeoMind applies an agentic workflow with tool-augmented modules and process supervision to outperform static models on lithology classification from well logs while producing traceable decisions.
-
Mitigating Multimodal Hallucination via Phase-wise Self-reward
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
-
TOPCELL: Topology Optimization of Standard Cell via LLMs
TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.
-
Adaptive Bounded-Rationality Modeling of Early-Stage Takeover in Shared-Control Driving
The adaptive bounded-rationality model anticipates hazardous takeovers with better coverage and lead time than baselines while aligning inferred parameters with eye-tracking metrics.
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing
JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.
-
Optimizing Chlorination in Water Distribution Systems via Surrogate-assisted Neuroevolution
Surrogate-assisted neuroevolution produces Pareto-optimal chlorine dosing policies for water distribution systems that outperform PPO on four practical objectives.
-
Adaptive Prompt Elicitation for Text-to-Image Generation
Adaptive Prompt Elicitation (APE) uses an information-theoretic framework to generate visual queries that elicit and compile user intent into better prompts for text-to-image models, showing improved alignment in benchmarks and a user study.
-
Learning to Trust: Dynamic Utilization of Retrieval-Augmented Generation for E-commerce Search Relevance
DyKnow-RAG uses Group Relative Policy Optimization with dual-group rollouts and posterior-driven advantage scaling to optimize context utilization in RAG for e-commerce relevance, showing offline gains and production lifts when deployed at Taobao.
-
Joint Optimization of Handoff and Video Rate in LEO Satellite Networks
The paper introduces MPC and RL algorithms for joint satellite handoff and video bitrate selection to optimize QoE in LEO networks, validated in trace-driven simulations and testbed experiments.
-
Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL
CQ-SID semantic IDs and EG-GRPO RL improve generative retrieval hit rates up to 26.76% over RQ-VAE baselines and deliver +1.15% GMV in live e-commerce A/B tests.
-
Rethinking Priority Scheduling for Sequential Multi-Agent Decision Making in Stackelberg Games
HPA dynamically selects agent decision orders in Stackelberg games to improve equilibria and performance in multi-agent MuJoCo control tasks.
-
CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning
CAPSULE learns probabilistic control-affine dynamics offline to construct uncertainty-incorporating control barrier functions that enforce conservative safety constraints via online action correction in reinforcement learning.
-
ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement
ALAS disentangles environment and self-state streams via bio-inspired modules to deliver 23% higher subtask success and 29% better execution efficiency on long-horizon HSI tasks.
-
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
-
Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor
LLMs can outperform DTA on index recommendations for some workloads, but they remain less reliable and face practical adoption challenges.
-
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
-
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance
SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
-
Search-R3: Unifying Reasoning and Embedding in Large Language Models
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
-
Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.
-
Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution
HRT is a bi-level RL framework with a sparse high-level controller for asset direction selection from signals and a risk-aware low-level controller for weight adjustments, reporting Sharpe 1.24 and turnover 0.090 on 2020-2023 Nasdaq data.
-
Trust the AI, Doubt Yourself: The Effect of Urgency on Self-Confidence in Human-AI Interaction
Urgency in human-AI interactions leaves trust in AI unchanged but reduces self-confidence and self-efficacy, per a 30-participant experiment.
-
When & How to Write for Personalized Demand-aware Query Rewriting in Video Search
WeWrite mines user logs to decide when personalization is needed and trains LLMs with SFT and GRPO to rewrite video search queries, delivering 1.07% more long-view clicks and 2.97% fewer reformulations in live A/B tests.
-
Responsible Federated LLMs via Safety Filtering and Constitutional AI
The paper integrates safety filtering and constitutional AI into FedLLM, reporting over 20% safety improvement on AdvBench.
-
A Survey of Scaling in Large Language Model Reasoning
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
-
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting