Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov · 2017 · cs.LG · arXiv 1707.06347

Mixed citation behavior. The most common role is background (52% of classified citations). 1992 Pith papers cite this paper.

abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
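
As a rough illustration of the "surrogate" objective and the multiple epochs of minibatch updates mentioned in the abstract, the sketch below uses the clipped form of the objective that is usually associated with PPO. The abstract itself does not spell out the formula, so the clipping rule, the value eps=0.2, and the helper names clipped_surrogate and minibatch_indices are assumptions made for illustration. It uses NumPy only and performs no gradient step.

```python
import numpy as np


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective, to be maximized. eps=0.2 is an assumed value."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise minimum gives a pessimistic estimate of the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))


def minibatch_indices(n, batch_size, epochs, rng):
    """Yield shuffled index batches, passing over the same n samples `epochs` times."""
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            yield perm[start:start + batch_size]


# Toy batch of 8 samples with made-up log-probabilities and advantages.
rng = np.random.default_rng(0)
old_logp = rng.normal(size=8)
new_logp = old_logp + 0.1 * rng.normal(size=8)
advantages = rng.normal(size=8)

for idx in minibatch_indices(n=8, batch_size=4, epochs=3, rng=rng):
    objective = clipped_surrogate(new_logp[idx], old_logp[idx], advantages[idx])
    # A real implementation would take a gradient ascent step on `objective` here.
```

Clipping removes the incentive to push the probability ratio outside [1 - eps, 1 + eps], which is what makes it reasonable to reuse the same samples over several epochs.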

citation-role summary

background: 156 · method: 114 · baseline: 15 · dataset: 4

citation-polarity summary

background: 151 · use method: 110 · baseline: 15 · unclear: 7 · use dataset: 4 · support: 2
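
The page does not say which tally the 52% background figure is computed from. As a quick check, assuming the shares are plain count ratios rounded to whole percents, the two summaries can be compared directly. The dictionary names and the shares helper are hypothetical; the counts are copied from the summaries above.

```python
# Counts copied from the citation-role and citation-polarity summaries.
roles = {"background": 156, "method": 114, "baseline": 15, "dataset": 4}
polarity = {
    "background": 151, "use method": 110, "baseline": 15,
    "unclear": 7, "use dataset": 4, "support": 2,
}


def shares(counts):
    """Return each category's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


role_shares = shares(roles)
polarity_shares = shares(polarity)
```

Both tallies sum to 289 classified citations. Under this assumption the polarity count (151 of 289) rounds to 52%, while the role count (156 of 289) rounds to 54%, so the headline figure appears to follow the polarity tally.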

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

On the Policy Gradient Foundations of Group Relative Policy Optimization: Credit Assignment, Gradient Sparsity, and Rank Collapse

cs.LG · 2026-06-28 · conditional · novelty 8.0

GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

cs.LG · 2026-06-22 · conditional · novelty 8.0 · 2 refs

RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

IRumAI: Reinforcement Learning for Indian Rummy

cs.AI · 2026-06-20 · unverdicted · novelty 8.0

IRumAI is the first RL agent for Indian Rummy, trained on weak heuristics to beat strong search opponents at 7000x speed.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

cs.AI · 2026-06-17 · conditional · novelty 8.0

DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.

Efficient AI-Inspired Reduction of Feynman Integrals via Tube Seeding

hep-ph · 2026-06-09 · unverdicted · novelty 8.0

Machine learning discovers a tube-seeding strategy for IBP reduction of Feynman integrals that scales linearly with numerator power, demonstrated on rank-20 2-loop 5-point integrals.

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

A reward-free representation learning pipeline for offline PbRL achieves better preference efficiency than standard two-stage baselines by connecting RFRL concepts to preference data.

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

cs.RO · 2026-05-28 · unverdicted · novelty 8.0

Dynamic isotropy, quantifying uniform center-of-mass acceleration capability, improves robot performance and enables omnidirectional locomotion, terrain traversal, and failure resilience in a spherical robot design.

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

AtomComposer uses online RL with multi-composition training to discover up to 10x more valid 3D isomers on unseen chemical formulas than single-composition baselines.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Structural Equivalence and Learning Dynamics in Delayed MARL

cs.LG · 2026-05-05 · accept · novelty 8.0

Observation and action delays are formally equivalent in cooperative Dec-POMDPs, yielding identical optimal solutions and enabling zero-shot transfer, though learning dynamics differ due to credit assignment and operational constraints.

Language Game: Talking to Non-Human Systems

cs.LG · 2026-05-05 · unverdicted · novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior without parameter changes.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

cs.RO · 2026-02-10 · unverdicted · novelty 8.0

A certified gradient-based method for contact-rich manipulation that quantifies smoothing-induced errors via set-valued discrepancies and incorporates them into analytical reachable sets for robust affine feedback policies.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

cs.LG · 2025-06-02 · unverdicted · novelty 8.0

Develops and tests the first effective safeguard for analytic gradient-based provably safe RL, showing safe training on three control tasks without performance loss.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

citing papers explorer

Showing 50 of 266 citing papers after filters.

PRISM: Perception Reasoning Interleaved for Sequential Decision Making cs.AI · 2026-05-06 · unverdicted · none · ref 58 · internal anchor
PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 21 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
Agentic AI Systems Should Be Designed as Marginal Token Allocators cs.AI · 2026-05-02 · unverdicted · none · ref 38 · internal anchor
Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 59 · internal anchor
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures cs.AI · 2026-04-23 · unverdicted · none · ref 17 · 2 links · internal anchor
ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 10 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models cs.AI · 2026-04-18 · unverdicted · none · ref 4 · internal anchor
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy cs.AI · 2026-04-18 · unverdicted · none · ref 9 · internal anchor
LAPD, derived from the provable preference discrepancy in aligned LLMs, improves zero-shot AI text detection by 45.82% over baselines with claimed statistical dominance over Fast-DetectGPT.
LACE: Lattice Attention for Cross-thread Exploration cs.AI · 2026-04-16 · unverdicted · none · ref 28 · 3 links · internal anchor
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production cs.AI · 2026-04-14 · unverdicted · none · ref 20 · 2 links · internal anchor
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
StaRPO: Stability-Augmented Reinforcement Policy Optimization cs.AI · 2026-04-10 · unverdicted · none · ref 27 · internal anchor
StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.
RAMP: Hybrid DRL for Online Learning of Numeric Action Models cs.AI · 2026-04-09 · unverdicted · none · ref 36 · internal anchor
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments cs.AI · 2026-03-25 · unverdicted · none · ref 192 · internal anchor
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
The Informational Cost of Agency: A Bounded Measure of Interaction Efficiency for Deployed Reinforcement Learning cs.AI · 2026-03-01 · unverdicted · none · ref 14 · 2 links · internal anchor
Introduces Bipredictability P with a provable bound P ≤ 0.5 from entropy subadditivity, showing responsive agency imposes an informational cost by suppressing P to ~0.33, validated across RL agents and other systems, plus an IDT architecture outperforming reward monitoring.
MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents cs.AI · 2026-02-13 · unverdicted · none · ref 86 · internal anchor
MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.
VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation cs.AI · 2026-02-07 · unverdicted · none · ref 34 · internal anchor
VGAS uses best-of-N selection with a geometrically grounded critic and explicit regularization to improve success rates of few-shot VLA policies under limited data and distribution shifts.
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning cs.AI · 2026-01-29 · unverdicted · none · ref 19 · internal anchor
MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving cs.AI · 2026-01-29 · unverdicted · none · ref 26 · internal anchor
An MLLM interpreter generates concise CDL descriptions from diagrams, enabling an off-the-shelf LLM to solve plane geometry problems competitively after training on only 5.5k examples.
What Drives Success in Physical Planning with Joint-Embedding Predictive World Models? cs.AI · 2025-12-30 · unverdicted · none · ref 58 · internal anchor
An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis cs.AI · 2025-11-13 · unverdicted · none · ref 13 · internal anchor
A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.
OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices cs.AI · 2025-10-28 · unverdicted · none · ref 29 · internal anchor
OpsAgent presents a training-free multi-agent framework with dual self-evolution for automated incident management in microservices, claiming SOTA results on OPENRCA benchmark and successful production deployment at Lenovo.
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance cs.AI · 2025-10-09 · unverdicted · none · ref 16 · internal anchor
SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning cs.AI · 2025-09-02 · conditional · none · ref 56 · internal anchor
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning cs.AI · 2025-08-27 · unverdicted · none · ref 17 · internal anchor
InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench benchmark.
Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization cs.AI · 2025-08-13 · unverdicted · none · ref 25 · internal anchor
LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.
AutoSculpt: A Pattern-based Model Auto-pruning Framework Using Reinforcement Learning and Graph Learning cs.AI · 2024-12-24 · unverdicted · none · ref 42 · internal anchor
AutoSculpt models DNNs as graphs, embeds pruning patterns, and uses deep reinforcement learning to reach up to 90% pruning and 18% better FLOPs reduction than baselines on ResNet, MobileNet, VGG, and Vision Transformers.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment cs.AI · 2023-08-10 · accept · none · ref 38 · internal anchor
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards cs.AI · 2026-06-27 · unverdicted · none · ref 25 · internal anchor
BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.
A Formula-Driven Survey and Research Agenda for On-Policy Distillation cs.AI · 2026-06-22 · unverdicted · none · ref 9 · internal anchor
A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.
Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training cs.AI · 2026-06-19 · unverdicted · none · ref 23 · internal anchor
JS divergence in a unified f-divergence framework for GRPO-style T2I alignment yields competitive performance while preserving generation diversity.
Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing cs.AI · 2026-06-18 · unverdicted · none · ref 28 · internal anchor
Integrates multi-head attention with SAC for faster convergence in optimizing additive manufacturing parameters to minimize porosity, outperforming DQN, PPO, TD3, and vanilla SAC.
ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch cs.AI · 2026-06-17 · unverdicted · none · ref 26 · internal anchor
ProfiLLM deploys tool-augmented LLM agents to generate reusable global knowledge and utility-selected user profiles, delivering up to 6.14% AUC lift and measurable GMV gains in DiDi's live dispatcher.
HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning cs.AI · 2026-06-09 · unverdicted · none · ref 20 · internal anchor
HIPIF trains LLM agents end-to-end using subgoal-based hierarchical planning and information folding of completed histories, plus hierarchical reflection and process rewards, to handle long-horizon tasks without auxiliary models or expert trajectories.
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization cs.AI · 2026-06-08 · unverdicted · none · ref 80 · internal anchor
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution cs.AI · 2026-06-07 · unverdicted · none · ref 34 · internal anchor
TT-DAC-PS, an enhanced version of TD3, achieves lower mean implementation shortfall than PPO, SAC, A2C, TWAP, VWAP, and AC on LOB data from ten U.S. stocks.
DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes cs.AI · 2026-05-27 · unverdicted · none · ref 24 · internal anchor
DenoiseRL optimizes recovery from noisy prefixes in weak-model reasoning failures to improve performance and self-correction on math and general reasoning benchmarks without external supervision.
Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning cs.AI · 2026-05-27 · unverdicted · none · ref 11 · internal anchor
Offline RL post-training boosts code generation performance in LLMs, with larger gains for small models and hard problems, using pre-collected datasets.
Darwin Mobile Agent: A Roadmap for Self-Evolution cs.AI · 2026-05-26 · unverdicted · none · ref 11 · internal anchor
Introduces an open-source mobile GUI agent training framework and a roadmap for autonomous self-evolution via removal of human priors in three pillars.
Distilling Game Code World Model Generation into Lightweight Large Language Models cs.AI · 2026-05-23 · unverdicted · none · ref 23 · internal anchor
SFT followed by RLVR on Qwen2.5-3B-Instruct raises syntactic and execution correctness when generating Game Code World Models across 30 games.
Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals cs.AI · 2026-05-21 · conditional · none · ref 8 · internal anchor
A PPO-trained DRL agent selects from established dispatching rules to minimize total job completion time in FJSP with random arrivals, outperforming single rules and performing competitively with arrival-triggered MILP on heterogeneous datasets.
Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects cs.AI · 2026-05-16 · unverdicted · none · ref 200 · internal anchor
A survey organizing AI methods for inverse PDE problems into inverse problems, inverse design, and control categories, covering applications and future challenges like physics-informed models and uncertainty quantification.
expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling cs.AI · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
EXPO improves GRPO for LLM mathematical reasoning via accuracy-conditioned KL scaling and Gaussian curriculum sampling, delivering gains such as 13.34 points on AIME 2025 pass@32.
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents cs.AI · 2026-05-05 · reject · none · ref 7 · 2 links · internal anchor
MEMTIER reports 0.382 accuracy and 0.412 F1 on the 500-question LongMemEval-S benchmark, a 33pp gain over full-context baseline using tiered memory and retrieval components on 6GB GPU hardware.
An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources cs.AI · 2026-04-27 · unverdicted · none · ref 19 · internal anchor
Joint training of multi-agent RL schedulers for jobs and AGVs outperforms modular training plus dispatching rules except in severe bottleneck environments where the coordination advantage shrinks.
Trust the AI, Doubt Yourself: The Effect of Urgency on Self-Confidence in Human-AI Interaction cs.AI · 2026-04-08 · unverdicted · none · ref 51 · internal anchor
Urgency in human-AI interactions leaves trust in AI unchanged but reduces self-confidence and self-efficacy, per a 30-participant experiment.
RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin cs.AI · 2026-04-04 · unverdicted · none · ref 17 · internal anchor
A PPO reinforcement learning agent on a 50x50 grid increases modeled ecosystem service value in the Lake Malawi Basin by reallocating land-cover classes while adding spatial contiguity and buffer constraints.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 62 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs cs.AI · 2024-10-24 · unverdicted · none · ref 18 · internal anchor
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
A Multi-Agent Reinforcement Learning Framework for Public Health Decision Analysis cs.AI · 2023-11-01 · unverdicted · none · ref 7 · internal anchor
MARL framework for jurisdiction-specific HIV intervention allocation accounting for cross-jurisdictional interactions outperforms single-agent RL in CA/FL simulations under fixed budgets.
Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness cs.AI · 2026-06-09 · unverdicted · none · ref 16 · internal anchor
Soul Computing is introduced as a framework distinguishing narrow and broad forms for constructing intelligent agents with self-identity via intensional cores, separate from affective computing or virtual humans.