Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
hub Canonical reference
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods in EPANET simulations of variable community water demands.
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves
Multicultural multi-agent LLM systems exhibit substantially lower value diversity than human societies on the World Values Survey, with diversity uncorrelated to per-agent alignment and further reduced by agent interactions.
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
An LLM-orchestrated framework automates the full XANES workflow from natural language to normalized spectra and curated data.
LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
SupChain-Bench reveals substantial gaps in LLM reliability for long-horizon supply chain orchestration, while the proposed SupChain-ReAct framework improves tool-calling by autonomously synthesizing procedures.
WorldCup is a new multi-bit LLM watermarking framework that models token sampling as a communication channel and uses hierarchical competition with entropy-aware modulation for robust message embedding and recovery.
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
citing papers explorer
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
-
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
-
WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents
WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods in EPANET simulations of variable community water demands.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
-
GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents
GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves
-
Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems
Multicultural multi-agent LLM systems exhibit substantially lower value diversity than human societies on the World Values Survey, with diversity uncorrelated to per-agent alignment and further reduced by agent interactions.
-
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
-
OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences
OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
-
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
-
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
-
ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis
An LLM-orchestrated framework automates the full XANES workflow from natural language to normalized spectra and curated data.
-
Chain-of-Authorization: Embedding authorization into large language models
LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
-
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
-
SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management
SupChain-Bench reveals substantial gaps in LLM reliability for long-horizon supply chain orchestration, while the proposed SupChain-ReAct framework improves tool-calling by autonomously synthesizing procedures.
-
WorldCup Sampling for Multi-bit LLM Watermarking
WorldCup is a new multi-bit LLM watermarking framework that models token sampling as a communication channel and uses hierarchical competition with entropy-aware modulation for robust message embedding and recovery.
-
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
-
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
-
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP
The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.
-
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.
-
Small Language Models are the Future of Agentic AI
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
CogEvolution: A Human-like Generative Educational Agent to Simulate Student's Cognitive Evolution
CogEvolution combines ICAP cognitive taxonomy, IRT memory retrieval, and evolutionary algorithms into a generative agent that simulates dynamic student cognitive evolution and outperforms baselines in fidelity and learning curves.
-
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
-
Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts
Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.
-
What Factors Affect LLMs and RLLMs in Financial Question Answering?
Prompting and agent methods boost standard LLMs on financial QA by simulating long chain-of-thought reasoning, but reasoning LLMs already have this capability and show limited further gains, while multilingual alignment helps mainly by lengthening reasoning with minimal benefit for reasoning models.
-
FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
FundaPod presents a multi-persona AI agent architecture with knowledge-graph memory to support human-adjudicated fundamental investment research through independent agent work and verifiable evidence links.
-
Rethinking Agentic Reinforcement Learning In Large Language Models
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
-
When control meets large language models: From words to dynamics
The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
- SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding