hub Canonical reference

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, M · 2025 · cs.CL · arXiv 2503.21460

Canonical reference. 75% of citing Pith papers cite this work as background.

57 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 57 citing papers arXiv PDF

abstract

The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 other 2 baseline 1

citation-polarity summary

background 9 unclear 2 baseline 1

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

LLM-as-an-Investigator improves diagnostic accuracy over direct prompting by using an evidence-first protocol of hypothesis generation, clarification questions, and iterative probability updates in technical problem solving.

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

TimeSage-MT introduces a multi-turn benchmark for agentic time series reasoning and shows frontier LLMs drop sharply on decision-oriented tasks due to memory and uncertainty failures.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

AgingBench demonstrates multi-dimensional degradation in deployed AI agents through four aging mechanisms diagnosed by temporal graphs and counterfactual probes across hundreds of runs.

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.

WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods in EPANET simulations of variable community water demands.

Skill-Conditioned Visual Geolocation for Vision-Language Models

cs.CV · 2026-04-10 · unverdicted · novelty 7.0 · 2 refs

GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.

GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents

q-bio.QM · 2025-10-14 · unverdicted · novelty 7.0

GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves

Mandol: An Agglomerative Agent Memory System for Long-Term Conversations

cs.DB · 2026-06-29 · unverdicted · novelty 6.0

Mandol unifies memory storage and retrieval into an agglomerative semantic graph architecture with quantitative query mechanisms, reporting best accuracy on LoCoMo and LongMemEval plus 5.4x retrieval and 4.8x insertion speedups.

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

cs.SE · 2026-06-18 · unverdicted · novelty 6.0

LLM agents commonly select over-privileged tools across domains, a tendency worsened by transient failures and not fixed by standard safety alignment, but reducible by privilege-aware post-training.

BioHarness: Substrate-Aware Evidence Assembly for Biomedical Question Answering across Literature, Knowledge Bases, and Biological Atlases

q-bio.QM · 2026-06-17 · unverdicted · novelty 6.0

BioHarness improves pooled biomedical QA score from 65.9 to 71.0 on 19,302 items by using staged, substrate-aware evidence assembly that escalates only when needed.

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.

Civil Court Simulation with Large Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Multi-agent LLM framework simulates Chinese civil trials through five-stage procedures with memory and retrieval, producing judgments strong in liability allocation and multi-item decisions.

Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Multicultural multi-agent LLM systems exhibit substantially lower value diversity than human societies on the World Values Survey, with diversity uncorrelated to per-agent alignment and further reduced by agent interactions.

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

cs.CR · 2026-06-01 · unverdicted · novelty 6.0

SeClaw provides spec-driven synthesis of security tasks and an execution-based docker testbed for evaluating unsafe behaviors in autonomous LLM agents.

TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

TrajAudit diagnoses failures in repository-level agentic coding trajectories by filtering noise and injecting test-failure priors, achieving >24.4 pp higher localization accuracy and 18% lower token use on the new RootSE benchmark of 93 instances.

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.

OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.

Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

citing papers explorer

Showing 50 of 57 citing papers.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values cs.AI · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis cs.AI · 2026-06-11 · unverdicted · none · ref 4 · internal anchor
LLM-as-an-Investigator improves diagnostic accuracy over direct prompting by using an evidence-first protocol of hypothesis generation, clarification questions, and iterative probability updates in technical problem solving.
TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning cs.CL · 2026-05-31 · unverdicted · none · ref 17 · internal anchor
TimeSage-MT introduces a multi-turn benchmark for agentic time series reasoning and shows frontier LLMs drop sharply on decision-oriented tasks due to memory and uncertainty failures.
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning cs.CL · 2026-05-30 · unverdicted · none · ref 59 · internal anchor
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems cs.AI · 2026-05-25 · unverdicted · none · ref 27 · internal anchor
AgingBench demonstrates multi-dimensional degradation in deployed AI agents through four aging mechanisms diagnosed by temporal graphs and counterfactual probes across hundreds of runs.
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE cs.CL · 2026-05-18 · unverdicted · none · ref 8 · internal anchor
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain cs.CL · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents cs.LG · 2026-04-11 · unverdicted · none · ref 24 · internal anchor
WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods in EPANET simulations of variable community water demands.
Skill-Conditioned Visual Geolocation for Vision-Language Models cs.CV · 2026-04-10 · unverdicted · none · ref 18 · 2 links · internal anchor
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents q-bio.QM · 2025-10-14 · unverdicted · none · ref 29 · internal anchor
GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves
Mandol: An Agglomerative Agent Memory System for Long-Term Conversations cs.DB · 2026-06-29 · unverdicted · none · ref 33 · internal anchor
Mandol unifies memory storage and retrieval into an agglomerative semantic graph architecture with quantitative query mechanisms, reporting best accuracy on LoCoMo and LongMemEval plus 5.4x retrieval and 4.8x insertion speedups.
When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents cs.SE · 2026-06-18 · unverdicted · none · ref 3 · internal anchor
LLM agents commonly select over-privileged tools across domains, a tendency worsened by transient failures and not fixed by standard safety alignment, but reducible by privilege-aware post-training.
BioHarness: Substrate-Aware Evidence Assembly for Biomedical Question Answering across Literature, Knowledge Bases, and Biological Atlases q-bio.QM · 2026-06-17 · unverdicted · none · ref 9 · internal anchor
BioHarness improves pooled biomedical QA score from 65.9 to 71.0 on 19,302 items by using staged, substrate-aware evidence assembly that escalates only when needed.
SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior cs.AI · 2026-06-10 · unverdicted · none · ref 7 · internal anchor
Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.
Civil Court Simulation with Large Language Models cs.CL · 2026-06-08 · unverdicted · none · ref 17 · internal anchor
Multi-agent LLM framework simulates Chinese civil trials through five-stage procedures with memory and retrieval, producing judgments strong in liability allocation and multi-item decisions.
Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition cs.AI · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.
Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems cs.CL · 2026-06-04 · unverdicted · none · ref 38 · internal anchor
Multicultural multi-agent LLM systems exhibit substantially lower value diversity than human societies on the World Values Survey, with diversity uncorrelated to per-agent alignment and further reduced by agent interactions.
SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents cs.CR · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
SeClaw provides spec-driven synthesis of security tasks and an execution-based docker testbed for evaluating unsafe behaviors in autonomous LLM agents.
TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems cs.SE · 2026-05-26 · unverdicted · none · ref 34 · internal anchor
TrajAudit diagnoses failures in repository-level agentic coding trajectories by filtering noise and injecting test-failure priors, achieving >24.4 pp higher localization accuracy and 18% lower token use on the new RootSE benchmark of 93 instances.
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills cs.AI · 2026-05-22 · unverdicted · none · ref 2 · internal anchor
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences cs.CR · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 16 · internal anchor
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models cs.AI · 2026-04-21 · unverdicted · none · ref 32 · internal anchor
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
Chain-of-Authorization: Embedding authorization into large language models cs.AI · 2026-03-24 · unverdicted · none · ref 3 · internal anchor
LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies cs.CL · 2026-02-10 · conditional · none · ref 22 · internal anchor
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management cs.AI · 2026-02-07 · unverdicted · none · ref 2 · internal anchor
SupChain-Bench reveals substantial gaps in LLM reliability for long-horizon supply chain orchestration, while the proposed SupChain-ReAct framework improves tool-calling by autonomously synthesizing procedures.
WorldCup Sampling for Multi-bit LLM Watermarking cs.CL · 2026-02-02 · unverdicted · none · ref 6 · internal anchor
WorldCup is a new multi-bit LLM watermarking framework that models token sampling as a communication channel and uses hierarchical competition with entropy-aware modulation for robust message embedding and recovery.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle cs.CL · 2025-10-17 · unverdicted · none · ref 2 · 2 links · internal anchor
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 19 · internal anchor
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 16
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources cs.SE · 2026-06-28 · unverdicted · none · ref 6 · internal anchor
RESOURCE2SKILL converts multimodal human resources into a hierarchical Skill Wiki of executable agent skills, reporting +11.9 percentage point average gains over no-skill baselines across seven authoring domains.
RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills cs.CL · 2026-06-16 · unverdicted · none · ref 22 · internal anchor
RubricsTree is a scalable, evolving rubric-based evaluation system with adaptive routing that improves physician alignment for health AI responses and enables performance gains when used as feedback or rewards.
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning cs.LG · 2026-06-16 · unverdicted · none · ref 21 · internal anchor
EnvRL incorporates environment dynamics learning via state prediction and inverse dynamics auxiliary objectives into agentic RL, reporting higher success rates than RL-only baselines on ALFWorld and WebShop.
A Unified Framework for the Evaluation of LLM Agentic Capabilities cs.AI · 2026-05-27 · unverdicted · none · ref 3 · 2 links · internal anchor
A unified framework integrates 7 benchmarks across 24 domains using standardized formats, fixed ReAct execution, and offline snapshots to disentangle intrinsic LLM agent capabilities from scaffold and environment artifacts, shown via 400K rollouts on 15 models.
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs cs.CL · 2026-05-19 · unverdicted · none · ref 21 · internal anchor
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning cs.LG · 2026-05-11 · unverdicted · none · ref 34 · 2 links · internal anchor
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images cs.CV · 2026-04-13 · unverdicted · none · ref 21 · internal anchor
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering cs.SE · 2026-04-09 · accept · none · ref 94 · internal anchor
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP cs.CR · 2026-02-11 · unverdicted · none · ref 5 · internal anchor
The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning cs.AI · 2026-01-29 · unverdicted · none · ref 15 · internal anchor
MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.
Small Language Models are the Future of Agentic AI cs.AI · 2025-06-02 · unverdicted · none · ref 50 · internal anchor
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
Heterogeneous Scientific Foundation Model Collaboration cs.AI · 2026-04-30 · unverdicted · none · ref 34
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
From Question Answering to Task Completion: A Survey on Agent System and Harness Design cs.AI · 2026-06-14 · unverdicted · none · ref 3 · internal anchor
Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.
A Robust Agentic Framework for Expert-Level Automation of Atomistic Simulations cond-mat.mtrl-sci · 2026-06-08 · unverdicted · none · ref 22 · internal anchor
Paimon is an agentic framework that automates atomistic simulations and improves reliability by suppressing silent errors in agent workflows, demonstrated on liquid electrolyte cases and literature reproduction.
Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics cs.GT · 2026-06-06 · unverdicted · none · ref 46 · internal anchor
An autonomy-qualified Second Welfare Theorem is stated for post-AGI economies under the joint conditions of convexity, stable moral status, non-fungible rights, welfare selection, non-manipulation, governed self-modification, and verification.
Scaling Behavior of Single LLM-Driven Multi-Agent Systems cs.MA · 2026-05-30 · unverdicted · none · ref 29 · internal anchor
LLM multi-agent systems exhibit diminishing returns with more agents due to coordination overhead rather than monotonic scaling.
CogEvolution: A Human-like Generative Educational Agent to Simulate Student's Cognitive Evolution cs.AI · 2026-04-16 · unverdicted · none · ref 10 · internal anchor
CogEvolution combines ICAP cognitive taxonomy, IRT memory retrieval, and evolutionary algorithms into a generative agent that simulates dynamic student cognitive evolution and outperforms baselines in fidelity and learning curves.
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub cs.CL · 2026-03-19 · unverdicted · none · ref 11 · internal anchor
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts cs.CR · 2025-10-26 · unverdicted · none · ref 13 · internal anchor
Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer