Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
arXiv preprint arXiv:2510.04550
9 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (9)
verdicts: UNVERDICTED (9)
roles: background (1)
polarities: background (1)
citing papers explorer
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
  Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
- PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections
  PI-Hunter automates red-teaming of LLM agents by generating and iteratively evolving source-aware test cases to induce retrieval of embedded malicious instructions from external environments.
- SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
  Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.
- FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search
  FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
  ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
- Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows
  Case study of CMBAgent on 18 astrophysical tasks finds strong performance on well-specified problems but frequent silent failures yielding physically inconsistent outputs.
- Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
  Trajel introduces a five-type taxonomy and benchmark for trajectory-level hallucinations in multi-agent LLM workflows, showing existing final-answer benchmarks miss common failures.
- Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI
  A hybrid deterministic-plus-semantic interception layer for continuous task-based authorization of multi-turn LLM agent tool invocations, with new multi-turn datasets and initial experiments.
- ReadingMachine: A Computational Methodology for Structured Corpus Reading and Large-Scale Synthesis
  ReadingMachine introduces a staged LLM-based methodology for structured corpus reading that emphasizes coverage and traceability, demonstrated on 152 industrial policy documents yielding over 17,500 insights.