Title resolution pending

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao · 2023

16 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

cs.SE · 2026-05-18 · conditional · novelty 7.0

The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

cs.AI · 2026-05-15 · unverdicted · novelty 7.0 · 2 refs

X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.

A Language for Describing Agentic LLM Contexts

cs.AI · 2026-05-03 · accept · novelty 7.0

ACDL is a language for specifying the structure and dynamics of LLM input contexts in agent systems using constructs for roles, dynamic content, time references, and conditional structures.

Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

cs.SE · 2026-04-02 · accept · novelty 7.0

Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task difficulty predictions.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

cs.CR · 2026-05-04 · accept · novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study

cs.SE · 2026-04-10 · unverdicted · novelty 6.0

Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.

"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction

cs.HC · 2026-03-02 · unverdicted · novelty 6.0

Concurrent human-agent interactions occur in 31.8% of turns and follow five action patterns explained by six triggers and four enabling factors, enabled by a context-aware design probe called CLEO.

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

cs.HC · 2026-04-20 · unverdicted · novelty 5.0

AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought

cs.MA · 2026-04-09 · unverdicted · novelty 5.0 · 2 refs

MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.

How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent

cs.AI · 2026-04-08 · unverdicted · novelty 5.0

Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.

Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey

eess.SY · 2026-04-09 · unverdicted · novelty 4.0

The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.

DataClaw: An Autonomous Data Agent with Instant Messaging Integration

cs.DB · 2026-04-27 · unverdicted · novelty 3.0

DataClaw embeds an autonomous agent into instant messaging platforms so users can request and receive complete data analysis pipelines through natural language chat.

Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review

cs.SE · 2026-04-10 · unverdicted · novelty 2.0

A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.

citing papers explorer

Showing 16 of 16 citing papers.

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
cs.SE · 2026-05-18 · conditional · none · ref 27
The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention
cs.AI · 2026-05-15 · unverdicted · none · ref 75 · 2 links
X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.

A Language for Describing Agentic LLM Contexts
cs.AI · 2026-05-03 · accept · none · ref 10
ACDL is a language for specifying the structure and dynamics of LLM input contexts in agent systems using constructs for roles, dynamic content, time references, and conditional structures.

Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
cs.SE · 2026-04-27 · unverdicted · none · ref 48
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
cs.AI · 2026-04-21 · unverdicted · none · ref 60
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.

Evaluating LLM Agents on Automated Software Analysis Tasks
cs.SE · 2026-04-13 · unverdicted · none · ref 67
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
cs.SE · 2026-04-02 · accept · none · ref 30
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task difficulty predictions.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
cs.CR · 2026-05-04 · accept · none · ref 54
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study
cs.SE · 2026-04-10 · unverdicted · none · ref 54
Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.

"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction
cs.HC · 2026-03-02 · unverdicted · none · ref 65
Concurrent human-agent interactions occur in 31.8% of turns and follow five action patterns explained by six triggers and four enabling factors, enabled by a context-aware design probe called CLEO.

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
cs.HC · 2026-04-20 · unverdicted · none · ref 119
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA · 2026-04-09 · unverdicted · none · ref 46 · 2 links
MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.

How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
cs.AI · 2026-04-08 · unverdicted · none · ref 17
Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.

Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey
eess.SY · 2026-04-09 · unverdicted · none · ref 124
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.

DataClaw: An Autonomous Data Agent with Instant Messaging Integration
cs.DB · 2026-04-27 · unverdicted · none · ref 9
DataClaw embeds an autonomous agent into instant messaging platforms so users can request and receive complete data analysis pipelines through natural language chat.

Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
cs.SE · 2026-04-10 · unverdicted · none · ref 63
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.
