The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.
Title resolution pending
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 16roles
background 3polarities
background 3representative citing papers
X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.
ACDL is a language for specifying the structure and dynamics of LLM input contexts in agent systems using constructs for roles, dynamic content, time references, and conditional structures.
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task difficulty predictions.
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.
Concurrent human-agent interactions occur in 31.8% of turns and follow five action patterns explained by six triggers and four enabling factors, enabled by a context-aware design probe called CLEO.
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.
Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.
DataClaw embeds an autonomous agent into instant messaging platforms so users can request and receive complete data analysis pipelines through natural language chat.
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.
citing papers explorer
-
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.
-
X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention
X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.
-
A Language for Describing Agentic LLM Contexts
ACDL is a language for specifying the structure and dynamics of LLM input contexts in agent systems using constructs for roles, dynamic content, time references, and conditional structures.
-
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
-
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task difficulty predictions.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
-
Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study
Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.
-
"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction
Concurrent human-agent interactions occur in 31.8% of turns and follow five action patterns explained by six triggers and four enabling factors, enabled by a context-aware design probe called CLEO.
-
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
-
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEval-S benchmarks.
-
How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.
-
Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.
-
DataClaw: An Autonomous Data Agent with Instant Messaging Integration
DataClaw embeds an autonomous agent into instant messaging platforms so users can request and receive complete data analysis pipelines through natural language chat.
-
Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.