Canonical reference
Title resolution pending
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary
roles: background 5
citation-polarity summary
polarities: background 5
citing papers explorer
- Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
  The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.
- Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
  PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
- SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
  SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
- PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
  PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
- ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
  ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
- Agent-Aided Design for Dynamic CAD Models
  AADvark extends agent-aided CAD design to dynamic 3D assemblies with movable parts by integrating constraint solvers and visual feedback to create a verification signal for the agent.
- Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
  A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
  Empirical study of 3977 agent trajectories finds Python execution errors correlate with lower success rates on GitHub issues, flags challenging errors, and reports three confirmed bugs in the SWE-Bench platform.
- "Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution
  Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.
- From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines
  The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.
- XARP Tools: An Extended Reality Platform for Humans and AI Agents
  XARP provides a WebSocket-based remote-procedure system that lets Python code and AI agents control Unity XR clients, with benchmarks and user studies showing faster iteration than conventional XR workflows.
- AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
  AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.