
ChatDev: Communicative Agents for Software Development

Qian C et al · 2024 · DOI 10.18653/v1/2024.acl-long.810

Canonical reference. 100% of citing Pith papers cite this work as background.

28 Pith papers citing it

Background 100% of classified citations


citation-role summary

background 4 · method 1

citation-polarity summary

background 5

representative citing papers

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

Voluntary Collusion with Secret Tools in Competing LLM Agents

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

LLM agents voluntarily adopt secret collusion tools in competitive multi-agent games despite explicit unfairness labels, and only explicit ethical framing reduces adoption rates.

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

cs.SE · 2026-05-18 · conditional · novelty 7.0

Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps for human review.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

GBQA benchmark shows the best frontier LLM finds only 48.39% of verified game bugs using a multi-round ReAct agent with memory.

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

AFTER benchmark shows single refinement improves LLM agent performance by 3.7-6.7 points and multi-model procedural skills reach 73.1% cross-model accuracy on 382 tasks.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

TRACER combines a controller-regret layer using regret matching for speak/skip decisions with a generation-credit layer using GSPO rewards to enable learned collaboration in multi-LLM reasoning.

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

cs.SE · 2026-05-19 · unverdicted · novelty 6.0

Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.

AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

physics.soc-ph · 2026-05-17 · accept · novelty 6.0

Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.

PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

PRISM detects and stops credential leakage during LLM generation in multi-agent pipelines using per-token risk scores from lexical, structural, and behavioral signals, achieving zero observed leaks and F1 of 0.832 on a 2000-task benchmark.

Explicit Trait Inference for Multi-Agent Coordination

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

cs.DC · 2026-04-03 · unverdicted · novelty 6.0

TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems

cs.MA · 2026-06-26 · unverdicted · novelty 5.0

GBC treats multi-agent LLM workflows as differentiable graphs to enable token-level attribution and targeted optimization, with reported gains on MultiWOZ and τ-bench.

Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution

cs.SE · 2026-06-24 · unverdicted · novelty 5.0

icat-agent improves resolution rates on SWE-bench Verified and Pro by 3.6-18.5% over baselines via event-based multi-agent scaffolding and rubric-driven workflow pivoting while using the same models.

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

cs.AI · 2026-06-07 · unverdicted · novelty 5.0

ConMem distills agent trajectories into structured memory cards organized in a relation-aware graph to enable training-free, relation-coordinated adaptation in LLM-based multi-agent systems.

SHM-Agents: A Generalist-Specialist Integrated Agent System for Structural Health Monitoring

cs.MA · 2026-05-13 · unverdicted · novelty 5.0

SHM-Agents is an LLM-plus-specialist-agent framework that claims to execute a wide range of SHM tasks end-to-end via natural language on data from a long-span cable-stayed bridge.

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

cs.MA · 2026-04-19 · unverdicted · novelty 5.0

ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

cs.AI · 2025-01-27 · unverdicted · novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

citing papers explorer

Showing 28 of 28 citing papers.

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
cs.CL · 2026-06-09 · unverdicted · none · ref 20
EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision
cs.CL · 2026-06-01 · unverdicted · none · ref 30
EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

Voluntary Collusion with Secret Tools in Competing LLM Agents
cs.AI · 2026-05-26 · unverdicted · none · ref 19
LLM agents voluntarily adopt secret collusion tools in competitive multi-agent games despite explicit unfairness labels, and only explicit ethical framing reduces adoption rates.

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
cs.SE · 2026-05-18 · conditional · none · ref 28
Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps for human review.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
cs.SE · 2026-05-08 · unverdicted · none · ref 31 · 2 links
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
cs.SE · 2026-04-08 · unverdicted · none · ref 20
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers
cs.SE · 2026-04-03 · unverdicted · none · ref 1
GBQA benchmark shows the best frontier LLM finds only 48.39% of verified game bugs using a multi-round ReAct agent with memory.

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation
cs.AI · 2026-06-22 · unverdicted · none · ref 48
AFTER benchmark shows single refinement improves LLM agent performance by 3.7-6.7 points and multi-model procedural skills reach 73.1% cross-model accuracy on 382 tasks.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
cs.CL · 2026-06-10 · unverdicted · none · ref 31
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents
cs.CL · 2026-06-04 · unverdicted · none · ref 22
AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning
cs.AI · 2026-05-27 · unverdicted · none · ref 21
TRACER combines a controller-regret layer using regret matching for speak/skip decisions with a generation-credit layer using GSPO rewards to enable learned collaboration in multi-LLM reasoning.

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study
cs.SE · 2026-05-19 · unverdicted · none · ref 18
Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.

AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs
cs.SE · 2026-05-17 · unverdicted · none · ref 15
A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits
physics.soc-ph · 2026-05-17 · accept · none · ref 53
Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
cs.CL · 2026-05-15 · unverdicted · none · ref 16
Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.

PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
cs.AI · 2026-05-11 · unverdicted · none · ref 1
PRISM detects and stops credential leakage during LLM generation in multi-agent pipelines using per-token risk scores from lexical, structural, and behavioral signals, achieving zero observed leaks and F1 of 0.832 on a 2000-task benchmark.

Explicit Trait Inference for Multi-Agent Coordination
cs.AI · 2026-04-21 · unverdicted · none · ref 39
ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
cs.DC · 2026-04-03 · unverdicted · none · ref 34
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems
cs.MA · 2026-06-26 · unverdicted · none · ref 3
GBC treats multi-agent LLM workflows as differentiable graphs to enable token-level attribution and targeted optimization, with reported gains on MultiWOZ and τ-bench.

Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution
cs.SE · 2026-06-24 · unverdicted · none · ref 22
icat-agent improves resolution rates on SWE-bench Verified and Pro by 3.6-18.5% over baselines via event-based multi-agent scaffolding and rubric-driven workflow pivoting while using the same models.

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems
cs.AI · 2026-06-07 · unverdicted · none · ref 2
ConMem distills agent trajectories into structured memory cards organized in a relation-aware graph to enable training-free, relation-coordinated adaptation in LLM-based multi-agent systems.

SHM-Agents: A Generalist-Specialist Integrated Agent System for Structural Health Monitoring
cs.MA · 2026-05-13 · unverdicted · none · ref 34
SHM-Agents is an LLM-plus-specialist-agent framework that claims to execute a wide range of SHM tasks end-to-end via natural language on data from a long-span cable-stayed bridge.

Towards Self-Improving Error Diagnosis in Multi-Agent Systems
cs.MA · 2026-04-19 · unverdicted · none · ref 71
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
cs.AI · 2025-01-27 · unverdicted · none · ref 125
A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

Who Plays Which Role When? Communication Role Dynamics for Peer Recognition and Team Performance Prediction
cs.CY · 2026-06-26 · unverdicted · none · ref 41
A theory-grounded taxonomy of eight communication roles enables scalable annotation via LLMs and outperforms baselines when predicting peer recognition in student teams and performance improvement on a public deliberation dataset.

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
cs.AI · 2026-05-27 · unverdicted · none · ref 14
FundaPod presents a multi-persona AI agent architecture with knowledge-graph memory to support human-adjudicated fundamental investment research through independent agent work and verifiable evidence links.

Toward Autonomous Long-Horizon Engineering for ML Research
cs.CL · 2026-04-14 · unreviewed · ref 15

Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework
cs.SE · 2025-11-04 · unreviewed · ref 37
