{"total":15,"items":[{"citing_arxiv_id":"2605.20049","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study","primary_cat":"cs.SE","submitted_at":"2026-05-19T16:06:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Controlled minimal-pair experiments on six repository pairs show code cleanliness leaves agent task success unchanged but cuts token use by 7-8% and file revisits by 34%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18684","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents","primary_cat":"cs.SE","submitted_at":"2026-05-18T17:23:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps for human review.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17535","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs","primary_cat":"cs.SE","submitted_at":"2026-05-17T16:39:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero 
mean BER on all tested scenarios where baseline LLM approaches scored zero.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18890","ref_index":53,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits","primary_cat":"physics.soc-ph","submitted_at":"2026-05-17T00:21:53+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15573","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-15T03:33:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"After either staying in the parallel regime or completing one propagation pass, NEXA selects the final answer without using an external judge. 
Let $\\tilde{R}_n$ denote the final candidate response for agent $n$ and let $z_n = f(\\tilde{R}_n)$, $z_{\\mathrm{avg}} = \\frac{1}{N}\\sum_{n=1}^{N} z_n$, $w_n = \\cos(z_n, z_{\\mathrm{avg}})$. (15) We then compute the contribution-weighted centroid $z_{\\mathrm{centroid}} = \\frac{\\sum_{n=1}^{N} w_n z_n}{\\sum_{n=1}^{N} w_n}$ and select $n^\\star = \\arg\\max_n \\cos(z_n, z_{\\mathrm{centroid}})$, $\\hat{y} = \\tilde{R}_{n^\\star}$. (16) This aggregation rule directly inherits the response-conditioned, judge-free philosophy of Self-Org [Tastan et al., 2026]. 3.5 Training Objective The deployment objective is the final task correctness. For a labeled example $(Q, y)$, let $\\hat{y}_G$ denote the final output under graph $G$. In the current implementation, correctness is checked with the same verifier used in evaluation, instantiated as an xVerify-based binary reward [Chen et al."},{"citing_arxiv_id":"2605.10614","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:11:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISM detects and stops credential leakage during LLM generation in multi-agent pipelines using per-token risk scores from lexical, structural, and behavioral signals, achieving zero observed leaks and F1 of 0.832 on a 2000-task benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"toward deterministic reproduction of a structured or memorised string, such as an API key, its token-level uncertainty decreases and probability mass becomes increasingly concentrated. This transition can provide an early signal of leakage before the full secret is emitted. PRISM operationalises this insight by treating generation as sequential risk accumulation. At each decoding step t, it estimates a scalar risk score r_t ∈ [0,1] from observable generation signals and applies intervention before the token is committed to the shared context. 
This design allows PRISM to act before full secret reconstruction, reducing the likelihood that sensitive information propagates to downstream agents. Algorithm 1 summarises the per-token monitoring process: for each token, features are extracted, risk"},{"citing_arxiv_id":"2605.07509","ref_index":31,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals","primary_cat":"cs.SE","submitted_at":"2026-05-08T09:40:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"extracts from token $j$. Since these weights indicate how later tokens attend to earlier ones, they can provide a signal for estimating the relevance between different events in the execution trace [19, 37]. This information can assist in identifying relevant relationships. Another useful signal from the prefill stage is the negative log-likelihood (NLL) [31]. For a specific token $x_i$ appearing after a sequence of previous tokens $x_{<i}$, this value is computed as $\\mathrm{NLL}(x_i) = -\\log P(x_i \\mid x_{<i})$. This equation calculates the negative logarithm of the probability that the model assigns to the actual token. 
When a token has a high negative log-likelihood, it indicates that the model finds the token highly unlikely to appear given the context."},{"citing_arxiv_id":"2604.19278","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Explicit Trait Inference for Multi-Agent Coordination","primary_cat":"cs.AI","submitted_at":"2026-04-21T09:48:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17658","ref_index":71,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Self-Improving Error Diagnosis in Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-19T23:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13018","ref_index":15,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Toward Autonomous Long-Horizon Engineering for ML 
Research","primary_cat":"cs.CL","submitted_at":"2026-04-14T17:55:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06742","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios","primary_cat":"cs.SE","submitted_at":"2026-04-08T07:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03143","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing","primary_cat":"cs.DC","submitted_at":"2026-04-03T16:04:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the round-level KV Cache redundancy that TokenDance targets. Request-Centric KV Cache Reuse. Prefix caching in SGLang [40] and vLLM [12] reuses cached state when a new request shares an exact prefix with a stored sequence. PromptCache [7] reuses cached modules identified by markup, and DroidSpeak [18] shares KV Caches across different LLMs. 
EPIC [10], KVLink [34], and KVComm [36] recover reuse at arbitrary positions through position correction, and CacheBlend [35] further adds selective recomputation to restore accuracy at important positions. All of these methods operate on one request at a time. TokenDance differs by treating the agent round as the reuse unit: it shares reuse work across all agents in a round and compresses the"},{"citing_arxiv_id":"2604.02648","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers","primary_cat":"cs.SE","submitted_at":"2026-04-03T02:23:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GBQA benchmark shows the best frontier LLM finds only 48.39% of verified game bugs using a multi-round ReAct agent with memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02399","ref_index":37,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents","primary_cat":"cs.SE","submitted_at":"2025-11-04T09:27:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoDev introduces an iterative feature-driven framework with a DAG-based Feature Map for context propagation that improves LLM agent performance on end-to-end software development tasks by 56.8% over the best baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.16150","ref_index":125,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future 
Directions","primary_cat":"cs.AI","submitted_at":"2025-01-27T15:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A dynamic benchmarking environment for autonomous agents. https://doi.org/10.48550/arXiv.2405.14573 [123] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the Wild: A Large-Scale Dataset for Android Device Control. https://doi.org/10.48550/arXiv.2307.10088 [124] James Reason. 1990. Human error. Cambridge University Press, Cambridge, United Kingdom. [125] Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D. Weisz. 2023. The programmer's assistant: Conversational interaction with a large language model for software development. In Proc. of the 28th Int. Conf. on IUI. ACM, Sydney, NSW, Australia, 491-514. https://doi.org/10.1145/3581641.3584037 [126] Stuart J. Russell and Peter Norvig."}],"limit":50,"offset":0}