Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik; Apaar Shanker; Jason Qin; Kaustubh Deshpande; Levi Lentz; Veronica Chatrath; Vijay S. Kalmath; Yash Maurya; Yuan (Emily) Xue

arXiv: 2605.21347 · v2 · pith:OP2RC5MVnew · submitted 2026-05-20 · 💻 cs.AI · cs.LG · cs.SE

Pith reviewed 2026-05-22 09:43 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE

keywords: LLM agents · trace diagnostics · multi-agent systems · failure analysis · insights generation · corpus-level analysis · agent performance · scaffolding

The pith

A multi-agent system generates evidence-backed natural-language insights from entire corpora of LLM agent execution traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diagnosing failures in LLM agents has relied on manual inspection of small trace subsets, missing patterns visible only across large populations. The paper formalizes corpus-level trace diagnostics and introduces the Insights Generator, a multi-agent system that proposes and tests hypotheses against full trace sets to produce grounded reports with linked evidence. This matters because individual traces can span tens of thousands of tokens, making exhaustive human review impractical at production scale. Evaluation shows human experts using the reports improve scaffold performance by 30.4 percentage points over baseline, while coding agents obtain consistent gains from the derived insights. Domain experts rate the reports higher than alternatives on depth and evidence quality.

Core claim

The Insights Generator answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report that characterizes systematic behavioral patterns across trace groups, each linked to supporting evidence.

What carries the argument

The scout-investigator multi-agent architecture that proposes hypotheses and tests them against the full corpus to generate evidence-linked insights.
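The split between proposing on a sample and testing on the full corpus can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the `Trace` and `Finding` types, the keyword-based `scout` heuristic, and the failure-rate comparison in `investigate` are all hypothetical stand-ins for what would be LLM agents in IG.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str
    text: str
    failed: bool


@dataclass
class Finding:
    hypothesis: str
    rate_with: float     # failure rate among traces matching the pattern
    rate_without: float  # failure rate among all other traces
    evidence: list = field(default_factory=list)  # linked trace ids


def failure_rate(traces):
    return sum(t.failed for t in traces) / len(traces) if traces else 0.0


def scout(sample):
    """Propose candidate patterns from a small sample (toy keyword heuristic)."""
    hypotheses = {}
    for keyword in ("timeout", "retry"):
        if any(keyword in t.text for t in sample if t.failed):
            name = f"traces mentioning '{keyword}' fail more often"
            hypotheses[name] = lambda t, k=keyword: k in t.text
    return hypotheses


def investigate(name, predicate, corpus):
    """Test one hypothesis against every trace and attach supporting ids."""
    matching = [t for t in corpus if predicate(t)]
    rest = [t for t in corpus if not predicate(t)]
    evidence = [t.trace_id for t in matching if t.failed][:3]
    return Finding(name, failure_rate(matching), failure_rate(rest), evidence)


def run(corpus, sample_size, seed=0):
    """Orchestrate: scout on a sample, then investigate on the full corpus."""
    rng = random.Random(seed)
    sample = rng.sample(corpus, min(sample_size, len(corpus)))
    return [investigate(n, p, corpus) for n, p in scout(sample).items()]


# Hypothetical six-trace corpus, for illustration only.
corpus = [
    Trace("t1", "tool call timeout", True),
    Trace("t2", "tool call ok", False),
    Trace("t3", "timeout then retry", True),
    Trace("t4", "retry ok", False),
    Trace("t5", "ok", False),
    Trace("t6", "timeout", False),
]
findings = run(corpus, sample_size=len(corpus))
for f in findings:
    print(f.hypothesis, round(f.rate_with, 2), round(f.rate_without, 2), f.evidence)
```

The point of the design is that a hypothesis born from a handful of traces is only reported with rates computed over the whole corpus, and each finding carries trace ids a reader can open.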

If this is right

Human experts achieve a 30.4 percentage point gain in scaffold performance when using IG reports versus the unmodified baseline.
Coding agents that incorporate IG-derived insights exhibit consistent and stable performance improvements.
IG reports match competing methods in detection coverage while receiving higher expert ratings for depth and evidence quality.
The approach scales to corpora with long individual traces without requiring full manual review of each one.
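A gain stated in percentage points is an absolute difference between two rates, which is easy to confuse with a relative improvement. The short sketch below shows the arithmetic; the 40% baseline is a hypothetical value chosen for illustration, since the text here does not state the baseline scaffold's score.

```python
def pp_gain(baseline_rate, new_rate):
    """Absolute difference between two rates, in percentage points."""
    return (new_rate - baseline_rate) * 100


def relative_gain(baseline_rate, new_rate):
    """Improvement relative to the baseline rate, in percent."""
    return (new_rate - baseline_rate) / baseline_rate * 100


baseline = 0.40              # hypothetical baseline pass rate (assumption)
improved = baseline + 0.304  # a 30.4 pp lift on top of that baseline

print(round(pp_gain(baseline, improved), 1))        # percentage points
print(round(relative_gain(baseline, improved), 1))  # relative percent
```

The same 30.4 pp lift corresponds to a different relative improvement for every baseline, which is one reason the baseline construction matters when reading the headline number.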

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automated corpus diagnostics could feed into ongoing monitoring systems that flag new failure modes as agent deployments evolve.
The same hypothesis-testing loop might apply to non-coding domains such as web agents or multi-tool workflows if trace logging is standardized.
Periodic re-running of IG on accumulating traces could track whether implemented fixes resolve the original patterns or introduce new ones.

Load-bearing premise

The multi-agent scout-investigator architecture can reliably propose and test hypotheses across the entire trace corpus without systematic bias or omission of important patterns.

What would settle it

A side-by-side comparison where exhaustive human review of the same corpus identifies a major systematic failure mode that IG reports miss, or where implementing IG insights produces no measurable performance lift over manual hypothesis formation.
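One way to score such a side-by-side comparison is to treat the failure modes from exhaustive human review as the reference set and measure how many of them the report recovers. The sketch below assumes both sides have already been mapped to shared labels, which in practice would itself need human or model judgment; the label names are hypothetical.

```python
def coverage(reference, reported):
    """Return (recall, missed) for reported findings against a reference set."""
    found = reference & reported
    missed = sorted(reference - reported)
    recall = len(found) / len(reference) if reference else 1.0
    return recall, missed


# Hypothetical failure-mode labels, for illustration only.
human_review = {"premature_termination", "tool_misuse", "format_error", "context_loss"}
ig_report = {"tool_misuse", "format_error", "context_loss", "verbose_queries"}

recall, missed = coverage(human_review, ig_report)
print(recall, missed)
```

A non-empty `missed` list containing a major systematic failure mode is the outcome that would count against the approach.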

Figures

Figures reproduced from arXiv: 2605.21347 by Akshay Manglik, Apaar Shanker, Jason Qin, Kaustubh Deshpande, Levi Lentz, Veronica Chatrath, Vijay S. Kalmath, Yash Maurya, Yuan (Emily) Xue.

**Figure 1.** Insights Generator (IG) system overview. Left: the input layer provides a diagnostic question, Q, trace corpus, C, and processed data store, S. Center: the Orchestrator dispatches Scout agents (H: hypothesize over sampled traces) and Investigator agents (H∗: validate via corpus-scale cohort comparison). The Investigator analyzes H∗ to generate findings, F_r, which are sent to the orchestrator. The orchest…

**Figure 2.** Overview of the four evaluation settings used to assess the Insights Generator (rubric-based and intervention-based).

… experiments involving autonomous judges and intervention agents, we vary corpus scale, benchmark diversity (HLE [17] and SpreadsheetBench [12]), and comparison systems spanning single-agent baselines to multi-agent alternatives. The benchmarks span diverse agent task domains: SpreadsheetBen…

Original abstract

Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading in depth and evidence quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IG introduces a scout-investigator multi-agent system for corpus-level trace diagnostics of LLM agents and reports a 30.4pp downstream gain for human experts, but the evaluation lacks controls and coverage checks.

The main thing here is that the paper formalizes corpus-level trace diagnostics for LLM agents and builds a multi-agent system called IG that uses scout agents to surface hypotheses and investigator agents to test them against the full trace set, producing natural-language reports with evidence links. They show humans applying those reports get a 30.4pp lift on scaffold performance and coding agents see stable gains, with experts rating the reports high on depth and evidence quality.

Referee Report

2 major / 1 minor

Summary. The paper formalizes corpus-level trace diagnostics for LLM agents and introduces the Insights Generator (IG), a multi-agent scout-investigator system that proposes and tests hypotheses over execution trace corpora to produce evidence-backed natural-language insights characterizing systematic behavioral patterns. Evaluation combines rubric-based expert assessment of report quality with downstream experiments showing that human experts using IG reports achieve a 30.4pp improvement in scaffold performance over baseline, while coding agents incorporating IG-derived insights exhibit consistent gains; IG reports are rated highly for depth and evidence quality relative to alternatives.

Significance. If the reported performance gains and report quality hold under controlled conditions, the work is significant for LLM agent engineering. It shifts failure diagnosis from ad-hoc manual inspection of small trace subsets to scalable, systematic analysis of full corpora, directly addressing a practical bottleneck in production agent systems. The dual demonstration of human-expert and autonomous-agent improvements via the same insight reports strengthens the practical case.

major comments (2)

[Abstract / Evaluation] Abstract and evaluation description: the 30.4pp scaffold improvement and stable agent gains are presented as central results, yet no information is given on trace sampling strategy, experimental controls, statistical significance testing, or how the baseline scaffold was constructed. Without these, it is impossible to assess whether the gains are attributable to IG insights or to uncontrolled variables in the evaluation setup.
[Abstract] Abstract description of scout-investigator architecture: the claim that IG produces grounded, complete insights rests on the multi-agent loop reliably surfacing corpus-wide patterns. No ablation, coverage argument, or verification that scout sampling examines the full trace population (rather than high-salience subsets) is provided. This directly bears on whether the reported downstream improvements reflect systematic diagnostics or partial pattern capture.
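The kind of significance check the first comment asks for can be sketched with a paired bootstrap over per-task pass/fail outcomes. This is an illustration under assumptions, not the paper's procedure: the 20-task outcome vectors are hypothetical, and a percentile bootstrap is just one reasonable choice of test.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the pass-rate difference, in percentage points.

    baseline[i] and treated[i] are 0/1 outcomes for the same task i, so tasks
    are resampled as pairs to preserve the per-task correlation.
    """
    if len(baseline) != len(treated):
        raise ValueError("outcome lists must be paired")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(treated[i] - baseline[i] for i in idx) / n * 100)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(treated) - sum(baseline)) / n * 100
    return point, lo, hi


# Hypothetical outcomes on 20 shared tasks (1 = pass, 0 = fail).
baseline_outcomes = [1] * 8 + [0] * 12   # 40% pass rate
treated_outcomes = [1] * 14 + [0] * 6    # 70% pass rate

point, lo, hi = paired_bootstrap_ci(baseline_outcomes, treated_outcomes)
print(f"gain = {point:.1f} pp, 95% CI = [{lo:.1f}, {hi:.1f}]")
```

An interval that excludes zero supports attributing the gain to the intervention only if the baseline and treated runs differ in nothing else, which is the experimental-control half of the referee's point.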

minor comments (1)

[Methods] Notation for trace groups and insight linking could be clarified with a small diagram or explicit definition early in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight important areas for improving the clarity and rigor of our evaluation and architectural claims. We address each point below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract / Evaluation] Abstract and evaluation description: the 30.4pp scaffold improvement and stable agent gains are presented as central results, yet no information is given on trace sampling strategy, experimental controls, statistical significance testing, or how the baseline scaffold was constructed. Without these, it is impossible to assess whether the gains are attributable to IG insights or to uncontrolled variables in the evaluation setup.

Authors: We agree that the current description of the evaluation lacks sufficient methodological detail to fully substantiate the reported gains. In the revised manuscript, we will expand the evaluation section with a new subsection that explicitly describes the trace sampling strategy employed for the corpus, the precise construction of the baseline scaffold, the experimental controls used to isolate the contribution of IG insights, and the statistical significance testing performed on the 30.4pp improvement. These additions will enable readers to evaluate whether the performance differences can be attributed to the insights generated by IG.

Revision: yes

Referee: [Abstract] Abstract description of scout-investigator architecture: the claim that IG produces grounded, complete insights rests on the multi-agent loop reliably surfacing corpus-wide patterns. No ablation, coverage argument, or verification that scout sampling examines the full trace population (rather than high-salience subsets) is provided. This directly bears on whether the reported downstream improvements reflect systematic diagnostics or partial pattern capture.

Authors: The manuscript already reports that IG's scout-investigator architecture yields detection coverage comparable to competing approaches across benchmarks. Nevertheless, we acknowledge that an explicit ablation study and a dedicated coverage argument would provide stronger support for the claim of systematic, corpus-wide pattern detection. In the revision, we will add an ablation analysis of the scout and investigator components together with an empirical verification or formal argument showing that scout sampling examines the full trace population rather than high-salience subsets. This will clarify the relationship between the architecture and the observed downstream improvements.

Revision: partial
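An empirical coverage check of the sort discussed in this exchange could compare which trace groups a sample actually touches. The sketch below is hypothetical: the `task_type` and `outcome` fields, the stratification key, and the "failures only" sample standing in for a high-salience subset are all assumptions for illustration, not details from the paper.

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, per_group, seed=0):
    """Draw up to per_group traces from every stratum defined by key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for t in traces:
        groups[key(t)].append(t)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def stratum_coverage(traces, sample, key):
    """Fraction of strata in the corpus that the sample touches, plus the gaps."""
    all_groups = {key(t) for t in traces}
    seen = {key(t) for t in sample}
    return len(seen) / len(all_groups), sorted(all_groups - seen)


def by_type_and_outcome(t):
    return (t["task_type"], t["outcome"])


# Hypothetical ten-trace corpus with four strata.
spec = [("formula", "pass", 3), ("formula", "fail", 2),
        ("formatting", "pass", 4), ("formatting", "fail", 1)]
traces = [{"id": f"{tt}-{oc}-{i}", "task_type": tt, "outcome": oc}
          for tt, oc, n in spec for i in range(n)]

balanced = stratified_sample(traces, by_type_and_outcome, per_group=1)
salient = [t for t in traces if t["outcome"] == "fail"]  # failures only

print(stratum_coverage(traces, balanced, by_type_and_outcome))
print(stratum_coverage(traces, salient, by_type_and_outcome))
```

Reporting the second number alongside the first would show directly how much of the population a salience-driven sample leaves unexamined.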

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmarks

The paper presents the Insights Generator as a multi-agent system for corpus-level trace diagnostics and supports its claims through direct empirical measurements: rubric-based report quality ratings by domain experts, downstream scaffold performance gains of 30.4pp when humans apply IG reports, and consistent agent improvements on benchmarks. No equations, parameter fits, or derivation steps are described that reduce a claimed result to the system's own inputs or definitions by construction. The evaluation relies on external baselines, human assessments, and benchmark comparisons rather than any self-referential loop, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5758 in / 1136 out tokens · 25844 ms · 2026-05-22T09:43:04.809149+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Scout agents explore samples; Investigator agents validate at corpus scale via cohort comparison and distributional statistics."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 10 internal anchors

[1] Anthropic. Claude Code. https://docs.claude.com/en/docs/claude-code/, 2024.

[2] S. Barke, A. Goyal, A. Khare, A. Singh, S. Nath, and C. Bansal. AgentRx: Diagnosing AI agent failures from execution trajectories, 2026. URL https://arxiv.org/abs/2602.02475

[3] A. Bertsch, A. Pratapa, T. Mitamura, G. Neubig, and M. R. Gormley. Oolong: Evaluating long context reasoning and aggregation capabilities, 2025. URL https://arxiv.org/abs/2511.02817

[4] M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, et al. Why do multi-agent LLM systems fail? In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://arxiv.org/abs/2503.13657

[5] Context Labs. HALO: Hierarchical agent loop optimizer, 2025. URL https://github.com/context-labs/halo. Software repository.

[6] X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. Hendryx, Z. Wang, V. Bharadwaj, J. Holm, R. Aluri, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025. URL...

[7] D. Deshpande, V. Gangal, H. Mehta, J. Krishnan, A. Kannappan, and R. Qian. TRAIL: Trace reasoning and agentic issue localization, 2025. URL https://arxiv.org/abs/2505.08638

[8] S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. In International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2408.08435

[9] Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, and C. Finn. Meta-harness: End-to-end optimization of model harnesses, 2026. URL https://arxiv.org/abs/2603.28052

[10] M. Ma, J. Zhang, F. Yang, Y. Kang, Q. Lin, et al. DoVer: Intervention-driven auto debugging for LLM multi-agent systems. In International Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2512.06749

[11] X. Ma, X. Xie, Y. Wang, J. Wang, B. Wu, et al. Demystifying the lifecycle of failures in platform-orchestrated agentic workflows, 2026. URL https://arxiv.org/abs/2509.23735

[12] Z. Ma, B. Zhang, J. Zhang, J. Yu, X. Zhang, X. Zhang, S. Luo, X. Wang, and J. Tang. SpreadsheetBench: Towards challenging real world spreadsheet manipulation. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://arxiv.org/abs/2406.14991

[13] S. Maekawa, H. Iso, and N. Bhutani. Holistic reasoning with long-context LMs: A benchmark for database operations on massive textual data. In International Conference on Learning Representations. URL https://arxiv.org/abs/2410.11996

[14] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. In International Conference on Learning Representations, 2026. URL https://arxi...

[15] R. Nanda, C. Maddila, S. Jha, E. M. Khan, M. Paltenghi, and S. Chandra. Wink: Recovering from misbehaviors in coding agents, 2026. URL https://arxiv.org/abs/2602.17037

[16] J. Ni, Y. Liu, X. Liu, Y. Sun, M. Zhou, P. Cheng, D. Wang, E. Zhao, X. Jiang, and G. Jiang. Trace2Skill: Distill trajectory-local lessons into transferable agent skills, 2026. URL https://arxiv.org/abs/2603.25158

[17] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, et al. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026. doi: 10.1038/s41586-025-09962-4. URL https://www.nature.com/articles/s41586-025-09962-4

[18] Shortcut Research Team. SpreadsheetBench Verified: A curated evaluation set. https://shortcut.ai/blog/posts/spreadsheetbench-verified, Dec. 2025.

[19] V. Ursekar, A. Shanker, V. Chatrath, Y. Xue, and S. Denton. VeRO: An evaluation harness for agents to optimize agents, 2026. URL https://arxiv.org/abs/2602.22480

[20] B. Vidgen, A. Mann, A. Fennelly, J. W. Stanly, L. Rothman, M. Burstein, J. Benchek, D. Ostrofsky, A. Ravichandran, D. Sur, N. Venugopal, A. Hsia, I. Robinson, C. Huang, O. Varones, D. Khan, M. Haines, A. Bridges, J. Boyle, K. Twist, Z. Richards, C. Mahapatra, B. Foody, and O. Nitski. APEX-Agents. arXiv preprint arXiv:2601.14242, 2026. URL https://arxiv.org/...

[21] A. L. Zhang, T. Kraska, and O. Khattab. Recursive language models, 2025. URL https://arxiv.org/abs/2512.24601

[22] G. Zhang, J. Wang, J. Chen, W. Zhou, K. Wang, and S. Yan. AgenTracer: Who is inducing failure in the LLM agentic systems? In International Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2509.03312

[23] J. Zhang, J. Xiang, Z. Yu, F. Teng, X.-H. Chen, et al. AFlow: Automating agentic workflow generation. In International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2410.10762

[24] S. Zhang, M. Yin, J. Zhang, J. Liu, Z. Han, et al. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. In International Conference on Machine Learning, 2025. URL https://arxiv.org/abs/2505.00212

[25] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://arxiv.org/abs/2306.05685

[26] Q. Zhou, J. Zhang, H. Wang, R. Hao, J. Wang, M. Han, Y. Yang, S. Wu, F. Pan, L. Fan, D. Tu, and Z. Zhang. FeatureBench: Benchmarking agentic coding for complex feature development. In International Conference on Learning Representations, 2026. URL https://arxiv.org/pdf/2602.10975

[27] K. Zhu, Z. Liu, B. Li, M. Tian, Y. Yang, J. Zhang, et al. Where LLM agents fail and how they can learn from failures. arXiv preprint, 2025. URL https://arxiv.org/abs/2509.25370

Scale AI Research

A Appendix

A.1 Agent System Prompts

The system prompts below correspond to the production configuration in which both Scout and Investigator subagents are dis...
