pith. machine review for the scientific record.

arxiv: 2605.02244 · v1 · submitted 2026-05-04 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 18:26 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords software engineering agents · triadic data · long-horizon tasks · agent training data · human-AI interaction · data collection frameworks

The pith

Software engineering agents need triadic data that records human conversations, AI sessions, and full team workflows to handle long-horizon projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that frontier agents have hit a ceiling on short tasks but regress on the ambiguous, multi-week, cross-functional work that defines senior engineering. It identifies the missing element as triadic data: synchronized records of the human-human talks that build engineering context, the human-AI exchanges that apply it, and the surrounding multi-week team activity. The authors propose two concrete products to generate this data—long-horizon expert trajectories collected through stimulated-recall methods and fully instrumented simulated companies—and insist that any corpus, triadic or not, must pass a four-tier validation process before use in training. They argue this data is obtainable with existing techniques and directly addresses open questions in agent scaling.

Core claim

The central claim is that the training substrate required for capable long-horizon software engineering agents is triadic data, formed by synchronized capture of human-human conversations where context is created, human-AI sessions where that context is used, and the multi-week cross-functional work that encompasses both. This substrate is realized through long-horizon expert trajectories under stimulated-recall protocols and simulated cross-functional companies with instrumented senior teams. Any candidate corpus must demonstrate quality via mechanical verification, statistical characterization, probe experiments, and pre-registered blind evaluation, and the authors state that such triadic data is capturable within 12-18 months with methods already mature in adjacent fields.
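To make the shape of this substrate concrete, the sketch below shows what one synchronized triadic record might look like in code. The paper defines the three streams only in prose; every field name, type, and granularity choice here is an illustrative assumption, not a schema the paper specifies.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical record types for a triadic corpus. The paper defines the three
# streams only conceptually; these names and this granularity are assumptions.

@dataclass
class Event:
    timestamp: float   # seconds since project start, on one shared clock
    actor: str         # e.g. "engineer_a", "pm_b", "agent"
    channel: Literal["human-human", "human-ai", "team-workflow"]
    content: str       # utterance, prompt/response, or artifact diff

@dataclass
class TriadicRecord:
    project_id: str
    events: list[Event] = field(default_factory=list)

    def stream(self, channel: str) -> list[Event]:
        """One of the three synchronized streams, in time order."""
        return sorted((e for e in self.events if e.channel == channel),
                      key=lambda e: e.timestamp)

    def context_before(self, event: Event) -> list[Event]:
        """Cross-stream history preceding an event: the prior state of the
        engineer's context that dyadic human-AI logs discard."""
        return sorted((e for e in self.events if e.timestamp < event.timestamp),
                      key=lambda e: e.timestamp)
```

The shared clock is the load-bearing design choice: it lets any human-AI exchange be conditioned on the human-human conversation that preceded it, which is exactly the signal the paper argues dyadic logs lose.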

What carries the argument

Triadic data: synchronized capture of human-human conversations, human-AI sessions, and surrounding multi-week cross-functional work.

If this is right

  • Agents trained on triadic data will succeed on long-horizon, multi-engineer, ambiguous-specification deliverables where current agents regress.
  • The four-tier evidence framework will let researchers rigorously justify the quality of any training corpus before fine-tuning.
  • Stimulated-recall trajectories and instrumented simulated companies will supply the empirical basis for answering four open questions in agent training.
  • The field's near-term research agenda should include explicit collection of triadic data within 12-18 months using mature methods from adjacent fields.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Collecting full triadic records could shift data efforts away from scaling solo trajectories toward instrumenting real collaborative environments.
  • The same synchronized-capture approach might apply to other long-horizon collaborative domains that rely on shared context.
  • Partial versions of triadic data could be tested to determine whether the complete triad is required or if subsets already produce gains.
  • New synchronization tools for conversations, actions, and deliverables will likely be needed to build these corpora at scale; a minimal alignment sketch follows this list.
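A minimal sketch of what that alignment step might involve, assuming each capture channel already exports timestamped events on a shared clock; the function names, tuple layout, and gap threshold are illustrative, not tooling the paper describes.

```python
import heapq

# Hypothetical alignment step: each capture channel exports
# (timestamp, actor, content) tuples on a shared clock.

def merge_timelines(channels: dict[str, list[tuple[float, str, str]]]):
    """Merge per-channel event logs into one time-ordered stream,
    tagging each event with its channel name."""
    tagged = [
        sorted((ts, name, actor, content) for ts, actor, content in events)
        for name, events in channels.items()
    ]
    return list(heapq.merge(*tagged))

def coverage_gaps(merged, max_gap_s=3600.0):
    """Flag spans with no events in any channel for longer than max_gap_s:
    likely holes in the capture rather than genuine inactivity."""
    return [(a[0], b[0]) for a, b in zip(merged, merged[1:])
            if b[0] - a[0] > max_gap_s]
```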

Load-bearing premise

That collecting and applying triadic data under the four-tier evidence framework will actually close the performance gap on long-horizon tasks, rather than amounting to a plausible but untested data strategy.

What would settle it

A pre-registered blind evaluation on long-horizon software engineering benchmarks in which agents fine-tuned on triadic data show no measurable improvement over agents trained on standard GitHub scrapes or solo trajectories.
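For concreteness, here is a minimal sketch of the blinding step such an evaluation would need; the paper specifies the design only at the protocol level, so the function and field names are assumptions.

```python
import random

# Sketch of the blinding step: deliverables from the triadic-trained agent and
# the baseline agent are paired per task, and each pair's presentation order is
# shuffled with a pre-registered seed so raters cannot infer the condition.

def blinded_pairs(triadic_outputs, baseline_outputs, seed=0):
    rng = random.Random(seed)  # fixed, pre-registered seed
    pairs = []
    for a, b in zip(triadic_outputs, baseline_outputs):
        left_is_triadic = rng.random() < 0.5
        left, right = (a, b) if left_is_triadic else (b, a)
        pairs.append({"left": left, "right": right,
                      "left_is_triadic": left_is_triadic})  # sealed until scoring ends
    return pairs
```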

Figures

Figures reproduced from arXiv: 2605.02244 by Yelin Kim.

Figure 1: Three triadic configurations. The arrow into the AI agent in each panel is the locus of current public corpora; the dashed arrows between humans are the locus of the missing training signal. What governs the success of this interaction is not the quality of the human–AI dialogue. It is the prior state of the engineer's own context. If they have just come from a meeting where the cluster-scale change was di…
Figure 2: Four-tier evidence framework. Item-level mechanical checks (T1) and statistical corpus characterization (T2) are necessary baselines. Probe experiments (T3) and pre-registered blind evaluation (T4) are where most public corpora fail—and where credible producers distinguish themselves.
Original abstract

Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field's near-term research agenda should include it explicitly.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper claims that frontier SWE agents have saturated short-horizon benchmarks but are regressing on long-horizon, multi-engineer deliverables with ambiguous specifications. It argues that neither larger GitHub scrapes, solo-agent trajectories, nor open human-AI dialogue logs are sufficient; the required substrate is triadic data consisting of synchronized human-human conversations (where context is formed), human-AI sessions (where context is consumed), and the surrounding multi-week cross-functional work. The paper proposes two canonical capture methods—long-horizon expert trajectories via stimulated-recall protocols and instrumented simulated cross-functional companies—and specifies a four-tier evidence framework (mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation) that any corpus must satisfy to justify its quality for fine-tuning.

Significance. If the proposed triadic data collection strategy and four-tier framework can be realized and shown to improve long-horizon performance, the work would usefully redirect data-centric research in software engineering agents away from purely observational corpora toward synchronized multi-party records. The explicit four-tier validation structure is a constructive contribution that supplies a concrete checklist for future data papers, and the 12-18 month capturability claim (drawing on mature methods from adjacent fields) provides a realistic timeline that could accelerate empirical follow-up.

major comments (2)
  1. [Abstract] The claim that triadic data 'is the empirical key to four open questions in agent training' is load-bearing for the position, yet the paper provides no explicit mapping from the three synchronized data streams to concrete training objectives (e.g., next-token prediction on conversation threads, preference pairs derived from stimulated recall, or retrieval-augmented context for coordination failures). Without this linkage the necessity argument remains an untested hypothesis rather than a secured requirement.
  2. [Four-tier evidence framework] The framework is presented as the required justification standard for any corpus, but it is never applied even to a toy example or an existing public dataset. This omission leaves open whether the tiers are operationalizable at the scale needed for long-horizon SWE data, and it therefore weakens the assertion that triadic data can be 'justified to a fine-tuning researcher' through these tiers.
minor comments (2)
  1. The distinction between the proposed triadic capture and existing open human-AI dialogue logs could be sharpened with a short table contrasting failure modes (context drift, coordination loss, specification ambiguity) that each data type has been observed to address or fail to address.
  2. The manuscript would benefit from a brief forward-looking paragraph on potential ethical and privacy considerations of instrumenting cross-functional teams, even if only to note that the 12-18 month timeline assumes these can be managed with existing protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our position paper. The feedback has prompted us to clarify key aspects of our argument and strengthen the presentation of the proposed framework. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract] The claim that triadic data 'is the empirical key to four open questions in agent training' is load-bearing for the position, yet the paper provides no explicit mapping from the three synchronized data streams to concrete training objectives (e.g., next-token prediction on conversation threads, preference pairs derived from stimulated recall, or retrieval-augmented context for coordination failures). Without this linkage the necessity argument remains an untested hypothesis rather than a secured requirement.

    Authors: We agree that an explicit mapping from the triadic data streams to training objectives would make the necessity argument more concrete. Although the body of the manuscript discusses these connections (human-human conversations supplying context formation absent from solo trajectories, human-AI sessions showing consumption patterns, and cross-functional records capturing coordination failures), the abstract does not spell them out. In the revised version we have updated the abstract with a brief indication of the linkages and added a dedicated subsection containing a table that maps each stream to specific objectives: next-token prediction on synchronized human-human threads for context modeling, preference pairs derived from stimulated-recall annotations for coordination decisions, and retrieval-augmented generation over cross-functional logs to address multi-party failures. This change turns the claim into a more testable proposal (a code sketch of this mapping follows the responses below). revision: yes

  2. Referee: [Four-tier evidence framework] The framework is presented as the required justification standard for any corpus, but it is never applied even to a toy example or an existing public dataset. This omission leaves open whether the tiers are operationalizable at the scale needed for long-horizon SWE data, and it therefore weakens the assertion that triadic data can be 'justified to a fine-tuning researcher' through these tiers.

    Authors: The four-tier framework is proposed as a forward-looking validation standard for future corpora rather than a retrospective check on existing data, since no large-scale synchronized triadic SWE corpus yet exists. To address operationalizability, the revised manuscript adds a new section with a worked toy example on a small, artificially constructed dataset of short-horizon tasks. The example illustrates each tier in sequence: mechanical verification via timestamp and role synchronization checks, statistical corpus characterization via length and participant-role distributions, probe experiments via small-scale fine-tuning and context-retention metrics, and pre-registered blind evaluation via blinded expert scoring of agent outputs. While the toy instance is necessarily limited in scale, it demonstrates that the tiers are concrete and extensible, thereby supporting the claim that triadic data can be justified to fine-tuning researchers (a code sketch of the Tier-1 checks follows the responses below). revision: yes
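The stream-to-objective mapping in the first response reduces naturally to a corpus-compilation step. The sketch below is our reading of that mapping, not code from the paper or its revision; the record layout and every key name are hypothetical.

```python
# Our reading of the rebuttal's stream-to-objective table. The record layout
# and every key name below are hypothetical.

def to_training_examples(record: dict) -> dict[str, list]:
    """Compile one triadic record into the three example types the
    rebuttal names: LM targets, preference pairs, and retrieval items."""
    out = {"next_token": [], "preference": [], "retrieval": []}

    # 1. Next-token prediction on synchronized human-human threads,
    #    so the model sees context being formed turn by turn.
    turns = record["human_human_turns"]        # [{"speaker": ..., "text": ...}]
    for i in range(1, len(turns)):
        history = "\n".join(f"{t['speaker']}: {t['text']}" for t in turns[:i])
        out["next_token"].append({"prompt": history, "target": turns[i]["text"]})

    # 2. Preference pairs from stimulated-recall annotations, where the
    #    expert's retrospective commentary marks the better coordination move.
    for ann in record["stimulated_recall"]:    # [{"chosen": ..., "rejected": ...}]
        out["preference"].append({"chosen": ann["chosen"],
                                  "rejected": ann["rejected"]})

    # 3. Retrieval items over cross-functional logs, so an agent can be
    #    conditioned on surrounding team activity at inference time.
    for event in record["team_workflow"]:      # [{"timestamp": ..., "summary": ...}]
        out["retrieval"].append({"key": event["timestamp"],
                                 "text": event["summary"]})
    return out
```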
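Of the four tiers, only Tier 1 reduces to straightforward code. Below is a minimal sketch of the timestamp and role checks the second response describes, with the record layout, role set, and channel labels assumed for illustration.

```python
# Sketch of Tier-1 mechanical verification as the worked example describes it:
# timestamp and role synchronization checks per record. The record layout,
# role set, and channel labels are assumed for illustration.

KNOWN_ROLES = {"engineer", "product_manager", "designer", "data_scientist", "agent"}
KNOWN_CHANNELS = {"human-human", "human-ai", "team-workflow"}

def tier1_check(record: dict) -> list[str]:
    """Return mechanical failures; an empty list means the record passes T1."""
    failures = []
    events = record["events"]   # [{"timestamp": ..., "role": ..., "channel": ...}]

    # Timestamps must be monotone non-decreasing on the shared clock.
    stamps = [e["timestamp"] for e in events]
    if any(b < a for a, b in zip(stamps, stamps[1:])):
        failures.append("timestamps not monotone")

    # Roles and channel labels must come from the declared vocabularies.
    for e in events:
        if e["role"] not in KNOWN_ROLES:
            failures.append(f"unknown role: {e['role']!r}")
        if e["channel"] not in KNOWN_CHANNELS:
            failures.append(f"unknown channel: {e['channel']!r}")
    return failures
```

Tiers 2 through 4 do not reduce to checks like this: they call for distributional summaries, probe fine-tunes, and blinded human scoring respectively, which is why the rebuttal demonstrates them on a constructed toy dataset.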

Circularity Check

0 steps flagged

No circularity; position paper with no derivations or self-referential reductions

full rationale

This is a forward-looking position paper arguing for triadic data as substrate for long-horizon SWE agents and proposing a four-tier evidence framework. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claims are presented as arguments and hypotheses ('We argue that...', 'We further specify...') rather than reductions of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify premises. The paper is self-contained as a conceptual proposal without any load-bearing step that collapses to its own definitions or fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about current data limitations and the efficacy of the proposed triadic approach, plus invented concepts introduced without independent evidence.

axioms (2)
  • domain assumption Frontier SWE agents have saturated short-horizon benchmarks while regressing on long-horizon work
    Opening premise of the abstract used to motivate the need for new data.
  • ad hoc to paper Triadic data is the empirical key to four open questions in agent training
    Load-bearing assertion that the proposed data will resolve specific open problems.
invented entities (3)
  • triadic data no independent evidence
    purpose: Synchronized capture of human-human conversations, human-AI sessions, and multi-week cross-functional work
    Core new concept introduced to define the required training substrate.
  • stimulated-recall protocols no independent evidence
    purpose: Method for capturing long-horizon expert trajectories
    One of the two canonical products proposed.
  • simulated cross-functional companies no independent evidence
    purpose: Instrumented teams working through ambiguous deliverables on shared infrastructure
    Second canonical product proposed as a scalable data source.

pith-pipeline@v0.9.0 · 5545 in / 1751 out tokens · 62569 ms · 2026-05-08T18:26:16.740926+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 15 canonical work pages · 3 internal anchors


  1. [3] M. MacDiarmid, B. Wright, J. Uesato, J. Benton, J. Kutasov, S. Price, N. Bouscal, S. Bowman, T. Bricken, A. Cloud, C. Denison, J. Gasteiger, R. Greenblatt, J. Leike, J. Lindsey, V. Mikulik, E. Perez, A. Rodrigues, D. Thomas, et al. Natural emergent misalignment from reward hacking in production RL, 2025.


  2. [24] R. Aleithan, H. Xue, M. M. Mohajer, E. Nnorom, G. Uddin, and S. Wang. SWE-Bench+: Enhanced coding benchmark for LLMs. arXiv preprint arXiv:2410.06992, 2024. URL https://arxiv.org/abs/2410.06992

  3. [25] Anthropic. Effective harnesses for long-running agents. Anthropic Engineering Blog, Nov. 2025. URL https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

  4. [26] Anthropic. Scaling managed agents: Decoupling the brain from the hands. Anthropic Engineering Blog, Apr. 2026. URL https://www.anthropic.com/engineering/managed-agents

  5. [27] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. URL https://arxiv.org/abs/1804.02748

  6. [28] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. URL https://arxiv.org/abs/2501.12948. Also published in Nature 645:633-638, 2025.

  7. [29] X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. Hendryx, Z. Wang, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025. URL https://arxiv.org/abs/2509.16941

  8. [30] K. A. Ericsson and H. A. Simon. Protocol Analysis: Verbal Reports as Data (Revised Edition). MIT Press, Cambridge, MA, 1993. ISBN 9780262050470

  9. [31] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. URL https://arxiv.org/abs/2110.07058

  10. [32] W. Harding and M. Kloster. Coding on Copilot: 2023 data suggests downward pressure on code quality (incl. 2024 projections). GitClear research report, Jan. 2024. URL https://www.gitclear.com/coding_on_copilot_data_shows_ais_downward_pressure_on_code_quality

  11. [33] Y. Kim. What quality means for next-generation SWE training data. Working memo, 2026. Available from the author at yelinkim@umich.edu

  12. [34] A. Kottamasu, C. Mahapatra, S. Lee, B. Pan, A. Barthwal, A. Datta, A. Gupta, P. Mehta, A. Arun, S. Alberti, A. Hiremath, B. Foody, and B. Vidgen. APEX: A benchmark for evaluating AI on senior software engineering tasks. arXiv preprint arXiv:2601.08806, 2026. URL https://arxiv.org/abs/2601.08806. Mercor and Cognition.

  13. [35] T. Kwa, B. West, J. Becker, A. Deng, K. Garcia, M. Hasin, S. Jawhar, M. Kinniment, N. Rush, S. V. Arx, R. Bloom, T. Broadley, H. Du, B. Goodrich, N. Jurkovic, L. H. Miles, S. Nix, T. Lin, N. Parikh, D. Rein, L. J. K. Sato, H. Wijk, D. M. Ziegler, E. Barnes, and L. Chan. Measuring AI ability to complete long software tasks. arXiv preprint arXiv:2503.14499, 2025. URL https://arxiv.org/abs/2503.14499

  14. [36] Laude Institute. Terminal-Bench 2.0 leaderboard. tbench.ai, 2026. URL https://www.tbench.ai/leaderboard/terminal-bench/2.0

  15. [37] S. Miserendino, M. Wang, T. Patwardhan, and J. Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? arXiv preprint arXiv:2502.12115, 2025. URL https://arxiv.org/abs/2502.12115

  16. [38] N. Perry, M. Srivastava, D. Kumar, and D. Boneh. Do users write more insecure code with AI assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23), 2023. URL https://arxiv.org/abs/2211.03622

  17. [39] M. V. T. Thai, T. Le, D. N. Manh, H. P. Nhat, and N. D. Q. Bui. SWE-EVO: Benchmarking coding agents in long-horizon software evolution scenarios. arXiv preprint arXiv:2512.18470, 2025. URL https://arxiv.org/abs/2512.18470

  18. [40] S. Vijayvargiya, X. Zhou, A. Yerukola, M. Sap, and G. Neubig. Interactive agents to overcome ambiguity in software engineering. arXiv preprint arXiv:2502.13069, 2025. URL https://arxiv.org/abs/2502.13069. Accepted at ICLR 2026.

  19. [41] Z. Z. Wang. Position: Humans are missing from AI coding agent research. Manuscript/preprint, 2026. URL https://zorazrw.github.io/files/position-haicode.pdf

  20. [42] J. Yang, C. E. Jimenez, A. L. Zhang, K. Lieret, J. Yang, X. Wu, O. Press, N. Muennighoff, G. Synnaeve, K. R. Narasimhan, D. Yang, S. I. Wang, and O. Press. SWE-bench Multimodal: Do AI systems generalize to visual software domains? arXiv preprint arXiv:2410.03859, 2024. URL https://arxiv.org/abs/2410.03859

  21. [43] Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/2504.13837. Oral presentation.

  22. [44] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. LIMA: Less is more for alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.11206