The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
Pith reviewed 2026-05-08 18:26 UTC · model grok-4.3
The pith
Software engineering agents need triadic data that records human conversations, AI sessions, and full team workflows to handle long-horizon projects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the training substrate required for capable long-horizon software engineering agents is triadic data: synchronized capture of the human-human conversations where engineering context is created, the human-AI sessions where that context is consumed, and the multi-week cross-functional work that encompasses both. This substrate is realized through long-horizon expert trajectories captured under stimulated-recall protocols and through simulated cross-functional companies staffed by instrumented senior teams. Any candidate corpus must demonstrate its quality via mechanical verification, statistical characterization, probe experiments, and pre-registered blind evaluation, and the authors state that such triadic data is capturable within 12-18 months using methods already mature in adjacent fields.
What carries the argument
Triadic data: synchronized capture of human-human conversations, human-AI sessions, and surrounding multi-week cross-functional work.
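One way to make the definition concrete is as a data structure: three event streams sharing a single project timeline. This is a minimal illustrative sketch, not a schema from the paper; the class and field names (`Event`, `TriadicRecord`, `merged_timeline`) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    timestamp: float  # seconds since project start, shared across all streams
    actor: str        # e.g. "engineer_a", "pm", "agent"
    content: str

@dataclass
class TriadicRecord:
    # The three synchronized streams the paper names:
    human_human: List[Event] = field(default_factory=list)       # conversations where context is formed
    human_ai: List[Event] = field(default_factory=list)          # sessions where context is consumed
    cross_functional: List[Event] = field(default_factory=list)  # surrounding multi-week work

    def merged_timeline(self) -> List[Event]:
        """Interleave all three streams by timestamp -- the 'synchronized' in synchronized capture."""
        return sorted(
            self.human_human + self.human_ai + self.cross_functional,
            key=lambda e: e.timestamp,
        )
```

The point of the sketch is that synchronization is a property of the record, not of any single stream: only the merged timeline shows an AI session consuming context that a human conversation created minutes earlier.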
If this is right
- Agents trained on triadic data will succeed on long-horizon, multi-engineer, ambiguous-specification deliverables where current agents regress.
- The four-tier evidence framework will let researchers rigorously justify the quality of any training corpus before fine-tuning.
- Stimulated-recall trajectories and instrumented simulated companies will supply the empirical basis for answering four open questions in agent training.
- The field's near-term research agenda should include explicit collection of triadic data within 12-18 months using mature methods from adjacent fields.
Where Pith is reading between the lines
- Collecting full triadic records could shift data efforts away from scaling solo trajectories toward instrumenting real collaborative environments.
- The same synchronized-capture approach might apply to other long-horizon collaborative domains that rely on shared context.
- Partial versions of triadic data could be tested to determine whether the complete triad is required or if subsets already produce gains.
- New synchronization tools for conversations, actions, and deliverables will likely be needed to build these corpora at scale.
Load-bearing premise
That collecting and applying triadic data under the four-tier evidence framework will close the performance gap on long-horizon tasks instead of merely describing a plausible but untested data strategy.
What would settle it
A pre-registered blind evaluation on long-horizon software engineering benchmarks in which agents fine-tuned on triadic data show no measurable improvement over agents trained on standard GitHub scrapes or solo trajectories.
Original abstract
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field's near-term research agenda should include it explicitly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper claims that frontier SWE agents have saturated short-horizon benchmarks but are regressing on long-horizon, multi-engineer deliverables with ambiguous specifications. It argues that neither larger GitHub scrapes, solo-agent trajectories, nor open human-AI dialogue logs are sufficient; the required substrate is triadic data consisting of synchronized human-human conversations (where context is formed), human-AI sessions (where context is consumed), and the surrounding multi-week cross-functional work. The paper proposes two canonical capture methods—long-horizon expert trajectories via stimulated-recall protocols and instrumented simulated cross-functional companies—and specifies a four-tier evidence framework (mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation) that any corpus must satisfy to justify its quality for fine-tuning.
Significance. If the proposed triadic data collection strategy and four-tier framework can be realized and shown to improve long-horizon performance, the work would usefully redirect data-centric research in software engineering agents away from purely observational corpora toward synchronized multi-party records. The explicit four-tier validation structure is a constructive contribution that supplies a concrete checklist for future data papers, and the 12-18 month capturability claim (drawing on mature methods from adjacent fields) provides a realistic timeline that could accelerate empirical follow-up.
major comments (2)
- [Abstract] Abstract: the claim that triadic data 'is the empirical key to four open questions in agent training' is load-bearing for the position yet provides no explicit mapping from the three synchronized data streams to concrete training objectives (e.g., next-token prediction on conversation threads, preference pairs derived from stimulated recall, or retrieval-augmented context for coordination failures). Without this linkage the necessity argument remains an untested hypothesis rather than a secured requirement.
- [Four-tier evidence framework] Section defining the four-tier evidence framework: the framework is presented as the required justification standard for any corpus, but it is never applied even to a toy example or existing public dataset. This omission leaves open whether the tiers are operationalizable at the scale needed for long-horizon SWE data and therefore weakens the assertion that triadic data can be 'justified to a fine-tuning researcher' through these tiers.
minor comments (2)
- The distinction between the proposed triadic capture and existing open human-AI dialogue logs could be sharpened with a short table contrasting failure modes (context drift, coordination loss, specification ambiguity) that each data type has been observed to address or fail to address.
- The manuscript would benefit from a brief forward-looking paragraph on potential ethical and privacy considerations of instrumenting cross-functional teams, even if only to note that the 12-18 month timeline assumes these can be managed with existing protocols.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our position paper. The feedback has prompted us to clarify key aspects of our argument and strengthen the presentation of the proposed framework. Below we respond point-by-point to the major comments.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that triadic data 'is the empirical key to four open questions in agent training' is load-bearing for the position yet provides no explicit mapping from the three synchronized data streams to concrete training objectives (e.g., next-token prediction on conversation threads, preference pairs derived from stimulated recall, or retrieval-augmented context for coordination failures). Without this linkage the necessity argument remains an untested hypothesis rather than a secured requirement.
Authors: We agree that an explicit mapping from the triadic data streams to training objectives would make the necessity argument more concrete. Although the body of the manuscript discusses these connections (human-human conversations supplying context formation absent from solo trajectories, human-AI sessions showing consumption patterns, and cross-functional records capturing coordination failures), the abstract does not spell them out. In the revised version we have updated the abstract with a brief indication of the linkages and added a dedicated subsection containing a table that maps each stream to specific objectives: next-token prediction on synchronized human-human threads for context modeling, preference pairs derived from stimulated-recall annotations for coordination decisions, and retrieval-augmented generation over cross-functional logs to address multi-party failures. This change turns the claim into a more testable proposal.
Revision: yes
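The stream-to-objective mapping the authors describe could be sketched as a routing function over a triadic record. The record layout and routing rules below are assumptions for illustration; the paper's revised table may differ.

```python
def training_examples(record):
    """Route each triadic stream to the training objective named in the rebuttal.

    `record` is assumed to be a dict with keys "human_human", "human_ai",
    and "cross_functional", each a list of event dicts (illustrative layout).
    """
    examples = []
    # 1. Next-token prediction on synchronized human-human threads (context modeling).
    for ev in record["human_human"]:
        examples.append({"objective": "next_token", "text": ev["content"]})
    # 2. Preference pairs from stimulated-recall annotations on coordination decisions.
    for ev in record["human_ai"]:
        if "chosen" in ev and "rejected" in ev:
            examples.append({
                "objective": "preference_pair",
                "chosen": ev["chosen"],
                "rejected": ev["rejected"],
            })
    # 3. Retrieval-augmented context drawn from cross-functional logs.
    for ev in record["cross_functional"]:
        examples.append({"objective": "retrieval_context", "text": ev["content"]})
    return examples
```

Writing the mapping down this way makes the referee's point testable: each stream's contribution can be ablated by dropping one branch of the router.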
-
Referee: [Four-tier evidence framework] Section defining the four-tier evidence framework: the framework is presented as the required justification standard for any corpus, but it is never applied even to a toy example or existing public dataset. This omission leaves open whether the tiers are operationalizable at the scale needed for long-horizon SWE data and therefore weakens the assertion that triadic data can be 'justified to a fine-tuning researcher' through these tiers.
Authors: The four-tier framework is proposed as a forward-looking validation standard for future corpora rather than a retrospective check on existing data, since no large-scale synchronized triadic SWE corpus yet exists. To address operationalizability, the revised manuscript adds a new section with a worked toy example on a small, artificially constructed dataset of short-horizon tasks. The example illustrates each tier in sequence: mechanical verification via timestamp and role synchronization checks, statistical corpus characterization via length and participant-role distributions, probe experiments via small-scale fine-tuning and context-retention metrics, and pre-registered blind evaluation via blinded expert scoring of agent outputs. While the toy instance is necessarily limited in scale, it demonstrates that the tiers are concrete and extensible, thereby supporting the claim that triadic data can be justified to fine-tuning researchers.
Revision: yes
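The framework's first tier (mechanical verification) is the easiest to make concrete. A minimal sketch, assuming a toy record layout and an illustrative role set: check that timestamps are monotone within each stream and that every event carries a known role. None of these names come from the paper.

```python
# Illustrative role set; a real corpus would define this per capture protocol.
KNOWN_ROLES = {"engineer", "pm", "designer", "data_scientist", "agent"}

def verify_mechanically(record):
    """Return a list of violations; an empty list means the record passes tier 1.

    `record` is assumed to map stream names to lists of event dicts with
    a timestamp key "t" and a "role" key (toy layout for illustration).
    """
    violations = []
    for stream_name, events in record.items():
        timestamps = [e["t"] for e in events]
        if timestamps != sorted(timestamps):
            violations.append(f"{stream_name}: timestamps out of order")
        for e in events:
            if e["role"] not in KNOWN_ROLES:
                violations.append(f"{stream_name}: unknown role {e['role']!r}")
    return violations
```

The higher tiers (corpus statistics, probe experiments, blind evaluation) require models and annotators, but tier 1 can run as a plain CI check over the raw capture, which is what makes the framework operationalizable at all.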
Circularity Check
No circularity; position paper with no derivations or self-referential reductions
Full rationale
This is a forward-looking position paper arguing for triadic data as substrate for long-horizon SWE agents and proposing a four-tier evidence framework. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claims are presented as arguments and hypotheses ('We argue that...', 'We further specify...') rather than reductions of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify premises. The paper is self-contained as a conceptual proposal without any load-bearing step that collapses to its own definitions or fitted quantities.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Frontier SWE agents have saturated short-horizon benchmarks while regressing on long-horizon work
- ad hoc to paper Triadic data is the empirical key to four open questions in agent training
invented entities (3)
- triadic data (no independent evidence)
- stimulated-recall protocols (no independent evidence)
- simulated cross-functional companies (no independent evidence)