pith. sign in

arxiv: 2606.20158 · v1 · pith:2HFQYXE7new · submitted 2026-06-18 · 💻 cs.SE

N-Version Programming with Coding Agents

Pith reviewed 2026-06-26 16:25 UTC · model grok-4.3

classification 💻 cs.SE
keywords N-version programmingcoding agentsAI code generationsoftware diversitymajority votingprogram reliabilityfailure modesLaunch Interceptor Program
0
0 comments X

The pith

Diversity from AI coding agents reduces mean failures in three-version units from 387 to 131.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether N-version programming works when versions come from different AI coding agents rather than human programmers. The authors generate 48 implementations of the Launch Interceptor Program using varied agents, models, and languages, then run all of them against one million randomized inputs and a shared oracle. Substantial common-mode failures appear, many traceable to ambiguities in the original specification. Despite those overlaps, grouping the versions into three-version units that decide by majority vote lowers the average failure count from 387.44 to 130.99, and 11,844 such units record zero failures across the entire test campaign.

Core claim

Although agent-generated implementations exhibit substantial common-mode failures traceable to ambiguities in the specification, majority voting across three-version units reduces the mean failure count from 387.44 to 130.99, with 11,844 units showing zero observed failures across the one-million-input campaign.

What carries the argument

Three-version majority-voting units assembled from outputs of diverse AI coding agents

If this is right

  • Majority voting can mitigate individual agent errors even when some failures are shared.
  • Diversity in agents, models, and languages contributes to fewer co-occurring failures than single versions.
  • A large fraction of possible agent combinations can produce units with no observed failures on the tested inputs.
  • N-version programming regains relevance as an engineering technique when versions are generated automatically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Selection of agent triples could be optimized by measuring pairwise failure overlap before deployment.
  • The same pattern may appear in other safety-critical domains where specifications contain hard-to-interpret sections.
  • Specification clarity remains a bottleneck that diversity alone does not remove.
  • Automated N-version construction could become routine for code that must meet high reliability targets.

Load-bearing premise

The 1,000,000 randomized test inputs sufficiently exercise the Launch Interceptor Program Specification and the shared oracle correctly classifies all behaviors as pass or fail.

What would settle it

Re-running the same three-version units on a fresh set of test inputs drawn from the same specification but outside the original million-input campaign and finding no drop in mean failures.

Figures

Figures reproduced from arXiv: 2606.20158 by Benoit Baudry, Javier Ron, Martin Monperrus.

Figure 1
Figure 1. Figure 1: N-version units are generated by AI coding agents, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Experimental workflow for revisiting Knight–Leveson with agentic coding: (1) version generation by each agent against [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: RQ1. Acceptance testing summary. Left: the total [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Failure-count distribution across the 48 admitted [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Filtered pairwise co-failure heatmap for the 21 versions with at least one observed campaign failure. Each cell shows [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: RQ3: Bucketed distributions of pairwise ϕ correlations under two diversity axes. For each bucket, the left stacked bar counts cross-language pairs by language pair, while the right stacked bar counts all version pairs by same-agent versus cross-agent relation. The exact-match bucket ϕ = 1 is separated from the near-perfect bucket (0.9, 1), showing that the high-correlation mass is dominated by correlated f… view at source ↗
Figure 7
Figure 7. Figure 7: RQ4: failure counts per specification items (aka LIC [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: RQ5: Bucketed failure-count distributions for single [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
read the original abstract

This paper revisits the classical concept on N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using the Knight-Leveson's, Launch Interceptor Program Specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, along the findings of Knight-Leveson. Further analysis that many of those co-occuring failures can be traced to where is specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results is the strongest evidence to date that N-Version Programming with coding agents is a useful engineering strategy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript revisits classical N-version programming using contemporary AI coding agents on the Knight-Leveson Launch Interceptor Program specification. It generates 48 agent-produced implementations across models and languages, evaluates them against a shared oracle on 1,000,000 randomized test inputs, reports substantial common-mode failures traceable to specification ambiguities, and claims that majority voting in three-version units reduces mean failure count from 387.44 to 130.99 while yielding 11,844 zero-failure units, concluding this constitutes the strongest evidence that NVP with coding agents is a useful engineering strategy.

Significance. If the empirical claims hold after addressing test validity, the work would be significant for software engineering: it supplies large-scale data (48 versions, 1M tests) extending the 1980s Knight-Leveson findings to AI-generated code, quantifies both the persistence of common-mode failures and the practical benefit of agent-induced diversity, and offers a concrete engineering strategy for improving reliability of LLM-produced software.

major comments (2)
  1. [Abstract] Abstract: The headline claims of a drop in mean failures from 387.44 to 130.99 and 11,844 zero-failure three-version units are load-bearing for the central conclusion. These counts rest on the unverified assumption that the 1,000,000 randomized inputs exercise all relevant behaviors of the Launch Interceptor Program (including the ambiguous regions identified in the post-hoc analysis) and that the shared oracle correctly classifies every outcome; no coverage metrics, comparison against the original Knight-Leveson test suite, or independent oracle validation are referenced.
  2. [Abstract] Abstract / Results: The reported mean failure counts are given as point values (387.44 and 130.99) without error bars, confidence intervals, or any statistical test of the observed reduction, undermining the ability to assess whether the diversity benefit is robust or an artifact of the particular test distribution.
minor comments (2)
  1. [Abstract] Abstract: grammatical error — 'Our original results is the strongest evidence' should read 'Our results are the strongest evidence'.
  2. [Abstract] Abstract: phrasing error — 'where is specification is particularly hard or ambiguous' should read 'where the specification is particularly hard or ambiguous'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on test validity and statistical reporting. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of a drop in mean failures from 387.44 to 130.99 and 11,844 zero-failure three-version units are load-bearing for the central conclusion. These counts rest on the unverified assumption that the 1,000,000 randomized inputs exercise all relevant behaviors of the Launch Interceptor Program (including the ambiguous regions identified in the post-hoc analysis) and that the shared oracle correctly classifies every outcome; no coverage metrics, comparison against the original Knight-Leveson test suite, or independent oracle validation are referenced.

    Authors: The comment correctly identifies that coverage metrics, explicit comparison to the original Knight-Leveson test suite, and independent oracle validation are not referenced in the manuscript. The 1M inputs were drawn uniformly from the specification-defined domains (matching the random-test component of the 1986 study), and the post-hoc failure analysis shows that ambiguous regions were exercised. However, formal coverage metrics and a side-by-side comparison with the original test vectors are absent. We will add a subsection on input generation and coverage considerations plus a direct methodological comparison to Knight-Leveson; we will also state the limitation that the oracle, while derived from the published specification, was not independently re-validated by a third party. revision: partial

  2. Referee: [Abstract] Abstract / Results: The reported mean failure counts are given as point values (387.44 and 130.99) without error bars, confidence intervals, or any statistical test of the observed reduction, undermining the ability to assess whether the diversity benefit is robust or an artifact of the particular test distribution.

    Authors: The observation is accurate: the manuscript reports only point estimates. With 1M tests per version the sampling variability is low, yet we agree that readers cannot judge robustness without uncertainty quantification. We will compute and report bootstrap confidence intervals on the per-version failure counts together with a paired statistical test of the reduction between single versions and triples, and we will update both the abstract and results section accordingly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements

full rationale

The paper reports failure counts obtained by executing 48 agent-generated implementations on 1,000,000 randomized inputs against a shared oracle for the Launch Interceptor Program. These counts (e.g., mean failures dropping from 387.44 to 130.99 under majority voting) are produced by the experimental procedure itself and do not reduce to any fitted parameter, self-definition, or self-citation chain. The derivation is the test campaign, which remains externally falsifiable by the input distribution and oracle. No equations or steps in the provided material exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that randomized testing with 1M inputs adequately samples the input space of the Launch Interceptor Program and that the oracle is ground truth; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The shared oracle correctly classifies all program behaviors on the 1M test inputs.
    The evaluation campaign depends on this oracle to count failures; any misclassification would directly affect the reported mean failure counts and zero-failure units.

pith-pipeline@v0.9.1-grok · 5697 in / 1212 out tokens · 22066 ms · 2026-06-26T16:25:02.723428+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 1 canonical work pages

  1. [1]

    Anthropic. 2024. Claude Code: Agentic Coding in the Terminal. https://docs.anthropic.com/en/docs/ claude-code. Accessed: 2026

  2. [2]

    Anysphere Inc. 2024. Cursor: The AI Code Editor. https: //www.cursor.com. Accessed: 2026

  3. [3]

    Algirdas Avizienis. 1985. The N-Version Approach to Fault-Tolerant Software.IEEE Transactions on Software Engineering11, 12 (1985), 1491–1501

  4. [4]

    Peter G. Bishop. 1995. Review of Software Design Diversity. InSoftware Fault Tolerance, Michael R. Lyu (Ed.). John Wiley & Sons, New York, NY , 211–229

  5. [5]

    Brilliant, John C

    Susan S. Brilliant, John C. Knight, and Nancy G. Leve- son. 1990. Analysis of Faults in an N-Version Software Experiment.IEEE Transactions on Software Engineering 16, 2 (1990), 238–247

  6. [6]

    Liming Chen and Algirdas Avizienis. 1978. N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation. InDigest of Papers, Eighth Annual International Symposium on Fault-Tolerant Com- puting (FTCS-8). Toulouse, France, 3–9

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond ´e de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code.arXiv preprint arXiv:2107.03374(2021)

  8. [8]

    Dunham and John L

    Janet R. Dunham and John L. Pierce. 1986.An Ex- periment in Software Reliability. NASA Contractor Re- port NASA-CR-172553. NASA Langley Research Cen- ter, Hampton, V A. https://ntrs.nasa.gov/api/citations/ 19860020075/downloads/19860020075.pdf Prepared by Research Triangle Institute under Contract NAS1-16489; originally dated March 1985, revised May 1986

  9. [9]

    Eckhardt and Larry D

    David E. Eckhardt and Larry D. Lee. 1985. A Theoretical Basis for the Analysis of Multiversion Software Subject to Coincident Errors.IEEE Transactions on Software Engineering11, 12 (1985), 1511–1517

  10. [10]

    Google DeepMind. 2024. Gemini CLI: An Open-Source AI Agent. https://github.com/google-gemini/gemini-cli. Accessed: 2026

  11. [11]

    Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu

    Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap. arXiv:2509.06216 [cs.SE] https: //arxiv.org/abs/2509.06216

  12. [12]

    Les Hatton. 1997. N-Version Design Versus One Good Version.IEEE Software14, 6 (1997), 71–76

  13. [13]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

  14. [14]

    InProceedings of the 12th International Conference on Learning Representations (ICLR)

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InProceedings of the 12th International Conference on Learning Representations (ICLR). OpenReview.net, Vienna, Austria

  15. [15]

    Marcus Kessel and Colin Atkinson. 2024. N-version as- sessment and enhancement of generative AI: differential GAI.IEEE Software42, 2 (2024), 76–83

  16. [16]

    Knight and Nancy G

    John C. Knight and Nancy G. Leveson. 1986. An Exper- imental Evaluation of the Assumption of Independence in Multiversion Programming.IEEE Transactions on Software Engineering12, 1 (1986), 96–109

  17. [17]

    Abhishek Kodati, Foutse Khomh, and Ashkan Sami. [n. d.]. MAC: Multi-Agent LLM Coder is All You Need. Available at SSRN 5887028([n. d.])

  18. [18]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R ´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-Level Code Generation with AlphaCode. Science378, 6624 (2022), 1092–1097

  19. [19]

    Bev Littlewood and Douglas R. Miller. 1989. Conceptual Modeling of Coincident Failures in Multiversion Soft- ware.IEEE Transactions on Software Engineering15, 12 (1989), 1596–1614

  20. [20]

    Dimakis, Sylvia Ratnasamy, Matei Zaharia, Aditya Parameswaran, and Ion Stoica

    Shu Liu, Alexander Krentsel, Shubham Agarwal, Mert Cemri, Ziming Mao, Soujanya Ponnapalli, Alexandros G. Dimakis, Sylvia Ratnasamy, Matei Zaharia, Aditya Parameswaran, and Ion Stoica. 2026. The Time is Here for Just-in-Time Systems: Challenges and Opportuni- ties. arXiv:2605.24096 [cs.DB] https://arxiv.org/abs/ 2605.24096

  21. [21]

    P ˘as˘areanu, and Guowei Yang

    Tarek Mahmud, Bin Duan, Corina S. P ˘as˘areanu, and Guowei Yang. 2025. Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach. arXiv preprint arXiv:2503.15838(2025)

  22. [22]

    OpenCode Contributors. 2024. OpenCode: The AI Cod- ing Agent Built for the Terminal. https://opencode.ai. Accessed: 2026

  23. [23]

    Javier Ron, Diogo Gaspar, Javier Cabrera-Arteaga, Benoit Baudry, and Martin Monperrus. 2025. Galapagos: Automated N-Version Programming with LLMs.ACM Transactions on Software Engineering and Methodology (2025). doi:10.1145/3785363

  24. [24]

    Thomas Valentin, Ardi Madadi, Gaetano Sapia, and Marcel B ¨ohme. 2025. Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation. arXiv:2507.00057 [cs.PL] https://arxiv.org/abs/2507. 12 00057

  25. [25]

    Chi, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, and Denny Zhou. 2023. Self-Consistency Im- proves Chain of Thought Reasoning in Language Mod- els. InProceedings of the 11th International Conference on Learning Representations (ICLR). OpenReview.net

  26. [26]

    Junjun Zheng, Hiroyuki Okamura, and Tadashi Dohi

  27. [27]

    InInternational Symposium on Software Fault Prevention, Verification, and Validation

    Can Generative AI Enhance the Effectiveness of N- Version Programming?. InInternational Symposium on Software Fault Prevention, Verification, and Validation. Springer, 1–16