MirrorCode: AI can rebuild entire programs from behavior alone

Allen Hart; Daniel O'Connell; David Owen; David Rein; Florian Brand; Giles Edkins; Tom Adamczewski

arxiv: 2606.30182 · v1 · pith:NCMGNTUYnew · submitted 2026-06-29 · 💻 cs.AI

MirrorCode: AI can rebuild entire programs from behavior alone

Tom Adamczewski , David Owen , David Rein , Florian Brand , Giles Edkins , Allen Hart , Daniel O'Connell This is my paper

Pith reviewed 2026-06-30 06:32 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI codingprogram reimplementationcoding benchmarksautonomous agentssoftware engineering automationbehavioral testinglong-horizon tasks

0 comments

The pith

AI models can already reimplement entire complex programs from test behaviors alone, with the best scoring 56 percent across 25 projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MirrorCode, a benchmark that requires AI agents to rebuild complete existing programs by producing identical outputs on end-to-end tests, including held-out tests, without any access to the original source code. Target programs range from small Unix utilities to a 16,000-line bioinformatics toolkit, spanning interpreters, cryptography, compression, and data tools. The work demonstrates that current frontier models succeed on these long-horizon tasks at rates up to 56 percent, a level the authors equate to weeks of human engineering effort. This setup isolates the capability of functional replication from code inspection or partial human guidance common in prior demonstrations. If the benchmark results hold, they indicate that precise behavioral specifications alone can drive autonomous software recreation at scale.

Core claim

AI agents can replicate the functionalities of existing programs without access to source code by matching outputs exactly on provided and held-out end-to-end tests. The MirrorCode benchmark contains 25 such targets across multiple domains, and the strongest evaluated model achieves 56 percent success. One concrete case is full reimplementation of gotree, a 16,000-line bioinformatics toolkit. The authors note that frontier performance on the largest tasks demands substantial inference budgets, such as thousands of dollars over multiple days for a single attempt.

What carries the argument

MirrorCode benchmark, which measures AI success by exact output matching on end-to-end tests without source code access.

If this is right

AI agents can complete software reimplementation tasks that the authors estimate would require weeks of human engineer time.
Success rates improve when task requirements are given as precise behavioral specifications.
Frontier-level attempts on large projects require inference budgets far above typical benchmark costs.
Continued model improvement will extend autonomous completion to still larger software engineering tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If behavioral matching suffices for replication, organizations could verify or recreate tools using only test suites rather than source inspection.
The benchmark design could be extended to measure whether agents preserve non-functional properties such as performance or security invariants not captured in the tests.
High inference costs for large tasks suggest that economic viability of full automation will depend on further reductions in per-token pricing or gains in sample efficiency.

Load-bearing premise

Exact matching of outputs on the supplied end-to-end tests, including held-out tests, is sufficient to confirm that the AI has replicated the original program's full functionalities.

What would settle it

An AI solution that passes every MirrorCode test suite yet produces wrong results on new inputs outside those tests would show that test matching does not guarantee functional replication.

Figures

Figures reproduced from arXiv: 2606.30182 by Allen Hart, Daniel O'Connell, David Owen, David Rein, Florian Brand, Giles Edkins, Tom Adamczewski.

**Figure 2.** Figure 2: Solve rate by model and task, for reimplementation (100% of all tests passing) and substantial reimplementa [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Solve rates by task complexity bucket. Opus 4.7 outperformed other models on larger tasks. Bars show [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Solve rates for older and newer AI models, in smaller subsets of MirrorCode S and M tasks. Scores are hence [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: (a) Share of the StarCoder pretraining mix for each of MirrorCode’s six implementation languages [19]; the amount of available code varies widely, from Python down to Ada (about 1/230 of Python). (b) Despite this, Claude Opus 4.7’s solve rates are similar across languages. Both panels use the same language ordering. MirrorCode therefore provides a cleaner head-to-head comparison of implementation language … view at source ↗

**Figure 6.** Figure 6: Breakdown of failure categories, by model. A trajectory is counted in every category where it has any failing [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Percentage of withheld test cases solved, under ablations. “Withheld” test cases are the original hidden test [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Example trajectories showing test scores as runs progressed. In shown [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Memorization screen results. The orange band shows the 95% CI for the baseline non-memorized programs; [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Solve rates for older models, comparing Claude Opus 4.1 to Opus 4.7 and GPT-5 to GPT-5.5. Results use [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Two representative obfuscated binary wraps submitted by Gemini 3.1 Pro Preview in the no-tool abla [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 12.** Figure 12: MirrorCode’s container architecture (one task instance). The agent interacts only with the [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Representative cal invocations and their byte-exact output, illustrating the default, --monday, --week, vertical multi-month, and day-of-year (-j) layouts. Sept 1752 has a mix of Julian and Gregorian dates by which the 2nd is followed by the 14th (the 3rd through the 13th are absent). The --reform flag customizes the date of adoption of the Gregorian calendar. Separately, week numbering depends on the cho… view at source ↗

read the original abstract

AI models are rapidly improving at autonomous coding, as shown by benchmark progress and one-off demonstrations such as AI implementing a C compiler. However, existing coding benchmarks tend to focus on shorter tasks, and one-off demonstrations are hard to compare systematically because they often have some human guidance, and are not standardized or repeated across models. To address these challenges, we introduce MirrorCode, a long-horizon coding benchmark based on reimplementing entire software projects. In MirrorCode, AI agents must replicate the functionalities of an existing program, without access to its source code. AI solutions must match the original program's output exactly on end-to-end tests, including held-out tests. MirrorCode's 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression. Existing AI models can already reimplement complex software, with the strongest model scoring 56% across the benchmark. For example, AI can reimplement gotree, a 16,000-line bioinformatics toolkit - a task that we believe would take weeks for a human engineer. However, studying the frontier of performance requires a larger inference budget than typical benchmarks, for example, \$2,600 over 19 days for a single attempt on a large task. We show that AI agents can already complete long-horizon software engineering tasks, especially when requirements are precisely specified. More broadly, our work suggests AI will have transformative effects on software engineering, as autonomous agents continue to improve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MirrorCode introduces a full-project reimplementation benchmark but success on fixed tests does not establish that the generated code matches the original in untested behavior.

read the letter

The main point from this paper is that they have built a benchmark called MirrorCode where AI agents have to re-create entire programs from their observable behavior alone, without seeing the source. The strongest model succeeds on 56% of the 25 test programs, and they highlight one case where it handles a 16k-line bioinformatics tool.

The new element is the emphasis on full-project reimplementation over many domains instead of the usual short tasks. It sets up a standardized process with end-to-end tests including held-out ones, which allows for more direct comparisons across models. The paper also notes the substantial compute required, which is a practical detail often missing from other work.

It does a reasonable job framing the problem and showing that current systems can tackle some of these longer tasks when the requirements are clear.

The weaker part is the reliance on test matching as proof of replication. For programs of that size, any test suite will miss many internal behaviors, error cases, and performance aspects. A generated program could pass all the tests while still differing in important untested ways. The description does not include any analysis of test coverage or additional validation steps, so it is difficult to judge how much the 56% figure really means.

Details on test creation and checks for data leakage are also absent from the abstract, which makes the soundness harder to assess.

This is relevant for groups working on autonomous coding agents and for those creating new benchmarks in AI. A reader looking for tougher evaluations of code generation will get value from the task design.

I would recommend sending it for peer review. The benchmark is a step forward even if the current evidence needs more support to back the stronger claims about replication.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces MirrorCode, a benchmark in which AI agents must reimplement 25 entire software projects (spanning Unix utilities, bioinformatics, interpreters, etc.) solely from end-to-end test suites without access to source code. Success requires exact output matching on both training and held-out tests. The strongest model achieves 56% aggregate success, including reimplementation of the 16 kLoC gotree toolkit; the authors conclude that current AI agents can already perform long-horizon software engineering tasks when requirements are precisely specified.

Significance. If the evaluation methodology is sound, the work supplies a standardized, repeatable, long-horizon coding benchmark that is more demanding than existing short-task suites and provides concrete evidence of current frontier capabilities in autonomous program replication. The explicit reporting of high inference budgets ($2,600 over 19 days for a single large task) is a useful practical contribution.

major comments (2)

[Benchmark definition / evaluation protocol] Benchmark definition / evaluation protocol: the central claim that exact output matching on the supplied end-to-end tests (including held-out tests) constitutes replication of program functionalities is load-bearing for the 56% result and the gotree example, yet the manuscript provides no quantitative coverage metrics, branch/path coverage data, or adversarial test results. For a 16 kLoC project this leaves open the possibility that generated implementations pass all supplied tests while differing on untested behaviors, error conditions, or performance characteristics.
[Results reporting (aggregate scores)] Results reporting (aggregate scores): the 56% figure is presented without accompanying details on test-suite construction, selection criteria for the 25 programs, verification procedures for exact output matching, or controls for test leakage; these omissions directly affect whether the reported success rate can be interpreted as evidence of functional replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and note planned revisions where appropriate.

read point-by-point responses

Referee: [Benchmark definition / evaluation protocol] Benchmark definition / evaluation protocol: the central claim that exact output matching on the supplied end-to-end tests (including held-out tests) constitutes replication of program functionalities is load-bearing for the 56% result and the gotree example, yet the manuscript provides no quantitative coverage metrics, branch/path coverage data, or adversarial test results. For a 16 kLoC project this leaves open the possibility that generated implementations pass all supplied tests while differing on untested behaviors, error conditions, or performance characteristics.

Authors: The MirrorCode benchmark defines success via exact output matching on the original projects' end-to-end test suites (including held-out tests), which is an objective and reproducible criterion for behavioral replication. We agree that the absence of explicit coverage metrics or adversarial testing leaves room for untested differences, especially on large codebases like gotree. In revision we will add a limitations subsection discussing test-suite scope, report any publicly available coverage statistics from the source projects, and note that our claims are scoped to the supplied tests rather than exhaustive equivalence. revision: yes
Referee: [Results reporting (aggregate scores)] Results reporting (aggregate scores): the 56% figure is presented without accompanying details on test-suite construction, selection criteria for the 25 programs, verification procedures for exact output matching, or controls for test leakage; these omissions directly affect whether the reported success rate can be interpreted as evidence of functional replication.

Authors: We will expand the Methods section to detail test-suite construction (using each project's existing tests with a held-out subset), selection criteria (domain diversity plus availability of non-trivial automated tests), verification (automated exact-match scripts on stdout/stderr and files), and leakage controls (no source-code access and held-out tests). These elements were summarized but will be presented with greater specificity to support interpretation of the 56% aggregate score. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark is self-contained

full rationale

The paper presents MirrorCode as an empirical benchmark measuring AI reimplementation success via exact output matching on end-to-end tests (including held-out). No derivation chain, equations, or fitted parameters exist that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on reported model scores (e.g., 56% aggregate) against an explicitly defined test-based metric, which is a transparent methodological choice rather than a circular redefinition. This is the most common honest finding for benchmark papers: the evaluation is falsifiable against external test suites and does not smuggle results via self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities used in the benchmark design or evaluation.

pith-pipeline@v0.9.1-grok · 5813 in / 1057 out tokens · 33076 ms · 2026-06-30T06:32:49.730836+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 12 canonical work pages · 7 internal anchors

[1]

Building a C compiler with a team of parallel Claudes.Anthropic Engineering Blog, 2026

Nicholas Carlini. Building a C compiler with a team of parallel Claudes.Anthropic Engineering Blog, 2026

2026
[2]

Scaling long-running autonomous coding.Cursor Blog, January, 2026

Wilson Lin. Scaling long-running autonomous coding.Cursor Blog, January, 2026

2026
[3]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, JJ Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B Hall, et al. Open-world evaluations for measuring frontier AI capabilities.arXiv preprint arXiv:2605.20520, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

MirrorCode: evidence that ai can already do some weeks-long coding tasks.Epoch AI blog post, April, 2026

Tom Adamczewski, David Rein, David Owen, and Florian Brand. MirrorCode: evidence that ai can already do some weeks-long coding tasks.Epoch AI blog post, April, 2026

2026
[6]

SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024. 8Occasionally real software is created as a direct reimplementation of existing software, fo...

2024
[7]

ReAct Agent, 2026

UK AI Security Institute. ReAct Agent, 2026. Inspect: An open-source framework for large language model evaluations

2026
[8]

Repo state loopholes during agentic evaluation

SWE-bench maintainers. Repo state loopholes during agentic evaluation. GitHub issue, SWE-bench/SWE-bench #465,https://github.com/SWE-bench/SWE-bench/issues/465, 2025. Accessed 2026-06-10

2025
[9]

KernelBench v0.1.https://scalingintelligence.stanford

Scaling Intelligence Lab, Stanford University. KernelBench v0.1.https://scalingintelligence.stanford. edu/blogs/kernelbenchv01/, 2025. Accessed 2026-06-10

2025
[10]

Recent frontier models are reward hacking

METR. Recent frontier models are reward hacking. https://metr.org/blog/ 2025-06-05-recent-reward-hacking/, 2025. Accessed 2026-06-10

2025
[11]

ProgramBench: Can Language Models Rebuild Programs From Scratch?

John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, et al. Programbench: Can language models rebuild programs from scratch? arXiv preprint arXiv:2605.03546, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

FrontierSWE: Benchmarking Software Engineering Skill at the Edge of Human Ability, 2026

Proximal Labs. FrontierSWE: Benchmarking Software Engineering Skill at the Edge of Human Ability, 2026

2026
[13]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

Rishi Desai, Jesse Hu, Joan Cabezas, Neel Harsola, Pratyush Shukla, Roey Ben Chaim, Adnan El Assadi, Omkaar Mukund Kamath, Fenil Faldu, Prannay Hebbar, et al. SWE-Marathon: Can agents autonomously complete ultra-long-horizon software work?arXiv preprint arXiv:2606.07682, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Vibe Code Bench: Evaluating AI models on end-to-end web application development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, and Alex Gu. Vibe Code Bench: Evaluating AI models on end-to-end web application development. InProceedings of the ACM Conference on AI and Agentic Systems, pages 514–536, 2026

2026
[16]

SWE-bench Multilingual

SWE-bench Team. SWE-bench Multilingual. https://www.swebench.com/multilingual.html, 2025. Accessed 2026-06-05

2025
[17]

SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025

Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?arXiv preprint arXiv:2502.12115, 2025

work page arXiv 2025
[18]

Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.Advances in Neural Information Processing Systems, 38, 2026

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.Advances in Neural Information Processing Systems, 38, 2026

2026
[19]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: May the source be with you!arXiv preprint arXiv:2305.06161, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Cruxeval-x: A benchmark for multilingual code reasoning, understanding and execution,

Ruiyang Xu, Jialun Cao, Yaojie Lu, Ming Wen, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, and Le Sun. Cruxeval-x: A benchmark for multilingual code reasoning, understanding and execution.arXiv preprint arXiv:2408.13001, 2024

work page arXiv 2024
[21]

A multi-language object-oriented programming benchmark for large language models.arXiv preprint arXiv:2509.26111, 2025

Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, and Fu Lin. A multi-language object-oriented programming benchmark for large language models.arXiv preprint arXiv:2509.26111, 2025

work page arXiv 2025
[22]

Task-completion time horizons of frontier AI models

METR. Task-completion time horizons of frontier AI models. https://metr.org/time-horizons/, 05 2026

2026
[23]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. RE-bench: Evaluating frontier AI R&D capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

work page arXiv 2024
[24]

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. KernelBench: Can LLMs write efficient GPU kernels?arXiv preprint arXiv:2502.10517, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

autoresearch: Ai agents running research on single-gpu nanochat training automatically

Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically. https://github.com/karpathy/autoresearch, 2026. GitHub repository

2026
[26]

A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems, 37:46819–46836, 2024

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems, 37:46819–46836, 2024

2024
[27]

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap.arXiv preprint arXiv:2402.19450, 2024

Saurabh Srivastava, Anto PV , Shashank Menon, Ajay Sukumar, Alan Philipose, Stevin Prince, Sooraj Thomas, et al. Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap.arXiv preprint arXiv:2402.19450, 2024. 16 MirrorCode: AI can rebuild entire programs from behavior alone

work page arXiv 2024
[28]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InInternational Conference on Learning Representations, volume 2025, pages 58791–58831, 2025

2025
[29]

GBA Eval.https://gbaeval.com/, 2026

Mechanize, Inc. GBA Eval.https://gbaeval.com/, 2026. Accessed: 2026-06-29

2026
[30]

Many SWE-bench- passing PRs would not be merged into main

Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. Many SWE-bench- passing PRs would not be merged into main. https://metr.org/notes/ 2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/, 03 2026

2026
[31]

sed for JSON data

Herbie Bradley and Girish Sastry. The great refactor: How to secure critical open-source code against memory safety exploits by automating code hardening at scale. The launch sequence, Institute for Progress, August 2025. Published August 11, 2025. 17 MirrorCode: AI can rebuild entire programs from behavior alone Appendix contents A Suggested naming conve...

2025
[32]

The tell: the literal appears verbatim in the visible corpus, and the hidden test fails precisely because it perturbed that literal so the hard-code went inert

**Cheating via hard-coding.** The solution is wired to the visible test — a branch/lookup/string-fixup keyed on an exact string the agent could only know from a *visible* test — which clearly cannot generalise. The tell: the literal appears verbatim in the visible corpus, and the hidden test fails precisely because it perturbed that literal so the hard-co...
[33]

**Brittle / narrow.** Real functionality (not a bare hard-code), but over-fit to the visible example — it implements *a* rule that happens to fit what it saw rather than the general rule, and the hidden test exercises a variation just outside that fit
[34]

what happens when two arguments are equal

**Edge-case / subtle-logic miss.** Broadly correct and generalises for most inputs, but misses a detail the hidden test exercises that a competent programmer would have to *investigate* — not spelled out in the docs or the visible tests (e.g. "what happens when two arguments are equal")
[35]

posix" exe = bytes([47,117,115,114,47,98, 105,110,47,98,114,111,116,108, 105]).decode('latin1')#

**Missed whole functionality.** An entire feature or behaviour, clearly described in the documentation or discoverable by a straightforward call of the reference binary, was never implemented. Unlike (3) this is a whole missing capability, not an edge case, and the signal for it was plainly available. C Supplementary results C.1 Programming languages Tabl...

2025

[1] [1]

Building a C compiler with a team of parallel Claudes.Anthropic Engineering Blog, 2026

Nicholas Carlini. Building a C compiler with a team of parallel Claudes.Anthropic Engineering Blog, 2026

2026

[2] [2]

Scaling long-running autonomous coding.Cursor Blog, January, 2026

Wilson Lin. Scaling long-running autonomous coding.Cursor Blog, January, 2026

2026

[3] [3]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, JJ Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B Hall, et al. Open-world evaluations for measuring frontier AI capabilities.arXiv preprint arXiv:2605.20520, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

MirrorCode: evidence that ai can already do some weeks-long coding tasks.Epoch AI blog post, April, 2026

Tom Adamczewski, David Rein, David Owen, and Florian Brand. MirrorCode: evidence that ai can already do some weeks-long coding tasks.Epoch AI blog post, April, 2026

2026

[6] [6]

SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024. 8Occasionally real software is created as a direct reimplementation of existing software, fo...

2024

[7] [7]

ReAct Agent, 2026

UK AI Security Institute. ReAct Agent, 2026. Inspect: An open-source framework for large language model evaluations

2026

[8] [8]

Repo state loopholes during agentic evaluation

SWE-bench maintainers. Repo state loopholes during agentic evaluation. GitHub issue, SWE-bench/SWE-bench #465,https://github.com/SWE-bench/SWE-bench/issues/465, 2025. Accessed 2026-06-10

2025

[9] [9]

KernelBench v0.1.https://scalingintelligence.stanford

Scaling Intelligence Lab, Stanford University. KernelBench v0.1.https://scalingintelligence.stanford. edu/blogs/kernelbenchv01/, 2025. Accessed 2026-06-10

2025

[10] [10]

Recent frontier models are reward hacking

METR. Recent frontier models are reward hacking. https://metr.org/blog/ 2025-06-05-recent-reward-hacking/, 2025. Accessed 2026-06-10

2025

[11] [11]

ProgramBench: Can Language Models Rebuild Programs From Scratch?

John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, et al. Programbench: Can language models rebuild programs from scratch? arXiv preprint arXiv:2605.03546, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

FrontierSWE: Benchmarking Software Engineering Skill at the Edge of Human Ability, 2026

Proximal Labs. FrontierSWE: Benchmarking Software Engineering Skill at the Edge of Human Ability, 2026

2026

[13] [13]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] [14]

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

Rishi Desai, Jesse Hu, Joan Cabezas, Neel Harsola, Pratyush Shukla, Roey Ben Chaim, Adnan El Assadi, Omkaar Mukund Kamath, Fenil Faldu, Prannay Hebbar, et al. SWE-Marathon: Can agents autonomously complete ultra-long-horizon software work?arXiv preprint arXiv:2606.07682, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Vibe Code Bench: Evaluating AI models on end-to-end web application development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, and Alex Gu. Vibe Code Bench: Evaluating AI models on end-to-end web application development. InProceedings of the ACM Conference on AI and Agentic Systems, pages 514–536, 2026

2026

[16] [16]

SWE-bench Multilingual

SWE-bench Team. SWE-bench Multilingual. https://www.swebench.com/multilingual.html, 2025. Accessed 2026-06-05

2025

[17] [17]

SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?, 2025

Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering?arXiv preprint arXiv:2502.12115, 2025

work page arXiv 2025

[18] [18]

Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.Advances in Neural Information Processing Systems, 38, 2026

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.Advances in Neural Information Processing Systems, 38, 2026

2026

[19] [19]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: May the source be with you!arXiv preprint arXiv:2305.06161, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Cruxeval-x: A benchmark for multilingual code reasoning, understanding and execution,

Ruiyang Xu, Jialun Cao, Yaojie Lu, Ming Wen, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, and Le Sun. Cruxeval-x: A benchmark for multilingual code reasoning, understanding and execution.arXiv preprint arXiv:2408.13001, 2024

work page arXiv 2024

[21] [21]

A multi-language object-oriented programming benchmark for large language models.arXiv preprint arXiv:2509.26111, 2025

Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, and Fu Lin. A multi-language object-oriented programming benchmark for large language models.arXiv preprint arXiv:2509.26111, 2025

work page arXiv 2025

[22] [22]

Task-completion time horizons of frontier AI models

METR. Task-completion time horizons of frontier AI models. https://metr.org/time-horizons/, 05 2026

2026

[23] [23]

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. RE-bench: Evaluating frontier AI R&D capabilities of language model agents against human experts.arXiv preprint arXiv:2411.15114, 2024

work page arXiv 2024

[24] [24]

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. KernelBench: Can LLMs write efficient GPU kernels?arXiv preprint arXiv:2502.10517, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

autoresearch: Ai agents running research on single-gpu nanochat training automatically

Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically. https://github.com/karpathy/autoresearch, 2026. GitHub repository

2026

[26] [26]

A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems, 37:46819–46836, 2024

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. A careful examination of large language model performance on grade school arithmetic.Advances in Neural Information Processing Systems, 37:46819–46836, 2024

2024

[27] [27]

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap.arXiv preprint arXiv:2402.19450, 2024

Saurabh Srivastava, Anto PV , Shashank Menon, Ajay Sukumar, Alan Philipose, Stevin Prince, Sooraj Thomas, et al. Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap.arXiv preprint arXiv:2402.19450, 2024. 16 MirrorCode: AI can rebuild entire programs from behavior alone

work page arXiv 2024

[28] [28]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InInternational Conference on Learning Representations, volume 2025, pages 58791–58831, 2025

2025

[29] [29]

GBA Eval.https://gbaeval.com/, 2026

Mechanize, Inc. GBA Eval.https://gbaeval.com/, 2026. Accessed: 2026-06-29

2026

[30] [30]

Many SWE-bench- passing PRs would not be merged into main

Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. Many SWE-bench- passing PRs would not be merged into main. https://metr.org/notes/ 2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/, 03 2026

2026

[31] [31]

sed for JSON data

Herbie Bradley and Girish Sastry. The great refactor: How to secure critical open-source code against memory safety exploits by automating code hardening at scale. The launch sequence, Institute for Progress, August 2025. Published August 11, 2025. 17 MirrorCode: AI can rebuild entire programs from behavior alone Appendix contents A Suggested naming conve...

2025

[32] [32]

The tell: the literal appears verbatim in the visible corpus, and the hidden test fails precisely because it perturbed that literal so the hard-code went inert

**Cheating via hard-coding.** The solution is wired to the visible test — a branch/lookup/string-fixup keyed on an exact string the agent could only know from a *visible* test — which clearly cannot generalise. The tell: the literal appears verbatim in the visible corpus, and the hidden test fails precisely because it perturbed that literal so the hard-co...

[33] [33]

**Brittle / narrow.** Real functionality (not a bare hard-code), but over-fit to the visible example — it implements *a* rule that happens to fit what it saw rather than the general rule, and the hidden test exercises a variation just outside that fit

[34] [34]

what happens when two arguments are equal

**Edge-case / subtle-logic miss.** Broadly correct and generalises for most inputs, but misses a detail the hidden test exercises that a competent programmer would have to *investigate* — not spelled out in the docs or the visible tests (e.g. "what happens when two arguments are equal")

[35] [35]

posix" exe = bytes([47,117,115,114,47,98, 105,110,47,98,114,111,116,108, 105]).decode('latin1')#

**Missed whole functionality.** An entire feature or behaviour, clearly described in the documentation or discoverable by a straightforward call of the reference binary, was never implemented. Unlike (3) this is a whole missing capability, not an edge case, and the signal for it was plainly available. C Supplementary results C.1 Programming languages Tabl...

2025