arxiv: 2605.20520 · v1 · pith:QIAYVDZQnew · submitted 2026-05-19 · 💻 cs.AI

Open-World Evaluations for Measuring Frontier AI Capabilities

Pith reviewed 2026-05-21 06:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords open-world evaluations · frontier AI capabilities · AI benchmarks · qualitative analysis · long-horizon tasks · iOS app development · CRUX project

The pith

An AI agent developed and published a simple iOS app with only one manual intervention, showing open-world evaluations can flag frontier capabilities before benchmarks catch them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard benchmarks overstate or understate real AI capabilities because they rely on tasks that are easy to specify, automate, and score in short timeframes. It proposes open-world evaluations as a complement: long-horizon, messy, real-world tasks judged through small-sample qualitative review instead of large-scale automation. A concrete case shows an agent nearly completing the full process of building and releasing an iOS app to the App Store. Readers should care because these evaluations can give earlier signals that advanced abilities are about to become common. The authors introduce the CRUX project to run such tests regularly and offer practical recommendations for their design and reporting.

Core claim

The paper claims that open-world evaluations (long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than automated benchmarks) can detect emerging frontier AI capabilities that traditional benchmarks miss. In the reported instance, an AI agent was given the task of developing and publishing a simple iOS application to the Apple App Store and completed it with only a single avoidable manual intervention. This outcome is presented as evidence that such evaluations can serve as early warnings for capabilities that may soon become widespread. The work surveys prior open-world evaluations, notes their strengths and limits, launches the CRUX project to run them regularly, and closes with recommendations for designing and reporting such evaluations.

What carries the argument

open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation; this mechanism supplies a more realistic signal of deployed capability by removing the constraints of precise specification, automatic grading, and short horizons.

If this is right

  • Benchmarks alone are insufficient for tracking frontier AI progress and should be supplemented with open-world tasks.
  • Regular open-world evaluations through a project like CRUX can generate timely signals about capabilities approaching deployment.
  • AI agents are nearing the ability to handle end-to-end real-world software development with minimal human oversight.
  • Design and reporting recommendations for open-world evaluations should be adopted to improve their usefulness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If open-world evaluations become routine, they could inform capability thresholds used in AI governance or deployment policies.
  • The same qualitative approach could be applied to other messy domains such as scientific experimentation or physical-world planning to broaden capability tracking.
  • Success on app-store publication suggests open-world methods may soon highlight automation risks in creative and commercial software work.

Load-bearing premise

Small-sample qualitative analysis of one long-horizon task supplies a reliable and generalizable signal of frontier capabilities.

What would settle it

Multiple independent replications on similar long-horizon software tasks in which agents require many manual interventions or fail outright would undermine the claim that open-world evaluations give dependable early warnings.
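
To make the replication point concrete, here is a minimal sketch of how a tally of replications might be summarized. It assumes each replication is reduced to a binary success or failure, which is a simplification: the paper's assessment is qualitative. The function name `wilson_interval` and the 9-of-10 tally are hypothetical illustrations, not results from the paper; only the 1-of-1 case corresponds to the single reported run.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# The single reported run, treated as one success in one trial.
single_run = wilson_interval(1, 1)

# Hypothetical tally for illustration only: 9 successes in 10 replications.
hypothetical_ten = wilson_interval(9, 10)

print(f"1 of 1:  {single_run[0]:.2f} to {single_run[1]:.2f}")
print(f"9 of 10: {hypothetical_ten[0]:.2f} to {hypothetical_ten[1]:.2f}")
```

Under this simplification, one success in one trial leaves a lower bound of roughly 0.21, so a single run is compatible with a wide range of underlying success rates.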

Figures

Figures reproduced from arXiv: 2605.20520 by Andrew B. Hall, Andrew Schwartz, Arvind Narayanan, Cozmin Ududec, Dimitris Papailiopoulos, Gillian K Hadfield, Harry Coppock, Helen Toner, J.J. Allaire, Magda Dubois, Peter Kirgis, Rishi Bommasani, Sara Hooker, Sayash Kapoor, Seth Lazar, Shoshannah Tekofsky, Stephan Rabanser, Steve Newman.

Figure 1: Several popular benchmarks (SWE-Bench, ARC-AGI, [...]
Figure 2: A gradient of evaluation methodologies (short single-turn Q&A [...]
Figure 3: Cumulative API cost and timeline of CRUX #1. Total cost was approximately $991 over [...]
Figure 4: The App Store screenshots uploaded by the agent had visible formatting errors. The [...]
Original abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that benchmark-based evaluations can both overstate and understate frontier AI capabilities because they favor precisely specified, automatically gradable, short-horizon tasks. It advocates for complementary 'open-world evaluations' consisting of long-horizon, messy real-world tasks assessed via small-sample qualitative analysis. The paper surveys recent open-world evaluations, introduces the CRUX project for conducting them regularly, and presents a first instance in which an AI agent develops and publishes a simple iOS application to the Apple App Store, succeeding with only a single avoidable manual intervention. This outcome is offered as suggestive evidence that open-world evaluations can furnish early warnings of capabilities that may soon become widespread. The paper closes with recommendations for designing and reporting such evaluations.

Significance. If the methodological recommendations can be placed on firmer empirical footing, the work could usefully broaden the evaluation toolkit beyond automated benchmarks, particularly for long-horizon tasks that are difficult to specify or score automatically. The survey of existing open-world efforts is a constructive contribution, and the CRUX framing provides a concrete organizational proposal for ongoing qualitative assessment. The single iOS case study usefully illustrates the intended depth of analysis, though its limited scope constrains immediate claims about generalizability or early-warning reliability.

major comments (2)
  1. [§4] §4 (CRUX iOS App Experiment): The inference that the agent's completion of the task with only one avoidable manual intervention demonstrates that open-world evaluations can provide early warning of capabilities soon to become widespread rests on a single qualitative run. No replication across independent trials, no comparison to prior model versions, and no control conditions (e.g., different prompt phrasings or task variants) are reported, leaving the robustness of the outcome and the strength of the broader methodological recommendation under-supported.
  2. [§5] §5 (Recommendations for Design and Reporting): The guidelines for conducting open-world evaluations stress qualitative judgment but do not specify procedures for establishing inter-rater reliability or for documenting how 'avoidable' interventions are distinguished from necessary ones. Because these judgments are central to the validity of the qualitative signal, their absence weakens the claim that the method can be routinely applied in a reproducible manner.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly distinguish the proposed small-sample qualitative approach from existing case-study practices in the AI evaluation literature to clarify the intended novelty.
  2. [Figure 1] Figure 1 (or equivalent diagram of the evaluation pipeline) would benefit from explicit labeling of the qualitative assessment step and any decision criteria used to classify interventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped clarify the scope and limitations of our initial open-world evaluation example. We address each major comment below and have revised the manuscript accordingly where feasible.

  1. Referee: [§4] §4 (CRUX iOS App Experiment): The inference that the agent's completion of the task with only one avoidable manual intervention demonstrates that open-world evaluations can provide early warning of capabilities soon to become widespread rests on a single qualitative run. No replication across independent trials, no comparison to prior model versions, and no control conditions (e.g., different prompt phrasings or task variants) are reported, leaving the robustness of the outcome and the strength of the broader methodological recommendation under-supported.

    Authors: We agree that a single qualitative case cannot robustly support broad inferences about early-warning reliability or widespread capabilities. The iOS experiment is presented as an initial illustrative instance of the open-world approach rather than a controlled empirical study. In the revised manuscript we have explicitly reframed the example to emphasize its role in demonstrating the depth of qualitative analysis possible in real-world tasks, while removing language that could imply generalizability. We maintain that even one long-horizon success in an uncontrolled environment can surface capability signals missed by automated benchmarks, but we now clearly state that systematic replication and controls are needed for stronger claims and are planned for future CRUX evaluations. revision: partial

  2. Referee: [§5] §5 (Recommendations for Design and Reporting): The guidelines for conducting open-world evaluations stress qualitative judgment but do not specify procedures for establishing inter-rater reliability or for documenting how 'avoidable' interventions are distinguished from necessary ones. Because these judgments are central to the validity of the qualitative signal, their absence weakens the claim that the method can be routinely applied in a reproducible manner.

    Authors: We accept this critique and have expanded the recommendations in the revised §5. We now include explicit guidance on inter-rater reliability, such as using multiple independent reviewers and reporting agreement statistics where practical. We have also added a protocol for classifying interventions, requiring documentation of the precise action taken, the prompt or context that preceded it, and a rationale for whether it was avoidable (e.g., could have been resolved by re-prompting or improved agent scaffolding). These additions directly address reproducibility concerns while preserving the qualitative nature of the method. revision: yes
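
The intervention protocol and agreement reporting described in this response can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `InterventionRecord` fields mirror the three documented items (action taken, preceding context, rationale), but the class name, the `cohens_kappa` helper, the choice of Cohen's kappa as the agreement statistic, and all sample values are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InterventionRecord:
    """One manual intervention, documented per the protocol (field names are hypothetical)."""
    action: str             # the precise action the human took
    preceding_context: str  # the prompt or context that preceded it
    rationale: str          # why it was judged avoidable or necessary
    avoidable: bool


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two reviewers labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder record; the strings are illustrative, not taken from the paper.
record = InterventionRecord(
    action="hypothetical: human completed a step on the agent's behalf",
    preceding_context="hypothetical: agent reported it was blocked",
    rationale="hypothetical: re-prompting could have resolved it",
    avoidable=True,
)

# Hypothetical 'avoidable?' judgments from two independent reviewers.
reviewer_a = [True, True, False, False]
reviewer_b = [True, False, False, False]
kappa = cohens_kappa(reviewer_a, reviewer_b)
print(f"kappa = {kappa:.2f}")
```

With very few interventions per run, an agreement statistic of this kind is unstable, which is consistent with the authors' qualifier that agreement be reported "where practical".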

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper is a conceptual survey advocating open-world evaluations plus one qualitative case study of an AI agent building an iOS app. No mathematical derivations, fitted parameters, or predictions appear that reduce by construction to the paper's own inputs or self-citations. The central suggestion that such evaluations can provide early warning rests on the reported empirical instance rather than on any self-referential definition or renaming of prior results. Its claims are presented as observational recommendations, not as forced outputs of internal equations or uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that qualitative judgment of messy tasks is a valid complement to automation, without introducing fitted parameters or new postulated entities.

axioms (1)
  • domain assumption: Qualitative small-sample analysis can reliably indicate broader AI capabilities on open-world tasks.
    Invoked when using the single iOS example to suggest early warning value for the method class.

pith-pipeline@v0.9.0 · 5765 in / 1105 out tokens · 44749 ms · 2026-05-21T06:35:36.897133+00:00 · methodology
