Open-World Evaluations for Measuring Frontier AI Capabilities

Andrew B. Hall; Andrew Schwartz; Arvind Narayanan; Cozmin Ududec; Dimitris Papailiopoulos; Gillian K Hadfield; Harry Coppock; Helen Toner; J.J. Allaire; Magda Dubois

arxiv: 2605.20520 · v1 · pith:QIAYVDZQnew · submitted 2026-05-19 · 💻 cs.AI

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

Pith reviewed 2026-05-21 06:35 UTC · model grok-4.3

classification 💻 cs.AI

keywords: open-world evaluations · frontier AI capabilities · AI benchmarks · qualitative analysis · long-horizon tasks · iOS app development · CRUX project

The pith

An AI agent developed and published a simple iOS app with only a single avoidable manual intervention, suggesting that open-world evaluations can flag frontier capabilities before benchmarks catch them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard benchmarks overstate or understate real AI capabilities because they rely on tasks that are easy to specify, automate, and score in short timeframes. It proposes open-world evaluations as a complement: long-horizon, messy, real-world tasks judged through small-sample qualitative review instead of large-scale automation. A concrete case shows an agent nearly completing the full process of building and releasing an iOS app to the App Store. Readers should care because these evaluations can give earlier signals that advanced abilities are about to become common. The authors introduce the CRUX project to run such tests regularly and offer practical recommendations for their design and reporting.

Core claim

The paper claims that open-world evaluations (long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than automated benchmarks) can detect emerging frontier AI capabilities that traditional benchmarks miss. In the reported instance, an AI agent was given the task of developing and publishing a simple iOS application to the Apple App Store and completed it with only a single avoidable manual intervention. This outcome is presented as evidence that such evaluations can serve as early warnings for capabilities that may soon become widespread. The work surveys prior open-world evaluations, notes their strengths and limits, launches the CRUX project to run such evaluations regularly, and closes with recommendations for designing and reporting them.

What carries the argument

open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation; this mechanism supplies a more realistic signal of deployed capability by removing the constraints of precise specification, automatic grading, and short horizons.

If this is right

Benchmarks alone are insufficient for tracking frontier AI progress and should be supplemented with open-world tasks.
Regular open-world evaluations through a project like CRUX can generate timely signals about capabilities approaching deployment.
AI agents are nearing the ability to handle end-to-end real-world software development with minimal human oversight.
Design and reporting recommendations for open-world evaluations should be adopted to improve their usefulness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If open-world evaluations become routine, they could inform capability thresholds used in AI governance or deployment policies.
The same qualitative approach could be applied to other messy domains such as scientific experimentation or physical-world planning to broaden capability tracking.
Success on app-store publication suggests open-world methods may soon highlight automation risks in creative and commercial software work.

Load-bearing premise

Small-sample qualitative analysis of one long-horizon task supplies a reliable and generalizable signal of frontier capabilities.

What would settle it

Multiple independent replications on similar long-horizon software tasks in which agents require many manual interventions or fail outright would undermine the claim that open-world evaluations give dependable early warnings.

Figures

Figures reproduced from arXiv: 2605.20520 by Andrew B. Hall, Andrew Schwartz, Arvind Narayanan, Cozmin Ududec, Dimitris Papailiopoulos, Gillian K Hadfield, Harry Coppock, Helen Toner, J.J. Allaire, Magda Dubois, Peter Kirgis, Rishi Bommasani, Sara Hooker, Sayash Kapoor, Seth Lazar, Shoshannah Tekofsky, Stephan Rabanser, Steve Newman.

**Figure 2.** A gradient of evaluation methodologies (short single-turn Q&A …

**Figure 3.** Cumulative API cost and timeline of CRUX #1. Total cost was approximately $991 over …

**Figure 4.** The App Store screenshots uploaded by the agent had visible formatting errors. The …

Original abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sensibly pushes open-world evaluations as a benchmark complement and gives a fresh App Store example, but one qualitative case does not yet support routine early-warning use.

The main point is that standard benchmarks can both overstate and understate what frontier models can do once deployed, so the authors want a regular class of open-world evaluations: long, messy real tasks assessed qualitatively in small samples. They survey existing work, flag its limits, propose the CRUX project to run these regularly, and show one concrete run where an agent nearly completes an end-to-end iOS App Store submission with only a single avoidable manual step.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that benchmark-based evaluations can both overstate and understate frontier AI capabilities because they favor precisely specified, automatically gradable, short-horizon tasks. It advocates for complementary 'open-world evaluations' consisting of long-horizon, messy real-world tasks assessed via small-sample qualitative analysis. The paper surveys recent open-world evaluations, introduces the CRUX project for conducting them regularly, and presents a first instance in which an AI agent develops and publishes a simple iOS application to the Apple App Store, succeeding with only a single avoidable manual intervention. This outcome is offered as suggestive evidence that open-world evaluations can furnish early warnings of capabilities that may soon become widespread. The paper closes with recommendations for designing and reporting such evaluations.

Significance. If the methodological recommendations can be placed on firmer empirical footing, the work could usefully broaden the evaluation toolkit beyond automated benchmarks, particularly for long-horizon tasks that are difficult to specify or score automatically. The survey of existing open-world efforts is a constructive contribution, and the CRUX framing provides a concrete organizational proposal for ongoing qualitative assessment. The single iOS case study usefully illustrates the intended depth of analysis, though its limited scope constrains immediate claims about generalizability or early-warning reliability.

major comments (2)

[§4] §4 (CRUX iOS App Experiment): The inference that the agent's completion of the task with only one avoidable manual intervention demonstrates that open-world evaluations can provide early warning of capabilities soon to become widespread rests on a single qualitative run. No replication across independent trials, no comparison to prior model versions, and no control conditions (e.g., different prompt phrasings or task variants) are reported, leaving the robustness of the outcome and the strength of the broader methodological recommendation under-supported.
[§5] §5 (Recommendations for Design and Reporting): The guidelines for conducting open-world evaluations stress qualitative judgment but do not specify procedures for establishing inter-rater reliability or for documenting how 'avoidable' interventions are distinguished from necessary ones. Because these judgments are central to the validity of the qualitative signal, their absence weakens the claim that the method can be routinely applied in a reproducible manner.
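The first major comment turns on how little a single run constrains an underlying success rate. As a rough illustration only, the sketch below treats each run as a binary success or failure (a simplification the paper itself does not adopt, since it relies on qualitative analysis) and computes a Wilson score interval. The function name `wilson_interval` and the trial counts other than 1 out of 1 are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 1/1 mirrors the single reported run; 3/3 and 5/5 are hypothetical replications.
for successes, trials in [(1, 1), (3, 3), (5, 5)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} successes: 95% interval [{low:.2f}, {high:.2f}]")
```

Under these assumptions, one success in one trial leaves a lower bound of roughly 0.21 on the success rate, which is why the referee asks for replication before drawing broader conclusions.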

minor comments (2)

[Abstract] The abstract and introduction could more explicitly distinguish the proposed small-sample qualitative approach from existing case-study practices in the AI evaluation literature to clarify the intended novelty.
[Figure 1] Figure 1 (or equivalent diagram of the evaluation pipeline) would benefit from explicit labeling of the qualitative assessment step and any decision criteria used to classify interventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped clarify the scope and limitations of our initial open-world evaluation example. We address each major comment below and have revised the manuscript accordingly where feasible.

Referee: [§4] §4 (CRUX iOS App Experiment): The inference that the agent's completion of the task with only one avoidable manual intervention demonstrates that open-world evaluations can provide early warning of capabilities soon to become widespread rests on a single qualitative run. No replication across independent trials, no comparison to prior model versions, and no control conditions (e.g., different prompt phrasings or task variants) are reported, leaving the robustness of the outcome and the strength of the broader methodological recommendation under-supported.

Authors: We agree that a single qualitative case cannot robustly support broad inferences about early-warning reliability or widespread capabilities. The iOS experiment is presented as an initial illustrative instance of the open-world approach rather than a controlled empirical study. In the revised manuscript we have explicitly reframed the example to emphasize its role in demonstrating the depth of qualitative analysis possible in real-world tasks, while removing language that could imply generalizability. We maintain that even one long-horizon success in an uncontrolled environment can surface capability signals missed by automated benchmarks, but we now clearly state that systematic replication and controls are needed for stronger claims and are planned for future CRUX evaluations.
revision: partial

Referee: [§5] §5 (Recommendations for Design and Reporting): The guidelines for conducting open-world evaluations stress qualitative judgment but do not specify procedures for establishing inter-rater reliability or for documenting how 'avoidable' interventions are distinguished from necessary ones. Because these judgments are central to the validity of the qualitative signal, their absence weakens the claim that the method can be routinely applied in a reproducible manner.

Authors: We accept this critique and have expanded the recommendations in the revised §5. We now include explicit guidance on inter-rater reliability, such as using multiple independent reviewers and reporting agreement statistics where practical. We have also added a protocol for classifying interventions, requiring documentation of the precise action taken, the prompt or context that preceded it, and a rationale for whether it was avoidable (e.g., could have been resolved by re-prompting or improved agent scaffolding). These additions directly address reproducibility concerns while preserving the qualitative nature of the method.
revision: yes
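
The intervention protocol and agreement statistic mentioned in this response could be recorded along the following lines. This is a minimal sketch, not the authors' tooling: the `InterventionRecord` fields, the `cohens_kappa` helper, the choice of Cohen's kappa as the agreement statistic, and all example values are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InterventionRecord:
    """One manual intervention, with the three items the response asks for."""
    action_taken: str        # the precise action the human performed
    preceding_context: str   # the prompt or context that preceded it
    avoidable: bool          # the reviewer's classification
    rationale: str           # why it was judged avoidable or necessary


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two reviewers labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each reviewer's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical record; the details are invented for illustration only.
example = InterventionRecord(
    action_taken="operator completed a sign-in step by hand",
    preceding_context="agent reported it could not proceed past the sign-in page",
    avoidable=True,
    rationale="could plausibly have been resolved by re-prompting",
)

# Hypothetical labels from two independent reviewers over four interventions.
reviewer_a = ["avoidable", "avoidable", "necessary", "necessary"]
reviewer_b = ["avoidable", "necessary", "necessary", "necessary"]
print(f"Cohen's kappa: {cohens_kappa(reviewer_a, reviewer_b):.2f}")
```

With only a handful of interventions per run, any such statistic will be noisy, so it would be reported alongside the written rationales rather than in place of them.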

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper is a conceptual survey advocating open-world evaluations plus one qualitative case study of an AI agent building an iOS app. No mathematical derivations, fitted parameters, or predictions appear that reduce by construction to the paper's own inputs or self-citations. The central suggestion that such evaluations can provide early warning rests on the reported empirical instance rather than any self-referential definition or renaming of prior results. The work is self-contained against external benchmarks in the sense that its claims are presented as observational recommendations, not as forced outputs of internal equations or uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that qualitative judgment of messy tasks is a valid complement to automation, without introducing fitted parameters or new postulated entities.

axioms (1)

Domain assumption: Qualitative small-sample analysis can reliably indicate broader AI capabilities on open-world tasks. Invoked when using the single iOS example to suggest early warning value for the method class.

pith-pipeline@v0.9.0 · 5765 in / 1105 out tokens · 44749 ms · 2026-05-21T06:35:36.897133+00:00 · methodology

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 16 internal anchors

[1]

Measuring ai ability to complete long tasks.arXiv preprint arXiv:2503.14499,

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney V on Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence...

work page arXiv 2025
[2]

Building a C compiler with a team of parallel Claudes, 2026

Nicholas Carlini. Building a C compiler with a team of parallel Claudes, 2026. URL https: //www.anthropic.com/engineering/building-c-compiler. Anthropic

work page 2026
[3]

Project Vend: Phase two, 2025

Anthropic. Project Vend: Phase two, 2025. URL https://www.anthropic.com/research/ project-vend-2. Anthropic

work page 2025
[4]

Assessing Claude Mythos Preview’s cybersecurity capabilities, 2026

Nicholas Carlini, Newton Cheng, Keane Lucas, Michael Moore, Milad Nasr, Vinay Prab- hushankar, Winnie Xiao, Hakeem Angulu, Evyatar Ben Asher, Jackie Bow, Keir Bradwell, Ben Buchanan, David Forsythe, Daniel Freeman, Alex Gaynor, Xinyang Ge, Logan Graham, Kyla Guru, Hasnain Lakhani, Matt McNiece, Mojtaba Mehrara, Renee Nichol, Adnan Pirzada, Sophia Porter, ...

work page 2026
[5]

Common Ground between AI 2027 & AI as Normal Technology, 2025

Sayash Kapoor, Arvind Narayanan, Daniel Kokotajlo, Eli Lifland, and Thomas Larsen. Common Ground between AI 2027 & AI as Normal Technology, 2025. URL https://asteriskmag. substack.com/p/common-ground-between-ai-2027-and. Asterisk Magazine

work page 2027
[6]

Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. InProceedings of the 202...

work page arXiv 2021
[7]

Manning, Christopher Ré, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

work page 2023
[8]

Jacobs and Hanna Wallach

Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), pages 375–385,

work page 2021
[9]

doi: 10.1145/3442188.3445901

work page doi:10.1145/3442188.3445901
[10]

Ai and the everything in the whole wide world benchmark.arXiv preprint arXiv:2111.15366,

Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. AI and the everything in the whole wide world benchmark. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. URL https://arxiv.org/abs/2111.15366

work page arXiv 2021
[11]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,

work page
[12]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

URLhttps://arxiv.org/abs/2310.06770. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv
[13]

On the Measure of Intelligence

François Chollet. On the Measure of Intelligence, 2019. URL https://arxiv.org/abs/ 1911.01547. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2019
[14]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, 2024. URL https://arxiv.org/ abs/2406.12045. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Terminal-Bench

Terminal-Bench Team. Terminal-Bench. URLhttps://www.tbench.ai/

work page
[16]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-Bench: Evaluating Conversational Agents in a Dual-Control Environment, 2025. URL https:// arxiv.org/abs/2506.07982. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. ARC- AGI-2: A New Challenge for Frontier AI Reasoning Systems, 2025. URL https://arxiv. org/abs/2505.11831

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench Verified,

work page
[20]

Ope- nAI

URL https://openai.com/index/introducing-swe-bench-verified/ . Ope- nAI

work page
[21]

Time Horizon 1.1, 2026

METR. Time Horizon 1.1, 2026. URL https://metr.org/blog/2026-1-29-time- horizon-1-1/. METR

work page 2026
[22]

SWE-bench Multilingual

SWE-bench. SWE-bench Multilingual. URL https://www.swebench.com/multilingual- leaderboard.html. SWE-bench

work page
[23]

τ 3-bench: advancing agent benchmarking to knowledge and voice, 2026

Sierra. τ 3-bench: advancing agent benchmarking to knowledge and voice, 2026. URL https: //sierra.ai/resources/research/tau-3-bench. Sierra

work page 2026
[24]

ARC-AGI-3

ARC Prize. ARC-AGI-3. URLhttps://arcprize.org/arc-agi/3. ARC Prize. 11

work page
[25]

Jimenez, Alex L

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?,

work page
[26]

arXiv.org

URLhttps://arxiv.org/abs/2410.03859. arXiv.org

work page arXiv
[27]

GitHub - harbor-framework/harborz

Harbor Framework Team. GitHub - harbor-framework/harborz. URL https://github.com/ harbor-framework/harbor

work page
[28]

To- wards a science of AI agent reliability,

Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan. Towards a Science of AI Agent Reliability, 2026. URL https://arxiv.org/ abs/2602.16666. arXiv.org

work page arXiv 2026
[29]

Many SWE-bench-Passing PRs Would Not Be Merged into Main, 2026

Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. Many SWE-bench-Passing PRs Would Not Be Merged into Main, 2026. URL https://metr.org/notes/2026-03-10- many-swe-bench-passing-prs-would-not-be-merged-into-main/. METR

work page 2026
[30]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding, 2020. URL https: //arxiv.org/abs/2009.03300. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2020
[31]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, 2023. URLhttps://arxiv.org/abs/2311.12022. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024 a

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, 2024. URLhttps://arxiv.org/ abs/2406.04770. arXiv.org

work page arXiv 2024
[34]

GitHub - lmarena/arena-hard-auto: Arena-Hard-Auto: An automatic LLM benchmark

lmarena. GitHub - lmarena/arena-hard-auto: Arena-Hard-Auto: An automatic LLM benchmark. URLhttps://github.com/lmarena/arena-hard-auto. GitHub

work page
[35]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents, 2023. URL https://arxiv.org/abs/ 2307.13854. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Seven simple steps for log analysis in AI systems ,

Magda Dubois, Ekin Zorer, Maia Hamin, Joe Skinner, Alexandra Souly, Jerome Wynne, Harry Coppock, Lucas Satos, Sayash Kapoor, Sunischal Dev, Keno Juchems, Kimberly Mai, Timo Flesch, Lennart Luettgau, Charles Teague, Eric Patey, JJ Allaire, Lorenzo Pacchiardi, Jose Hernandez-Orallo, and Cozmin Ududec. Seven simple steps for log analysis in AI systems ,

work page
[37]

URLhttps://arxiv.org/html/2604.09563v1

work page internal anchor Pith review Pith/arXiv arXiv
[38]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI Model Performance on Real-Worl...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

GDPval-AA Leaderboard

Artificial Analysis. GDPval-AA Leaderboard. URL https://artificialanalysis.ai/ evaluations/gdpval-aa. Artificial Analysis

work page
[40]

Partnering with Mozilla to improve Firefox’s security, 2026

Anthropic. Partnering with Mozilla to improve Firefox’s security, 2026. URL https://www. anthropic.com/news/mozilla-firefox-security. Anthropic

work page 2026
[41]

One of the World’s Most Advanced AI Agents Is Completely Stuck Trying to Beat a Pokémon Game for Children, 2025

Frank Landymore. One of the World’s Most Advanced AI Agents Is Completely Stuck Trying to Beat a Pokémon Game for Children, 2025. URL https://futurism.com/advanced-ai- stuck-pokemon. Futurism

work page 2025
[42]

URLhttps://theaidigest.org/village

AI Village. URLhttps://theaidigest.org/village. 12

work page
[43]

Scaling long-running autonomous coding · Cursor, 2026

Wilson Lin. Scaling long-running autonomous coding · Cursor, 2026. URL https://cursor. com/blog/scaling-agents. Cursor

work page 2026
[44]

How we rebuilt Next.js with AI in one week, 2026

Steve Faulkner. How we rebuilt Next.js with AI in one week, 2026. URL https://blog. cloudflare.com/vinext/. The Cloudflare Blog

work page 2026
[45]

Three days ago I left autoresearch tuning nanochat for 2 days on depth=12 model., 2026

Andrej Karpathy. Three days ago I left autoresearch tuning nanochat for 2 days on depth=12 model., 2026. URLhttps://x.com/karpathy/status/2031135152349524125. X

work page arXiv 2026
[46]

Can You Train a Computer? URL https://x.com/ DimitrisPapail/status/2028669695344148946

Dimitris Papailiopoulos. Can You Train a Computer? URL https://x.com/ DimitrisPapail/status/2028669695344148946. X

work page arXiv
[47]

How close is AI to taking my job?, 2026

Anson Ho. How close is AI to taking my job?, 2026. URL https://epoch.ai/gradient- updates/how-close-is-ai-to-taking-my-job. Epoch AI

work page 2026
[48]

MirrorCode: Evidence that AI can already do some weeks-long coding tasks, 2026

Tom Adamczewski, David Rein, David Owen, and Florian Brand. MirrorCode: Evidence that AI can already do some weeks-long coding tasks, 2026. URL https://epoch.ai/blog/ mirrorcode-preliminary-results. Epoch AI

work page 2026
[49]

Automated Weak-to- Strong Researcher, 2026

Jiaxin Wen, Liang Qiu, Joe Benton, Jan Hendrik Kirchner, and Jan Leike. Automated Weak-to- Strong Researcher, 2026. URL https://alignment.anthropic.com/2026/automated- w2s-researcher/. Alignment Science Blog

work page 2026
[50]

URL https://www.thoughtfullab.com/letting-ai- posttrain-ai.html

Letting AI Post-train AI, 2026. URL https://www.thoughtfullab.com/letting-ai- posttrain-ai.html. Thoughtful Lab

work page 2026
[51]

tinker-cookbook/tinker_cookbook/recipes/golf_forecasting at claude/golf-forecasting-setup-VIpRZ · dphuang2/tinker-cookbook

Dylan Huang. tinker-cookbook/tinker_cookbook/recipes/golf_forecasting at claude/golf-forecasting-setup-VIpRZ · dphuang2/tinker-cookbook. URL https: //github.com/dphuang2/tinker-cookbook/tree/claude/golf-forecasting- setup-VIpRZ/tinker_cookbook/recipes/golf_forecasting. GitHub

work page
[52]

50 Years of Data Science.Journal of Computational and Graphical Statistics, 26(4):745–766, 2017

David Donoho. 50 Years of Data Science.Journal of Computational and Graphical Statistics, 26(4):745–766, 2017. doi: 10.1080/10618600.2017.1384734. URL https: //www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734. Taylor & Fran- cis

work page doi:10.1080/10618600.2017.1384734 2017
[53]

Advances in neural information processing systems, 36:11809–11822

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?, 2024. URLhttps://arxiv.org/abs/2407.15711. arXiv.org

work page arXiv 2024
[54]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, 2023. URL https://arxiv.org/ abs/2311.12983. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

AI as Normal Technology, 2025

Arvind Narayanan and Sayash Kapoor. AI as Normal Technology, 2025. URL https://www. normaltech.ai/p/ai-as-normal-technology. AI as Normal Technology

work page 2025
[56]

Making frontier cybersecurity capabilities available to defenders, 2026

Anthropic. Making frontier cybersecurity capabilities available to defenders, 2026. URL https://www.anthropic.com/news/claude-code-security. Anthropic

work page 2026
[57]

Project Glasswing: Securing critical software for the AI era

Anthropic. Project Glasswing: Securing critical software for the AI era. URL https://www. anthropic.com/glasswing. Anthropic

work page
[58]

A Safe Harbor for AI Evaluation and Red Teaming, 2024

Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, and Peter Henderson. A Safe Harb...

work page arXiv 2024
[59]

Claude Mythos Preview system card, 2026

Anthropic. Claude Mythos Preview system card, 2026. URL https://www-cdn.anthropic. com/8b8380204f74670be75e81c820ca8dda846ab289.pdf. Anthropic. 13

work page 2026
[60]

Vibe coding could mark the end of the App Store review process as we know it - 9to5Mac, 2026

Michael Burkhardt. Vibe coding could mark the end of the App Store review process as we know it - 9to5Mac, 2026. URL https://9to5mac.com/2026/03/29/vibe-coding- developers-report-long-app-store-review-queues/. 9to5Mac

work page 2026
[61]

iOS developers: How long is App Review taking for everyone these days?, 2026

Nikita Bier. iOS developers: How long is App Review taking for everyone these days?, 2026. URLhttps://x.com/nikitabier/status/2033931821260648659. X

work page arXiv 2026
[62]

The App Store Just Logged Its Biggest Release Year in Nearly a Decade, 2025

Ariel. The App Store Just Logged Its Biggest Release Year in Nearly a Decade, 2025. URL https://appfigures.com/resources/insights/20251205?f=2. Appfigures

work page arXiv 2025
[63]

The Apple App Store is seeing an unexpected phenomenon

Jennifer Mattson. The Apple App Store is seeing an unexpected phenomenon. Is vibe coding behind it?, 2026. URL https://www.fastcompany.com/91522242/apple-app-store-vibe-coding-generative-ai-unexpected-phenomenon. Fast Company

work page arXiv 2026
[64]

OpenClaw - Personal AI Assistant

OpenClaw. OpenClaw - Personal AI Assistant. URL https://openclaw.ai/. OpenClaw

work page
[65]

Adaptive thinking

Claude API Docs. Adaptive thinking. URL https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking. Claude API Docs

work page
[66]

Eval awareness in Claude Opus 4.6’s BrowseComp performance, 2026

Russell Coleman. Eval awareness in Claude Opus 4.6’s BrowseComp performance, 2026. URL https://www.anthropic.com/engineering/eval-awareness-browsecomp. Anthropic

work page 2026
[67]

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations, 2025

Apollo Research. Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations, 2025. URL https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alignment-evaluations/

work page 2025
[68]

Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations, 2025

Marcus Williams, Cameron Raymond, and Micah Carroll. Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations, 2025. URL https://alignment.openai.com/prod-evals/. OpenAI Alignment Blog

work page 2025
[69]

Security Overview

Apple. Security Overview. URL https://developer.apple.com/library/archive/documentation/Security/Conceptual/Security_Overview/Architecture/Architecture.html. Apple

work page
[70]

Controlling app access to files in macOS, 2021

Apple Support. Controlling app access to files in macOS, 2021. URL https://support.apple.com/guide/security/controlling-app-access-to-files-secddd1d86a6/web. Apple Support

work page 2021
[71]

Hello world does not compile · Issue #1 · anthropics/claudes-c-compiler, 2026

Viacheslav Potoropin. Hello world does not compile · Issue #1 · anthropics/claudes-c-compiler, 2026. URL https://github.com/anthropics/claudes-c-compiler/issues/1. GitHub

work page 2026
[72]

build fails with 32 errors , no releases, no tags, no stable branch · Issue #98 · wilsonzlin/fastrender, 2026

Youssef Tourki. build fails with 32 errors , no releases, no tags, no stable branch · Issue #98 · wilsonzlin/fastrender, 2026. URL https://github.com/wilsonzlin/fastrender/issues/98. GitHub

work page 2026
[73]

Non-Functional Requirements in Software Engineering, 1999

L. Chung, B. A. Nixon, E. Yu, and J. Mylopoulos. Non-Functional Requirements in Software Engineering, 1999. URL https://personal.utdallas.edu/~chung/BOOK/book.html. Kluwer Academic Publishing

work page 1999
[74]

What did we learn from the AI Village in 2025?, 2026

Shoshannah Tekofsky. What did we learn from the AI Village in 2025?, 2026. URL https://theaidigest.org/village/blog/what-we-learned-2025

work page 2025
[75]

BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks, 2025

Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, and Osbert Bastani. BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks, 2025. URL https://arxiv.org/abs/2510.02418. arXiv

work page arXiv 2025
[76]

Cheating On AI Agent Evaluations, 2025

Maia Hamin and Benjamin Edelman. Cheating On AI Agent Evaluations, 2025. URL https://www.nist.gov/caisi/cheating-ai-agent-evaluations. NIST

work page 2025
[77]

Repo State Loopholes During Agentic Evaluation · Issue #465 · SWE-bench/SWE-bench, 2025

Jacob Kahn. Repo State Loopholes During Agentic Evaluation · Issue #465 · SWE-bench/SWE-bench, 2025. URL https://github.com/SWE-bench/SWE-bench/issues/465.

work page 2025
[78]

Project Vend: Can Claude run a small shop? (And why does that matter?), 2025

Anthropic. Project Vend: Can Claude run a small shop? (And why does that matter?), 2025. URL https://www.anthropic.com/research/project-vend-1. Anthropic

work page 2025
[79]

We gave an AI a 3 year retail lease in SF and asked it to make a profit, 2026

Andon Labs. We gave an AI a 3 year retail lease in SF and asked it to make a profit, 2026. URL https://andonlabs.com/blog/andon-market-launch. Andon Labs

work page 2026
[80]

SciCode: A Research Coding Benchmark Curated by Scientists, 2024

Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huert...

work page arXiv 2024

Showing first 80 references.

[1] [1]

Measuring AI ability to complete long tasks. arXiv preprint arXiv:2503.14499

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence...

work page arXiv 2025

[2] [2]

Building a C compiler with a team of parallel Claudes, 2026

Nicholas Carlini. Building a C compiler with a team of parallel Claudes, 2026. URL https://www.anthropic.com/engineering/building-c-compiler. Anthropic

work page 2026

[3] [3]

Project Vend: Phase two, 2025

Anthropic. Project Vend: Phase two, 2025. URL https://www.anthropic.com/research/project-vend-2. Anthropic

work page 2025

[4] [4]

Assessing Claude Mythos Preview’s cybersecurity capabilities, 2026

Nicholas Carlini, Newton Cheng, Keane Lucas, Michael Moore, Milad Nasr, Vinay Prabhushankar, Winnie Xiao, Hakeem Angulu, Evyatar Ben Asher, Jackie Bow, Keir Bradwell, Ben Buchanan, David Forsythe, Daniel Freeman, Alex Gaynor, Xinyang Ge, Logan Graham, Kyla Guru, Hasnain Lakhani, Matt McNiece, Mojtaba Mehrara, Renee Nichol, Adnan Pirzada, Sophia Porter, ...

work page 2026

[5] [5]

Common Ground between AI 2027 & AI as Normal Technology, 2025

Sayash Kapoor, Arvind Narayanan, Daniel Kokotajlo, Eli Lifland, and Thomas Larsen. Common Ground between AI 2027 & AI as Normal Technology, 2025. URL https://asteriskmag.substack.com/p/common-ground-between-ai-2027-and. Asterisk Magazine

work page 2027

[6] [6]

Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 202...

work page arXiv 2021

[7] [7]

Manning, Christopher Ré, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

work page 2023

[8] [8]

Measurement and fairness

Abigail Z. Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), pages 375–385,

work page 2021

[9] [9]

doi: 10.1145/3442188.3445901

work page doi:10.1145/3442188.3445901

[10] [10]

AI and the everything in the whole wide world benchmark. arXiv preprint arXiv:2111.15366

Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. AI and the everything in the whole wide world benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. URL https://arxiv.org/abs/2111.15366

work page arXiv 2021

[11] [11]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,

work page

[12] [12]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

URL https://arxiv.org/abs/2310.06770. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

On the Measure of Intelligence

François Chollet. On the Measure of Intelligence, 2019. URL https://arxiv.org/abs/1911.01547. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2019

[14] [14]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, 2024. URL https://arxiv.org/abs/2406.12045. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Terminal-Bench

Terminal-Bench Team. Terminal-Bench. URL https://www.tbench.ai/

work page

[16] [16]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment, 2025. URL https://arxiv.org/abs/2506.07982. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems, 2025. URL https://arxiv.org/abs/2505.11831

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Introducing SWE-bench Verified

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench Verified,

work page

[20] [20]

OpenAI

URL https://openai.com/index/introducing-swe-bench-verified/. OpenAI

work page

[21] [21]

Time Horizon 1.1, 2026

METR. Time Horizon 1.1, 2026. URL https://metr.org/blog/2026-1-29-time-horizon-1-1/. METR

work page 2026

[22] [22]

SWE-bench Multilingual

SWE-bench. SWE-bench Multilingual. URL https://www.swebench.com/multilingual-leaderboard.html. SWE-bench

work page

[23] [23]

$\tau^3$-bench: advancing agent benchmarking to knowledge and voice, 2026

Sierra. $\tau^3$-bench: advancing agent benchmarking to knowledge and voice, 2026. URL https://sierra.ai/resources/research/tau-3-bench. Sierra

work page 2026

[24] [24]

ARC-AGI-3

ARC Prize. ARC-AGI-3. URL https://arcprize.org/arc-agi/3. ARC Prize

work page

[25] [25]

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?,

work page

[26] [26]

arXiv.org

URL https://arxiv.org/abs/2410.03859. arXiv.org

work page arXiv

[27] [27]

GitHub - harbor-framework/harbor

Harbor Framework Team. GitHub - harbor-framework/harbor. URL https://github.com/harbor-framework/harbor

work page

[28] [28]

Towards a Science of AI Agent Reliability, 2026

Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan. Towards a Science of AI Agent Reliability, 2026. URL https://arxiv.org/abs/2602.16666. arXiv.org

work page arXiv 2026

[29] [29]

Many SWE-bench-Passing PRs Would Not Be Merged into Main, 2026

Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush. Many SWE-bench-Passing PRs Would Not Be Merged into Main, 2026. URL https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/. METR

work page 2026

[30] [30]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding, 2020. URL https://arxiv.org/abs/2009.03300. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2020

[31] [31]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, 2023. URL https://arxiv.org/abs/2311.12022. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[33] [33]

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, 2024

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild, 2024. URL https://arxiv.org/abs/2406.04770. arXiv.org

work page arXiv 2024

[34] [34]

GitHub - lmarena/arena-hard-auto: Arena-Hard-Auto: An automatic LLM benchmark

lmarena. GitHub - lmarena/arena-hard-auto: Arena-Hard-Auto: An automatic LLM benchmark. URL https://github.com/lmarena/arena-hard-auto. GitHub

work page

[35] [35]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents, 2023. URL https://arxiv.org/abs/2307.13854. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Seven simple steps for log analysis in AI systems

Magda Dubois, Ekin Zorer, Maia Hamin, Joe Skinner, Alexandra Souly, Jerome Wynne, Harry Coppock, Lucas Satos, Sayash Kapoor, Sunischal Dev, Keno Juchems, Kimberly Mai, Timo Flesch, Lennart Luettgau, Charles Teague, Eric Patey, JJ Allaire, Lorenzo Pacchiardi, Jose Hernandez-Orallo, and Cozmin Ududec. Seven simple steps for log analysis in AI systems,

work page

[37] [37]

URL https://arxiv.org/html/2604.09563v1

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI Model Performance on Real-Worl...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

GDPval-AA Leaderboard

Artificial Analysis. GDPval-AA Leaderboard. URL https://artificialanalysis.ai/evaluations/gdpval-aa. Artificial Analysis

work page

[40] [40]

Partnering with Mozilla to improve Firefox’s security, 2026

Anthropic. Partnering with Mozilla to improve Firefox’s security, 2026. URL https://www.anthropic.com/news/mozilla-firefox-security. Anthropic

work page 2026

[41] [41]

One of the World’s Most Advanced AI Agents Is Completely Stuck Trying to Beat a Pokémon Game for Children, 2025

Frank Landymore. One of the World’s Most Advanced AI Agents Is Completely Stuck Trying to Beat a Pokémon Game for Children, 2025. URL https://futurism.com/advanced-ai-stuck-pokemon. Futurism

work page 2025

[42] [42]

AI Village

AI Village. URL https://theaidigest.org/village.

work page

[43] [43]

Scaling long-running autonomous coding · Cursor, 2026

Wilson Lin. Scaling long-running autonomous coding · Cursor, 2026. URL https://cursor.com/blog/scaling-agents. Cursor

work page 2026

[44] [44]

How we rebuilt Next.js with AI in one week, 2026

Steve Faulkner. How we rebuilt Next.js with AI in one week, 2026. URL https://blog.cloudflare.com/vinext/. The Cloudflare Blog

work page 2026

[45] [45]

Three days ago I left autoresearch tuning nanochat for 2 days on depth=12 model., 2026

Andrej Karpathy. Three days ago I left autoresearch tuning nanochat for 2 days on depth=12 model., 2026. URL https://x.com/karpathy/status/2031135152349524125. X

work page arXiv 2026

[46] [46]

Can You Train a Computer?

Dimitris Papailiopoulos. Can You Train a Computer? URL https://x.com/DimitrisPapail/status/2028669695344148946. X

work page arXiv

[47] [47]

How close is AI to taking my job?, 2026

Anson Ho. How close is AI to taking my job?, 2026. URL https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job. Epoch AI

work page 2026

[48] [48]

MirrorCode: Evidence that AI can already do some weeks-long coding tasks, 2026

Tom Adamczewski, David Rein, David Owen, and Florian Brand. MirrorCode: Evidence that AI can already do some weeks-long coding tasks, 2026. URL https://epoch.ai/blog/mirrorcode-preliminary-results. Epoch AI

work page 2026

[49] [49]

Automated Weak-to-Strong Researcher, 2026

Jiaxin Wen, Liang Qiu, Joe Benton, Jan Hendrik Kirchner, and Jan Leike. Automated Weak-to-Strong Researcher, 2026. URL https://alignment.anthropic.com/2026/automated-w2s-researcher/. Alignment Science Blog

work page 2026

[50] [50]

Letting AI Post-train AI, 2026

Letting AI Post-train AI, 2026. URL https://www.thoughtfullab.com/letting-ai-posttrain-ai.html. Thoughtful Lab

work page 2026

[51] [51]

tinker-cookbook/tinker_cookbook/recipes/golf_forecasting at claude/golf-forecasting-setup-VIpRZ · dphuang2/tinker-cookbook

Dylan Huang. tinker-cookbook/tinker_cookbook/recipes/golf_forecasting at claude/golf-forecasting-setup-VIpRZ · dphuang2/tinker-cookbook. URL https://github.com/dphuang2/tinker-cookbook/tree/claude/golf-forecasting-setup-VIpRZ/tinker_cookbook/recipes/golf_forecasting. GitHub

work page

[52] [52]

50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4):745–766, 2017

David Donoho. 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4):745–766, 2017. doi: 10.1080/10618600.2017.1384734. URL https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734. Taylor & Francis

work page doi:10.1080/10618600.2017.1384734 2017

[53] [53]

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?, 2024

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?, 2024. URL https://arxiv.org/abs/2407.15711. arXiv.org

work page arXiv 2024

[54] [54]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, 2023. URL https://arxiv.org/abs/2311.12983. arXiv.org

work page internal anchor Pith review Pith/arXiv arXiv 2023

[55] [55]

AI as Normal Technology, 2025

Arvind Narayanan and Sayash Kapoor. AI as Normal Technology, 2025. URL https://www.normaltech.ai/p/ai-as-normal-technology. AI as Normal Technology

work page 2025
