Open-World Evaluations for Measuring Frontier AI Capabilities
Pith reviewed 2026-05-21 06:35 UTC · model grok-4.3
The pith
An AI agent developed and published a simple iOS app with only one manual intervention, showing open-world evaluations can flag frontier capabilities before benchmarks catch them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that open-world evaluations (long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than automated benchmarks) can detect emerging frontier AI capabilities that traditional benchmarks miss. In the reported instance, an AI agent was given the task of developing and publishing a simple iOS application to the Apple App Store and completed it with only a single avoidable manual intervention. This outcome is presented as evidence that such evaluations can serve as early warnings for capabilities that may soon become widespread. The work surveys prior open-world evaluations, notes their strengths and limits, launches the CRUX project for running such evaluations regularly, and closes with recommendations for designing and reporting them.
What carries the argument
Open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. This mechanism is meant to supply a more realistic signal of deployed capability by removing the constraints of precise specification, automatic grading, and short horizons.
If this is right
- Benchmarks alone are insufficient for tracking frontier AI progress and should be supplemented with open-world tasks.
- Regular open-world evaluations through a project like CRUX can generate timely signals about capabilities approaching deployment.
- AI agents are nearing the ability to handle end-to-end real-world software development with minimal human oversight.
- Design and reporting recommendations for open-world evaluations should be adopted to improve their usefulness.
Where Pith is reading between the lines
- If open-world evaluations become routine, they could inform capability thresholds used in AI governance or deployment policies.
- The same qualitative approach could be applied to other messy domains such as scientific experimentation or physical-world planning to broaden capability tracking.
- Success on app-store publication suggests open-world methods may soon highlight automation risks in creative and commercial software work.
Load-bearing premise
Small-sample qualitative analysis of one long-horizon task supplies a reliable and generalizable signal of frontier capabilities.
What would settle it
Multiple independent replications on similar long-horizon software tasks in which agents require many manual interventions or fail outright would undermine the claim that open-world evaluations give dependable early warnings.
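To make the sample-size concern concrete, the sketch below computes an exact (Clopper-Pearson) lower confidence bound on an agent's per-run success rate for the special case where every observed run succeeded. It is not part of the paper or of this review: treating the iOS run as a single binary success, and the function name `all_success_lower_bound`, are assumptions made purely for illustration.

```python
def all_success_lower_bound(trials: int, alpha: float = 0.05) -> float:
    """Two-sided Clopper-Pearson lower bound on the success rate
    when all `trials` independent runs succeeded.

    For x = n successes the bound solves p**n = alpha / 2,
    which gives p = (alpha / 2) ** (1 / n).
    """
    if trials <= 0:
        raise ValueError("trials must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (alpha / 2) ** (1 / trials)


if __name__ == "__main__":
    # Hypothetical replication counts; only n = 1 corresponds to the reported run.
    for n in (1, 5, 10):
        print(f"{n} of {n} runs succeeded: success rate >= {all_success_lower_bound(n):.3f}")
```

Under these assumptions, one success in one run is compatible with a true success rate as low as 2.5 percent at the 95 percent level, while five successes in five runs would raise that floor to roughly 48 percent. This is the quantitative sense in which replications would strengthen or undermine the early-warning claim.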
Original abstract
Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that benchmark-based evaluations can both overstate and understate frontier AI capabilities because they favor precisely specified, automatically gradable, short-horizon tasks. It advocates for complementary 'open-world evaluations' consisting of long-horizon, messy real-world tasks assessed via small-sample qualitative analysis. The paper surveys recent open-world evaluations, introduces the CRUX project for conducting them regularly, and presents a first instance in which an AI agent develops and publishes a simple iOS application to the Apple App Store, succeeding with only a single avoidable manual intervention. This outcome is offered as suggestive evidence that open-world evaluations can furnish early warnings of capabilities that may soon become widespread. The paper closes with recommendations for designing and reporting such evaluations.
Significance. If the methodological recommendations can be placed on firmer empirical footing, the work could usefully broaden the evaluation toolkit beyond automated benchmarks, particularly for long-horizon tasks that are difficult to specify or score automatically. The survey of existing open-world efforts is a constructive contribution, and the CRUX framing provides a concrete organizational proposal for ongoing qualitative assessment. The single iOS case study usefully illustrates the intended depth of analysis, though its limited scope constrains immediate claims about generalizability or early-warning reliability.
major comments (2)
- [§4] §4 (CRUX iOS App Experiment): The inference that the agent's completion of the task with only one avoidable manual intervention demonstrates that open-world evaluations can provide early warning of capabilities soon to become widespread rests on a single qualitative run. No replication across independent trials, no comparison to prior model versions, and no control conditions (e.g., different prompt phrasings or task variants) are reported, leaving the robustness of the outcome and the strength of the broader methodological recommendation under-supported.
- [§5] §5 (Recommendations for Design and Reporting): The guidelines for conducting open-world evaluations stress qualitative judgment but do not specify procedures for establishing inter-rater reliability or for documenting how 'avoidable' interventions are distinguished from necessary ones. Because these judgments are central to the validity of the qualitative signal, their absence weakens the claim that the method can be routinely applied in a reproducible manner.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly distinguish the proposed small-sample qualitative approach from existing case-study practices in the AI evaluation literature to clarify the intended novelty.
- [Figure 1] Figure 1 (or equivalent diagram of the evaluation pipeline) would benefit from explicit labeling of the qualitative assessment step and any decision criteria used to classify interventions.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped clarify the scope and limitations of our initial open-world evaluation example. We address each major comment below and have revised the manuscript accordingly where feasible.
Point-by-point responses
- Referee: [§4] §4 (CRUX iOS App Experiment): The inference that the agent's completion of the task with only one avoidable manual intervention demonstrates that open-world evaluations can provide early warning of capabilities soon to become widespread rests on a single qualitative run. No replication across independent trials, no comparison to prior model versions, and no control conditions (e.g., different prompt phrasings or task variants) are reported, leaving the robustness of the outcome and the strength of the broader methodological recommendation under-supported.
  Authors: We agree that a single qualitative case cannot robustly support broad inferences about early-warning reliability or widespread capabilities. The iOS experiment is presented as an initial illustrative instance of the open-world approach rather than a controlled empirical study. In the revised manuscript we have explicitly reframed the example to emphasize its role in demonstrating the depth of qualitative analysis possible in real-world tasks, while removing language that could imply generalizability. We maintain that even one long-horizon success in an uncontrolled environment can surface capability signals missed by automated benchmarks, but we now clearly state that systematic replication and controls are needed for stronger claims and are planned for future CRUX evaluations.
  Revision: partial
- Referee: [§5] §5 (Recommendations for Design and Reporting): The guidelines for conducting open-world evaluations stress qualitative judgment but do not specify procedures for establishing inter-rater reliability or for documenting how 'avoidable' interventions are distinguished from necessary ones. Because these judgments are central to the validity of the qualitative signal, their absence weakens the claim that the method can be routinely applied in a reproducible manner.
  Authors: We accept this critique and have expanded the recommendations in the revised §5. We now include explicit guidance on inter-rater reliability, such as using multiple independent reviewers and reporting agreement statistics where practical. We have also added a protocol for classifying interventions, requiring documentation of the precise action taken, the prompt or context that preceded it, and a rationale for whether it was avoidable (e.g., could have been resolved by re-prompting or improved agent scaffolding). These additions directly address reproducibility concerns while preserving the qualitative nature of the method.
  Revision: yes
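The intervention protocol and agreement statistic described in the response above could be recorded along the following lines. This is a minimal sketch, not the authors' tooling: the `Intervention` record, its field names, and the choice of Cohen's kappa as the agreement statistic for two reviewers are all assumptions introduced here for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    """One manual intervention as judged by one reviewer (hypothetical schema)."""
    action: str      # the precise manual action taken
    context: str     # the prompt or agent state that preceded it
    rationale: str   # why the reviewer judged it avoidable or necessary
    avoidable: bool  # the reviewer's classification


def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa for two reviewers labelling the same interventions."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set")
    n = len(labels_a)
    # Fraction of interventions on which the reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each reviewer's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in (True, False)) / (n * n)
    if expected == 1.0:
        # Kappa is undefined (0/0) when both reviewers used one identical
        # label throughout; returning 1.0 here is a convention, not a result.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up labels from two reviewers over four interventions.
    reviewer_a = [True, True, False, False]
    reviewer_b = [True, False, False, False]
    print(cohens_kappa(reviewer_a, reviewer_b))
```

With very few interventions per run, as in the single-intervention iOS case, a chance-corrected statistic like this carries little information on its own, so the written rationale in each record would likely matter more than the number.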
Circularity Check
No circularity in derivation chain
The paper is a conceptual survey advocating open-world evaluations plus one qualitative case study of an AI agent building an iOS app. No mathematical derivations, fitted parameters, or predictions appear that reduce by construction to the paper's own inputs or self-citations. The central suggestion that such evaluations can provide early warning rests on the reported empirical instance rather than any self-referential definition or renaming of prior results. Its claims are presented as observational recommendations rather than as outputs forced by internal equations or uniqueness theorems.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Qualitative small-sample analysis can reliably indicate broader AI capabilities on open-world tasks.