ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Bin Wang; Bo Zhang; Chaofan Hu; Chunfeng Song; Dongzhan Zhou; Fangchen Yu; Fenghua Ling; Guangtao Zhai; Haoxiang Yin; Haoxuan Li

arxiv: 2606.07591 · v3 · pith:IIHFIAMAnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · cs.CL

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Pith reviewed 2026-06-29 08:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords autonomous scientific research · AI agent benchmark · re-discovery evaluation · LLM research capability · scientific workflow · multimodal rubrics · end-to-end research tasks · agent evaluation protocol

The pith

A benchmark shows top AI agents and LLMs average only 21-26 when tasked with re-discovering results from real published papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ResearchClawBench, a collection of 40 tasks drawn from actual papers across ten scientific domains, each supplied with related literature and raw data but not the target paper. Expert-designed rubrics break down the scientific artifacts into weighted criteria so that agent outputs can be scored for how well they recover the core findings. When seven autonomous research agents and seventeen LLMs are tested under a single protocol, the best agent averages 21.5, the best LLM averages 20.7, and the LLM frontier mean is 26.5. On a scale where 50 points marks target-paper-level re-discovery, these scores indicate that current systems cannot yet perform reliable end-to-end autonomous scientific research. The benchmark therefore supplies a concrete, reproducible yardstick for measuring future progress.

Core claim

ResearchClawBench evaluates autonomous research agents on 40 tasks grounded in published papers across ten domains. The strongest autonomous agent, Claude Code, averages 21.5 while the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, and the overall LLM frontier mean is 26.5. Failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. The benchmark supplies expert-curated multimodal rubrics that allow scoring of target-paper-level re-discovery while still permitting new discovery.

What carries the argument

ResearchClawBench, a benchmark of 40 tasks each tied to a hidden published paper, equipped with expert-curated multimodal rubrics that decompose scientific artifacts into weighted scoring criteria.
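To make the idea of weighted scoring criteria concrete, here is a minimal sketch of how a rubric score could be aggregated. The `Criterion` class, its field names, the 0 to 100 per-criterion scale, the weighted-mean aggregation rule, and the example values are all assumptions for illustration; this page does not describe the paper's actual rubric schema. Only the 50-point re-discovery line comes from the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """Hypothetical rubric item (illustrative, not the paper's schema)."""
    description: str
    weight: float  # relative importance assigned by the expert
    score: float   # judged score for this criterion, assumed in 0..100


# The 50-point line is taken from the Figure 1 caption.
REDISCOVERY_LINE = 50.0


def rubric_score(criteria: List[Criterion]) -> float:
    """Weighted mean of per-criterion scores (assumed aggregation rule)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    return sum(c.weight * c.score for c in criteria) / total_weight


def in_discovery_regime(score: float) -> bool:
    """True when a score lies above the target-paper-level line."""
    return score > REDISCOVERY_LINE


# Made-up example: three criteria with integer weights 5, 3, 2.
example = [
    Criterion("recovers the core finding", weight=5, score=40),
    Criterion("follows a comparable protocol", weight=3, score=10),
    Criterion("reproduces the key figure", weight=2, score=0),
]
task_score = rubric_score(example)  # (5*40 + 3*10 + 2*0) / 10 = 23.0
```

Under this assumed rule, a task score of 23.0 sits well below the 50-point line, which is the regime the reported averages fall in.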

If this is right

Development of scientific AI agents can now be tracked against a fixed set of re-discovery tasks with public rubrics.
Error patterns concentrated in protocol and evidence matching identify specific capabilities that must improve.
The rubric design leaves explicit room for agents to produce discoveries beyond the original papers.
A unified evaluation protocol enables direct comparison across different agent architectures and LLM backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be expanded with tasks that require generating and testing new hypotheses rather than recovering existing ones.
Persistent low scores point to a need for better long-horizon coordination between literature search, experiment design, and result interpretation.
Adding simulation or robotic execution layers to the tasks would expose whether current failures are mainly reasoning or execution bottlenecks.
Regular addition of newer papers would keep the benchmark from becoming a static target that agents overfit to.

Load-bearing premise

The expert-curated rubrics correctly identify the scientific core of each paper and fairly score agent outputs that may differ in form from the original publication.

What would settle it

An agent that scores above 70 on the majority of the 40 tasks under the same evaluation protocol would demonstrate reliable re-discovery of published scientific results.

Figures

Figures reproduced from arXiv: 2606.07591 by Bin Wang, Bo Zhang, Chaofan Hu, Chunfeng Song, Dongzhan Zhou, Fangchen Yu, Fenghua Ling, Guangtao Zhai, Haoxiang Yin, Haoxuan Li, Haoyu Zhou, Hengjian Gao, Jiamin Wu, Koutian Wu, Kun Li, Lei Bai, Lixue Cheng, Lu Mi, Mao Su, Mianxin Liu, Peng Ye, Qi Li, Qinglong Cao, Ruizhe Chen, Shengdu Chai, Sheng Xu, Shengyuan Xu, Shixiang Tang, Shuo Li, Siqi Sun, Tianfan Fu, Tianlin Ye, Wanghan Xu, Wanli Ouyang, Weijie Ma, Wenlong Zhang, Xiangyu Zhao, Xingjian Guo, Xinyu Gu, Xue Yang, Xuming He, Xuxuan Xie, Yifan Zhou, Yiheng Wang, Yixin Chen, Yuhao Zhou, Yuqiang Li, Zhangrui Zhao, Zhenfei Yin, Zhiwang Zhou, Zijie Guo.

**Figure 1.** Overview of ResearchClawBench. (a) ResearchClawBench spans 10 domains and 40 end-to-end tasks, covering diverse scientific questions and data modalities. (b) Overall scores of agents and LLMs; the 50-point line marks target-paper-level re-discovery, and scores above it indicate the discovery regime. (arXiv:2606.07591v3 [cs.LG], 17 Jun 2026)

**Figure 1.** Pump-off, pump-on at thetap = 0 deg, and pump-induced difference maps. The cyan marker denotes the processed replica target region used for raw-window validation.

3. Methods. 3.1 Replica-band energy test. For each processed replica entry with order n = ±1, I computed an inferred parent energy $E_{\mathrm{parent}} = E_{\mathrm{replica}} - n\hbar\omega$, using the pump energy stored in the processed feature file, pump_energy = 0.248 eV. A Floq…

**Figure 2.** Left: extracted Dirac-cone dispersion and identified replica features. Right: order-averaged replica-parent separations compared with the 0.248 eV pump photon energy.

4.2 Raw pump-induced signal near the replica region. The raw HDF5 maps support the presence of a pump-induced feature near the processed target region. Averaging pump-on minus pump-off intensity in the target window gives positive values for …

**Figure 3.** Pump-induced difference maps for thetap = 0 deg and 90 deg, an energy distribution curve through the target momentum, and comparison of raw-window signal with mean-subtracted processed polarization intensity.

4.3 Polarization dependence and Volkov final-state interpretation. The polarization CSV shows a weak but structured intensity variation. The fitted pi-periodic model gives: • model: I(theta)=c+a cos(2t…

**Figure 4.** Replica intensity versus pump polarization angle with a pi-periodic fit, shown both on linear and polar axes.

5. Validation and traceability. 5.1 Directly verified from workspace data. • The raw HDF5 axes, spectra shapes, and intensity ranges are summarized in outputs/data_overview.json. • The processed replicas are photon-spaced from their inferred parent energy by 0.248 eV for both first-order sidebands; …
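The replica-band energy test quoted in the figure excerpts above amounts to subtracting n pump photons from each replica energy. A minimal sketch of that arithmetic follows. The function names, the tolerance, and the example replica energies are hypothetical; only the relation E_parent = E_replica − nℏω and the 0.248 eV pump energy come from the excerpt.

```python
# Pump photon energy (hbar * omega) in eV, as quoted in the excerpt.
PUMP_ENERGY_EV = 0.248


def inferred_parent_energy(replica_energy_ev: float, order: int,
                           pump_energy_ev: float = PUMP_ENERGY_EV) -> float:
    """E_parent = E_replica - n * hbar*omega, with all energies in eV."""
    return replica_energy_ev - order * pump_energy_ev


def is_photon_spaced(replica_energy_ev: float, parent_energy_ev: float,
                     order: int, tol_ev: float = 0.01,
                     pump_energy_ev: float = PUMP_ENERGY_EV) -> bool:
    """Check that a replica sits n pump photons away from its parent band.

    The 0.01 eV tolerance is an illustrative assumption.
    """
    separation = replica_energy_ev - parent_energy_ev
    return abs(separation - order * pump_energy_ev) <= tol_ev


# Made-up energies: a parent feature at -0.100 eV and its two
# first-order sidebands at -0.100 +/- 0.248 eV.
upper = inferred_parent_energy(0.148, order=+1)   # approximately -0.100 eV
lower = inferred_parent_energy(-0.348, order=-1)  # approximately -0.100 eV
```

If both first-order sidebands map back to the same parent energy within tolerance, the replicas are consistent with photon spacing, which is the check the excerpt reports.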

Original abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ResearchClawBench grounds tasks in real papers with data and literature, but low re-discovery scores rest on rubrics whose fairness is not yet shown.

The paper's main move is to create 40 tasks drawn from actual published work across ten domains. Agents receive related literature and raw data but not the target paper, then get scored on whether they recover the scientific core via expert-curated weighted rubrics. This produces concrete numbers: the best agent reaches 21.5 and the best LLM 20.7, with failures clustered in protocol mismatch, evidence mismatch, and missing core.

What is new is the combination of real-paper grounding, hidden targets, supplied data plus literature, and multimodal rubrics that explicitly allow for new discovery rather than exact replication. Running the same protocol across seven full agents and seventeen LLMs under ResearchHarness is also useful for direct comparison.

The evaluation setup is straightforward and the error breakdown is clear enough to point future work at specific gaps. The numbers are reported directly from the runs.

The load-bearing uncertainty is rubric validity. The abstract states the rubrics are expert-curated and leave room for alternatives, yet gives no inter-rater agreement figures, no sensitivity checks on methodologically different but equivalent outputs, and no comparison against human re-implementations. If the weights over-penalize presentation differences or protocol variations, the low averages could partly reflect rubric choices rather than agent limits. That concern from the stress-test note stands on the information given.
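One way to picture the missing inter-rater agreement figure is Cohen's kappa between two experts who independently judge the same rubric criteria. The sketch below is an illustration only: the paper does not say which agreement statistic, if any, was collected, and the `met`/`unmet` labels and example judgments are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up judgments on four rubric criteria.
expert_1 = ["met", "met", "unmet", "unmet"]
expert_2 = ["met", "unmet", "unmet", "unmet"]
kappa = cohens_kappa(expert_1, expert_2)  # observed 0.75, expected 0.5 -> 0.5
```

A kappa near 1 would suggest the rubric criteria are applied consistently; values near 0 would support the concern that low scores partly reflect rubric interpretation.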

This is for groups building or benchmarking AI agents for science. It supplies a reproducible set of tasks they can try to beat.

It deserves peer review. The benchmark design is concrete and the evaluation is unified; referees can usefully press on rubric construction and task selection details.

Referee Report

3 major / 1 minor

Summary. The paper introduces ResearchClawBench, a benchmark with 40 tasks drawn from real published papers across 10 scientific domains. Each task supplies related literature and raw data while hiding the target paper; expert-curated multimodal rubrics decompose the target artifacts into weighted criteria for scoring autonomous re-discovery. Evaluations of seven auto-research agents and seventeen LLMs via ResearchHarness yield low averages (Claude Code at 21.5, Claude-Opus-4.7 at 20.7, frontier LLM mean 26.5), with errors concentrated in protocol mismatch, evidence mismatch, and missing scientific core. The work positions the benchmark as a reproducible frontier for measuring progress toward autonomous scientific research.

Significance. If the rubrics prove reliable, the benchmark supplies a concrete, reproducible yardstick that quantifies the distance between current AI systems and reliable end-to-end scientific re-discovery. The grounding in actual published papers, provision of raw data and literature, and explicit allowance for new discovery within the rubrics are constructive design choices that distinguish it from purely synthetic or narrow coding benchmarks.

major comments (3)

[Abstract / Rubric Construction] Abstract and rubric description: The headline claim that systems remain 'far from reliable re-discovery' rests entirely on the reported scores (21.5–26.5). No details are supplied on how the expert-curated multimodal rubrics were constructed, what inter-rater agreement was achieved, or whether they were validated against human re-implementations that use different but scientifically equivalent protocols. Without this, it is impossible to determine whether the low scores reflect agent limitations or rubric choices that over-penalize format or protocol differences.
[Error Analysis] Error analysis: The manuscript states that failures concentrate in 'experimental protocol mismatch' and 'evidence mismatch.' Because the rubrics are described only at a high level and no sensitivity analysis is reported, it remains unclear whether these categories would still dominate if the rubrics explicitly credited methodologically distinct but scientifically equivalent outputs, as the abstract claims they 'leave room for new discovery.'
[Benchmark Construction] Task selection and generalizability: The 40 tasks are said to be 'grounded in a real published paper' across 10 domains, yet no explicit criteria for task selection, difficulty calibration, or domain representativeness are provided. This omission weakens the inference that the observed performance gap is representative of autonomous research capability in general rather than an artifact of the chosen papers.

minor comments (1)

[Evaluation Protocol] The abstract refers to 'multimodal rubrics' but the manuscript should clarify whether scoring criteria include visual or data-visualization elements and how these are evaluated when agent outputs differ in presentation format.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the manuscript's transparency and rigor without altering its core claims or results.

Referee: [Abstract / Rubric Construction] Abstract and rubric description: The headline claim that systems remain 'far from reliable re-discovery' rests entirely on the reported scores (21.5–26.5). No details are supplied on how the expert-curated multimodal rubrics were constructed, what inter-rater agreement was achieved, or whether they were validated against human re-implementations that use different but scientifically equivalent protocols. Without this, it is impossible to determine whether the low scores reflect agent limitations or rubric choices that over-penalize format or protocol differences.

Authors: We agree that the manuscript would benefit from greater transparency on rubric construction. In the revised version we will add a dedicated subsection describing the process: domain experts decomposed each target paper into weighted criteria based on scientific importance, with explicit allowance for equivalent protocols and new discoveries. We will report any inter-rater agreement metrics that were collected during rubric finalization. A formal validation study against independent human re-implementations was not conducted, as the benchmark's primary purpose is to measure AI performance relative to the published artifacts; however, we will clarify this scope limitation and discuss its implications for interpreting the low scores. revision: yes
Referee: [Error Analysis] Error analysis: The manuscript states that failures concentrate in 'experimental protocol mismatch' and 'evidence mismatch.' Because the rubrics are described only at a high level and no sensitivity analysis is reported, it remains unclear whether these categories would still dominate if the rubrics explicitly credited methodologically distinct but scientifically equivalent outputs, as the abstract claims they 'leave room for new discovery.'

Authors: The error categories were derived from qualitative review of agent outputs against the rubrics. To directly address the concern, the revision will include a sensitivity analysis on a representative subset of tasks: we will re-score outputs while explicitly crediting methodologically distinct but scientifically equivalent approaches and report the resulting changes in error distributions and aggregate scores. This will test whether the dominant failure modes persist under more flexible interpretations consistent with the benchmark's design intent. revision: yes
Referee: [Benchmark Construction] Task selection and generalizability: The 40 tasks are said to be 'grounded in a real published paper' across 10 domains, yet no explicit criteria for task selection, difficulty calibration, or domain representativeness are provided. This omission weakens the inference that the observed performance gap is representative of autonomous research capability in general rather than an artifact of the chosen papers.

Authors: We will expand the Benchmark Construction section to list the explicit selection criteria (availability of open raw data and code, presence of multimodal artifacts, coverage across 10 domains, and feasibility of expert rubric creation). Difficulty was calibrated via internal pilot runs with frontier models; we will report these steps and any observed variance. While the 40 tasks cannot claim exhaustive representativeness of all scientific research, the selection aimed for diversity; the revision will include a limitations paragraph on generalizability. revision: yes
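The sensitivity analysis promised in the second response can be sketched as scoring each task twice, once under the original rubric reading and once with scientifically equivalent approaches credited, and then reporting the shift. Everything below is a hypothetical illustration: the `JudgedCriterion` fields, the weighted-mean rule, and the example numbers are assumptions, not the authors' procedure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class JudgedCriterion:
    """Hypothetical record of one criterion judged under two readings."""
    weight: float
    strict_score: float   # original rubric reading, assumed 0..100
    lenient_score: float  # equivalent methods credited, assumed 0..100


def weighted_mean(pairs: List[Tuple[float, float]]) -> float:
    """Weighted mean over (weight, score) pairs."""
    total = sum(w for w, _ in pairs)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return sum(w * s for w, s in pairs) / total


def sensitivity(
    tasks: Dict[str, List[JudgedCriterion]]
) -> Dict[str, Tuple[float, float, float]]:
    """Per task: (strict score, lenient score, lenient minus strict)."""
    report = {}
    for task_id, criteria in tasks.items():
        strict = weighted_mean([(c.weight, c.strict_score) for c in criteria])
        lenient = weighted_mean([(c.weight, c.lenient_score) for c in criteria])
        report[task_id] = (strict, lenient, lenient - strict)
    return report


# Made-up task: crediting an equivalent protocol lifts one criterion.
demo = {
    "task_01": [
        JudgedCriterion(weight=2, strict_score=20, lenient_score=60),
        JudgedCriterion(weight=2, strict_score=40, lenient_score=40),
    ]
}
result = sensitivity(demo)  # task_01 -> (30.0, 50.0, 20.0)
```

Small deltas across tasks would indicate that the dominant failure modes persist under a more flexible reading; large deltas would point to rubric strictness as a contributing factor.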

Circularity Check

0 steps flagged

No significant circularity; benchmark uses external ground truth

The paper defines ResearchClawBench using 40 tasks drawn from independently published external papers as ground truth, with expert-curated rubrics applied to score agent and LLM outputs. Reported averages (e.g., 21.5 for Claude Code) are computed directly from these fixed external references and rubrics under a unified protocol. No equations, fitted parameters, self-referential predictions, or derivations appear in the abstract or described structure; the central claim of low re-discovery performance rests on empirical evaluation against outside artifacts rather than reducing to quantities defined within the benchmark itself. This satisfies the condition of being self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's central claim rests on the assumption that the chosen tasks and rubrics are representative and that hiding the target paper prevents trivial leakage; no free parameters or invented entities are introduced beyond the benchmark construction itself.

axioms (2)

domain assumption: The 40 tasks drawn from published papers are representative of end-to-end scientific research across the 10 domains.
Invoked when generalizing the low scores to the broader claim that current systems are far from reliable autonomous research.

domain assumption: Expert-curated rubrics provide a valid decomposition of scientific artifacts into weighted, scorable criteria.
This premise is required for the reported average scores to be interpreted as measures of research capability.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements
cs.AI · 2026-06 · unverdicted · novelty 6.0

Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
cs.AI · 2026-06 · unverdicted · novelty 5.0

EurekAgent achieves new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks by engineering agent environments for autonomous scientific discovery, including a 26-circle packing resu...

Reference graph

Works this paper leans on

38 extracted references · 22 canonical work pages · cited by 2 Pith papers · 13 internal anchors

[1]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Humanity's Last Exam

Humanity's last exam , author=. arXiv preprint arXiv:2501.14249 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Advances in Neural Information Processing Systems , volume=

Scicode: A research coding benchmark curated by scientists , author=. Advances in Neural Information Processing Systems , volume=
[4]

International Conference on Learning Representations , volume=

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery , author=. International Conference on Learning Representations , volume=
[5]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

PaperBench: Evaluating AI's Ability to Replicate AI Research

PaperBench: Evaluating AI's Ability to Replicate AI Research , author=. arXiv preprint arXiv:2504.01848 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Scienceworld: Is your agent smarter than a 5th grader? , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022
[8]

arXiv preprint arXiv:2502.14499 , year=

Mlgym: A new framework and benchmark for advancing ai research agents , author=. arXiv preprint arXiv:2502.14499 , year=

work page arXiv
[9]

Proceedings of the 3rd Workshop on Noisy User-generated Text , pages=

Crowdsourcing multiple choice science questions , author=. Proceedings of the 3rd Workshop on Noisy User-generated Text , pages=
[10]

Huang, J

Mlagentbench: Evaluating language agents on machine learning experimentation , author=. arXiv preprint arXiv:2310.03302 , year=

work page arXiv
[11]

International Conference on Learning Representations , volume=

Mle-bench: Evaluating machine learning agents on machine learning engineering , author=. International Conference on Learning Representations , volume=
[12]

Advances in Neural Information Processing Systems , volume=

Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents , author=. Advances in Neural Information Processing Systems , volume=
[13]

Lupidi, B

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents , author=. arXiv preprint arXiv:2602.06855 , year=

work page arXiv
[14]

Advances in Neural Information Processing Systems , volume=

Mlr-bench: Evaluating ai agents on open-ended machine learning research , author=. Advances in Neural Information Processing Systems , volume=
[15]

AI & SOCIETY , volume=

Researchers’ perceptions of automating scientific research , author=. AI & SOCIETY , volume=. 2025 , publisher=

2025
[16]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

From generation to judgment: Opportunities and challenges of llm-as-a-judge , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[17]

EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026

Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery , author=. arXiv preprint arXiv:2603.08127 , year=

work page arXiv
[18]

Journal of Systems and Software , volume=

Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents , author=. Journal of Systems and Software , volume=. 2025 , publisher=

2025
[19]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Webarena: A realistic web environment for building autonomous agents , author=. arXiv preprint arXiv:2307.13854 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=
[21]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Scibench: Evaluating college-level scientific problem-solving abilities of large language models , author=. arXiv preprint arXiv:2307.10635 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

arXiv preprint arXiv:2506.12958 , year=

Domain specific benchmarks for evaluating multimodal large language models , author=. arXiv preprint arXiv:2506.12958 , year=

work page arXiv
[23]

Bioinformatics , volume=

Chembench: a cheminformatics workbench , author=. Bioinformatics , volume=. 2010 , publisher=

2010
[24]

Advances in neural information processing systems , volume=

What can large language models do in chemistry? a comprehensive benchmark on eight tasks , author=. Advances in neural information processing systems , volume=
[25]

The Fourteenth International Conference on Learning Representations , year=

Earthse: A benchmark evaluating earth scientific exploration capability for large language models , author=. The Fourteenth International Conference on Learning Representations , year=
[26]

MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs

MSEarth: A Multimodal Scientific Dataset and Benchmark for Phenomena Uncovering in Earth Science , author=. arXiv preprint arXiv:2505.20740 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark , author=. arXiv preprint arXiv:2409.11363 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

Autoreproduce: Automatic ai experiment reproduction with paper lineage , author=. arXiv preprint arXiv:2505.20662 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration , author=. arXiv preprint arXiv:2605.03042 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

arXiv preprint arXiv:2512.16969 , year=

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows , author=. arXiv preprint arXiv:2512.16969 , year=

work page arXiv
[31]

arXiv preprint arXiv:2511.14366 , year=

ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning , author=. arXiv preprint arXiv:2511.14366 , year=

work page arXiv
[32]

Kimi K2.5: Visual Agentic Intelligence

Kimi K2. 5: Visual Agentic Intelligence , author=. arXiv preprint arXiv:2602.02276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

arXiv preprint arXiv:2602.09132 , year=

SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery , author=. arXiv preprint arXiv:2602.09132 , year=

work page arXiv
[34]

MinerU: An Open-Source Solution for Precise Document Content Extraction

Mineru: An open-source solution for precise document content extraction , author=. arXiv preprint arXiv:2409.18839 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

2026 , howpublished =

Mingxin Yang , title =. 2026 , howpublished =

2026
[36]

Towards an AI co-scientist

Towards an AI co-scientist , author=. arXiv preprint arXiv:2502.18864 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

arXiv preprint arXiv:2505.18705 , year=

Ai-researcher: Autonomous scientific innovation , author=. arXiv preprint arXiv:2505.18705 , year=

work page arXiv
[38]

2026 , eprint=

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery , author=. 2026 , eprint=

2026

[1] [1]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Humanity's Last Exam

Humanity's last exam , author=. arXiv preprint arXiv:2501.14249 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Advances in Neural Information Processing Systems , volume=

Scicode: A research coding benchmark curated by scientists , author=. Advances in Neural Information Processing Systems , volume=

[4] [4]

International Conference on Learning Representations , volume=

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery , author=. International Conference on Learning Representations , volume=

[5] [5]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

PaperBench: Evaluating AI's Ability to Replicate AI Research

PaperBench: Evaluating AI's Ability to Replicate AI Research , author=. arXiv preprint arXiv:2504.01848 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Scienceworld: Is your agent smarter than a 5th grader? , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022

[8] [8]

arXiv preprint arXiv:2502.14499 , year=

Mlgym: A new framework and benchmark for advancing ai research agents , author=. arXiv preprint arXiv:2502.14499 , year=

work page arXiv

[9] [9]

Proceedings of the 3rd Workshop on Noisy User-generated Text , pages=

Crowdsourcing multiple choice science questions , author=. Proceedings of the 3rd Workshop on Noisy User-generated Text , pages=

[10] [10]

Huang, J

Mlagentbench: Evaluating language agents on machine learning experimentation , author=. arXiv preprint arXiv:2310.03302 , year=

work page arXiv

[11] [11]

International Conference on Learning Representations , volume=

Mle-bench: Evaluating machine learning agents on machine learning engineering , author=. International Conference on Learning Representations , volume=

[12] [12]

Advances in Neural Information Processing Systems , volume=

Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents , author=. Advances in Neural Information Processing Systems , volume=

[13] [13]

Lupidi, B

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents , author=. arXiv preprint arXiv:2602.06855 , year=

work page arXiv

[14] [14]

Advances in Neural Information Processing Systems , volume=

Mlr-bench: Evaluating ai agents on open-ended machine learning research , author=. Advances in Neural Information Processing Systems , volume=

[15] Researchers’ perceptions of automating scientific research. AI & SOCIETY, 2025.

[16] From generation to judgment: Opportunities and challenges of LLM-as-a-judge. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.

[17] EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery. arXiv preprint arXiv:2603.08127, 2026.

[18] Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents. Journal of Systems and Software, 2025.

[19] WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

[20] MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems.

[21] SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635.

[22] Domain specific benchmarks for evaluating multimodal large language models. arXiv preprint arXiv:2506.12958.

[23] Chembench: a cheminformatics workbench. Bioinformatics, 2010.

[24] What can large language models do in chemistry? A comprehensive benchmark on eight tasks. Advances in Neural Information Processing Systems.

[25] EarthSE: A benchmark evaluating earth scientific exploration capability for large language models. The Fourteenth International Conference on Learning Representations.

[26] MSEarth: A Multimodal Scientific Dataset and Benchmark for Phenomena Uncovering in Earth Science. arXiv preprint arXiv:2505.20740.

[27] CORE-Bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. arXiv preprint arXiv:2409.11363.

[28] AutoReproduce: Automatic AI experiment reproduction with paper lineage. arXiv preprint arXiv:2505.20662.

[29] ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration. arXiv preprint arXiv:2605.03042.

[30] Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows. arXiv preprint arXiv:2512.16969.

[31] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning. arXiv preprint arXiv:2511.14366.

[32] Kimi K2.5: Visual Agentic Intelligence. arXiv preprint arXiv:2602.02276.

[33] SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery. arXiv preprint arXiv:2602.09132.

[34] MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839.

[35] Mingxin Yang. 2026.

[36] Towards an AI co-scientist. arXiv preprint arXiv:2502.18864.

[37] AI-Researcher: Autonomous scientific innovation. arXiv preprint arXiv:2505.18705.

[38] InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery. 2026.