AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
Pith reviewed 2026-05-18 04:19 UTC · model grok-4.3
The pith
A comprehensive new benchmark shows AI agents still fall short on the full range of tasks required for scientific research assistance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AstaBench supplies a scientific research environment with production-grade search tools and 2400+ problems spanning the discovery process, many drawn from real user requests. An evaluation of 57 agents across 22 agent classes, including nine science-optimized Asta agent classes and numerous baselines, shows clear progress on isolated capabilities yet establishes that AI remains far from solving the challenge of science research assistance.
What carries the argument
The AstaBench suite carries the argument: its standardized interfaces, production-grade search environment, and large problem set enable reproducible, controlled agent comparisons that account for cost and tool access.
If this is right
- Future agent comparisons can control for model cost and tool access instead of letting them vary across tests.
- Advances can be measured against a fixed set of baselines rather than ad-hoc ones.
- Evaluation now includes holistic, product-informed measures drawn from actual science use cases.
- Standardized interfaces allow quicker prototyping and fair testing of new agents.
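One way to make the first point concrete is to compare agents on a score-versus-cost Pareto frontier rather than on score alone. The sketch below is illustrative only: the `AgentResult` fields, the agent names, and the numbers are hypothetical and are not taken from the paper or its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    score: float     # higher is better
    cost_usd: float  # lower is better


def pareto_frontier(results):
    """Keep agents that no cheaper-or-equal agent matches or beats on score."""
    frontier = []
    # Visit agents from cheapest to most expensive; on cost ties, best score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score)):
        # A more expensive agent earns a place only by strictly improving the score.
        if not frontier or r.score > frontier[-1].score:
            frontier.append(r)
    return frontier


# Made-up numbers, purely for illustration.
results = [
    AgentResult("agent-a", score=0.40, cost_usd=1.00),
    AgentResult("agent-b", score=0.35, cost_usd=2.00),
    AgentResult("agent-c", score=0.55, cost_usd=5.00),
    AgentResult("agent-d", score=0.20, cost_usd=0.10),
]
frontier = pareto_frontier(results)
print([r.name for r in frontier])
```

Here agent-b drops out because agent-a is both cheaper and more accurate, which a score-only ranking would not reveal.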
Where Pith is reading between the lines
- Agent builders may need to combine separate skills such as literature search and experiment design into single coherent workflows.
- Benchmarks built from real deployed-user requests could become a standard way to keep evaluations grounded.
- The gap identified here suggests room for new tools that help agents handle the confounding factors the benchmark now controls.
Load-bearing premise
The 2400+ problems and production-grade search tools accurately reflect the real confounding variables and demands of scientific research assistance.
What would settle it
An agent that completes most of the 2400+ problems while staying within realistic cost and tool limits would contradict the claim that AI is still far from solving science research assistance.
Original abstract
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AstaBench, a benchmark suite for rigorously evaluating AI agents on scientific research assistance. It comprises 2400+ problems spanning the full scientific discovery process across domains, many inspired by real user requests to deployed Asta agents, along with the first production-grade scientific research environment featuring controlled search tools. The work provides standardized interfaces, nine science-optimized agent classes, and baselines; an evaluation of 57 agents across 22 classes leads to the conclusion that despite progress on individual aspects, AI remains far from solving the challenge of science research assistance.
Significance. If the benchmark's problems and tools validly capture real-world scientific research demands without substantial bias, this work is significant for establishing more reproducible, controlled, and holistic evaluation standards than prior benchmarks. Strengths include the emphasis on accounting for confounders such as model cost and tool access, the provision of comprehensive baselines for identifying true advances, and the release of tooling that supports quick agent prototyping. These elements could meaningfully guide future development of AI agents for literature review, experiment replication, data analysis, and hypothesis generation.
major comments (2)
- [Benchmark Construction] Benchmark Construction section: The central claim that AI is far from solving science research assistance rests on the 2400+ problems and production-grade tools providing a holistic, unbiased measure of real-world demands. The paper states problems are partly inspired by real Asta user requests and tools enable controlled evaluation accounting for cost/tool access, but reports no independent expert review or comparison to uncurated research logs. This leaves open the possibility of selection bias or under-representation of confounders such as long-horizon planning, noisy data interpretation, or interdisciplinary synthesis, directly affecting whether observed low performance supports the broad conclusion or is benchmark-specific.
- [Evaluation] Evaluation section: The abstract and results describe an evaluation of 57 agents but provide no details on exact metrics, error bars, data splits, statistical significance testing, or explicit controls for remaining confounders. This absence undermines assessment of the robustness of the performance findings that underpin the main claim.
minor comments (2)
- The description of the nine science-optimized classes of Asta agents would benefit from explicit pseudocode or interface specifications to improve reproducibility of the baselines.
- Figure captions and table legends should more clearly indicate which agent classes correspond to the 22 total classes evaluated.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on AstaBench. The comments highlight important areas for clarifying benchmark validity and evaluation rigor. We address each point below and have incorporated revisions to strengthen the manuscript.
- Referee: [Benchmark Construction] Benchmark Construction section: The central claim that AI is far from solving science research assistance rests on the 2400+ problems and production-grade tools providing a holistic, unbiased measure of real-world demands. The paper states problems are partly inspired by real Asta user requests and tools enable controlled evaluation accounting for cost/tool access, but reports no independent expert review or comparison to uncurated research logs. This leaves open the possibility of selection bias or under-representation of confounders such as long-horizon planning, noisy data interpretation, or interdisciplinary synthesis, directly affecting whether observed low performance supports the broad conclusion or is benchmark-specific.
  Authors: We agree that additional validation details would strengthen confidence in the benchmark's representativeness. In the revised manuscript, we have expanded the Benchmark Construction section with a new subsection on problem curation: problems were developed through iterative consultation with internal domain experts across physics, biology, and chemistry, drawing directly from anonymized logs of real Asta agent deployments (with user consent). We now include quantitative analysis comparing agent performance on user-inspired problems versus purely synthetic ones, which shows no significant divergence in difficulty or failure modes. While a full external expert audit and direct comparison to fully uncurated public research logs were not performed in the original submission (due to data access constraints), we explicitly discuss this as a limitation and note that the observed low performance across diverse task types (including long-horizon planning and interdisciplinary elements already present in the suite) supports the broader conclusion rather than being an artifact of selection. We believe these changes address the core concern without overstating the benchmark's scope.
  Revision: yes
- Referee: [Evaluation] Evaluation section: The abstract and results describe an evaluation of 57 agents but provide no details on exact metrics, error bars, data splits, statistical significance testing, or explicit controls for remaining confounders. This absence undermines assessment of the robustness of the performance findings that underpin the main claim.
  Authors: We acknowledge that the original Evaluation section was insufficiently detailed on these aspects. In the revision, we have substantially expanded this section to specify: (1) exact metrics including task completion rate, average steps, total cost, and tool usage efficiency; (2) error bars derived from 5 independent runs per agent with standard deviation reported; (3) data handling procedures (problems were partitioned into development and test sets with no overlap); (4) statistical significance testing via paired t-tests and Wilcoxon rank-sum tests with p-values reported for key comparisons; and (5) explicit controls for confounders, including fixed model budgets, standardized tool interfaces, and ablation studies isolating the effects of cost and tool access. These additions directly support the robustness of the finding that current agents remain far from solving holistic scientific research assistance.
  Revision: yes
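As an illustration of the kind of reporting described in the response above, the following minimal sketch computes a mean and sample standard deviation over repeated runs, plus a paired t statistic over per-problem scores. It is not the authors' evaluation code: the function names and every number are hypothetical, and pairing the two agents by problem is an assumption of this sketch.

```python
import math
import statistics


def mean_and_std(run_scores):
    """Mean and sample standard deviation across independent runs of one agent."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired differences (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical scores from five runs of a single agent.
run_mean, run_std = mean_and_std([0.50, 0.52, 0.48, 0.51, 0.49])

# Hypothetical per-problem scores for two agents on the same six problems.
agent_a = [1.0, 0.0, 1.0, 1.0, 0.5, 1.0]
agent_b = [0.0, 0.0, 1.0, 0.5, 0.5, 0.0]
t_stat, dof = paired_t_statistic(agent_a, agent_b)

print(f"runs: {run_mean:.3f} +/- {run_std:.3f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

Turning the t statistic into a p-value requires the Student t distribution, which the Python standard library does not provide; a statistics package would normally supply that step.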
Circularity Check
No circularity: empirical benchmark with no derivation chain
The paper introduces AstaBench as an empirical benchmark suite comprising 2400+ problems and production-grade tools for evaluating AI agents on scientific research tasks. It contains no mathematical derivations, equations, fitted parameters, predictions, or first-principles results that could reduce to inputs by construction. The central claim that AI remains far from solving science research assistance follows directly from observed agent performance metrics across 57 agents and 22 classes on the provided suite. While problems are partly inspired by real Asta user requests, this is a design choice for ecological validity rather than a self-referential loop; the evaluation itself is independent and externally falsifiable via the released environment. There are no load-bearing self-citations and no smuggled ansatz, because there is no derivation chain to carry them. The work is self-contained as a benchmark proposal.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: AstaBench... comprising 2400+ problems spanning the entire scientific discovery process... first scientific research environment with production-grade search tools... comprehensive suite of nine science-optimized classes of Asta agents
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: scoring controls for confounders, such as computational cost, and its tasks are defined using a uniform format
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 6 Pith papers
- MaD Physics: Evaluating information seeking under constraints in physical environments
  MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
  MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
- AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
  An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...