Autonomous Scientific Discovery via Iterative Meta-Reflection

Bingchen Zhao; Oisin Mac Aodha; Sara Beery

arxiv: 2607.01131 · v1 · pith:WVZL2AUZnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI

Pith reviewed 2026-07-02 13:39 UTC · model grok-4.3

keywords: autonomous scientific discovery · large language models · meta-reflection · ecological patterns · hypothesis validation · multimodal data · iNatDisco · causal discovery

The pith

DiscoPER recovers 8 of 9 known ecological patterns by using meta-reflection on its own prior discoveries to guide open-ended hypothesis search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiscoPER, an LLM-based system that generates code to test hypotheses on datasets without any pre-set research questions. It requires every candidate discovery to pass statistical tests for validity. A second-order meta-reflection step periodically treats the system's own findings as data to detect patterns, confounds, and gaps, then steers further exploration away from covered areas. Tool use lets the system pull information from images and other multimodal inputs. On a new benchmark built from peer-reviewed ecological literature, the method recovers eight of nine documented patterns at a 72.7 percent support rate and exceeds classical causal-discovery and plain LLM baselines.

Core claim

DiscoPER performs open-ended research by dynamically generating and executing code to explore datasets. Every proposed discovery must pass statistical testing. A second-order reasoning mechanism periodically analyzes accumulated discoveries as empirical data to identify structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions. Tool use expands the search to multimodal sources such as images. Evaluated on the iNatDisco benchmark with pattern-level ground truth from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines.

What carries the argument

The second-order meta-reflection mechanism that treats prior discoveries as empirical data to detect structural patterns, confounds, and epistemic gaps and then redirects the search.
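One simple reading of "treating prior discoveries as empirical data" is coverage analysis over the discovery log. The sketch below assumes each logged discovery lists the variables it involved; the record layout, the `find_gaps` name, and the variable names are hypothetical, and the paper's reflection step is LLM-driven and likely much richer than this tally.

```python
from collections import Counter
from itertools import combinations

def find_gaps(discoveries, variables):
    """Tally which variable pairs earlier hypotheses touched and
    return the pairs that no hypothesis has examined yet."""
    tested = Counter()
    for d in discoveries:
        for pair in combinations(sorted(d["variables"]), 2):
            tested[pair] += 1
    all_pairs = combinations(sorted(variables), 2)
    gaps = [pair for pair in all_pairs if tested[pair] == 0]
    return {"coverage": dict(tested), "gaps": gaps}

# Hypothetical log of earlier findings
log = [
    {"statement": "A differs by month", "variables": ["month", "taxon"], "supported": True},
    {"statement": "B varies with latitude", "variables": ["latitude", "taxon"], "supported": False},
]
report = find_gaps(log, ["latitude", "month", "taxon"])
print(report["gaps"])
```

The returned gaps could then be passed to the hypothesis generator as guidance toward unexplored variable combinations.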

If this is right

The approach works without any pre-specified research objectives.
Second-order meta-reflection improves performance over standard iterative hypothesis generation.
Tool use for multimodal inputs enlarges the reachable search space.
The system scales with additional data volume.
Every discovery is required to pass statistical testing before acceptance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same meta-reflection loop could be applied to multimodal datasets outside ecology to test whether recovery rates remain high.
If epistemic-gap detection works as described, the method might systematically surface areas where existing literature is sparse.
Combining the framework with richer code-execution sandboxes could allow validation of more complex hypotheses than the current statistical tests cover.

Load-bearing premise

Statistical testing of each proposed discovery is sufficient to guarantee scientific validity and the meta-reflection step does not introduce biases that change the reported recovery rate.

What would settle it

Re-running the full DiscoPER evaluation on the iNatDisco benchmark and obtaining either fewer than eight of the nine known patterns recovered or a hypothesis support rate materially below 72.7%.
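The two quantities in this check can be computed as below. This is a sketch under stated assumptions: the page does not give the paper's exact definitions, so recall is taken here as the fraction of known patterns matched by at least one supported hypothesis and support rate as the fraction of tested hypotheses that pass. The pattern ids and the `outcomes` list are placeholders, not the paper's data.

```python
def pattern_recall(recovered, known):
    """Fraction of known patterns matched by at least one supported hypothesis."""
    known = set(known)
    return len(known & set(recovered)) / len(known)

def support_rate(outcomes):
    """Fraction of tested hypotheses that passed their statistical test."""
    return sum(outcomes) / len(outcomes)

known = [f"P{i}" for i in range(1, 10)]     # nine literature patterns (placeholder ids)
recovered = [f"P{i}" for i in range(1, 9)]  # eight of them recovered
recall = pattern_recall(recovered, known)

outcomes = [True, True, False, True]        # made-up test results, not the paper's
rate = support_rate(outcomes)
print(round(recall, 3), rate)
```

A rerun that yields a recall below 8/9 or a support rate materially below 0.727 would count against the reported result.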

Figures

Figures reproduced from arXiv: 2607.01131 by Bingchen Zhao, Oisin Mac Aodha, Sara Beery.

**Figure 2.** DiscoPER is an iterative scientific discovery system consisting of three core modules: …

**Figure 3.** Scaling behavior on iNatDisco-50K. (a) Providing more data improves recall and yields more supported insights. (b) More model iterations increase recall, but the support rate decreases as the model moves on from easy hypotheses and starts to propose more speculative ones. … iNatDisco-50K) and support rate, confirming that REFLECT not only broadens what the system investigates but also improves the quality of…

**Figure 4.** Experiments on our iNatDisco-800-CF counterfactual dataset.

**Figure 5.** Ablation of reflection. Left: distribution of generated hypotheses with and without REFLECT. Without reflection, hypotheses are dominated by simple pairwise comparisons, while REFLECT produces a broader set of seasonal, interaction, visual, and correlation-based hypotheses. Right: examples of guidance produced by REFLECT, including gap detection, compound hypothesis generation, and confound detection. Thes…

**Figure 6.** Examples of vision-grounded discoveries produced by DiscoPER. DiscoPER can use visual …

Abstract

Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefits of second-order meta-reflection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiscoPER's meta-reflection and multimodal setup are a reasonable step for LLM discovery agents, but the recovery numbers on literature-derived patterns look vulnerable to pretraining recall rather than genuine data-driven exploration.

The paper's core move is adding periodic second-order reflection over its own accumulated hypotheses, plus dynamic code execution and image tool use, to let an LLM do open-ended search on ecological data without a preset question. On the new iNatDisco benchmark it recovers 8 of 9 literature patterns at 72.7% support and beats the cited baselines, with ablations suggesting the reflection step helps and that more data improves results.

That combination of reflection plus multimodal access is the clearest addition over simpler iterative LLM or causal-discovery baselines. The benchmark itself, built from peer-reviewed patterns with pattern-level ground truth, gives a concrete way to score open-ended output.

The main weakness is the one the stress-test flags. The ground-truth patterns come from published literature that LLMs have almost certainly seen during training. Nothing in the abstract or described evaluation shows controls that would separate recall from actual dataset-driven discovery—no cutoff models, no data-only ablations, no leakage checks. Without those, the 72.7% figure is hard to interpret as evidence of autonomous exploration. The statistical testing requirement is stated but the abstract gives no detail on procedures, multiple-testing correction, or how hypotheses were chosen for testing, so it is difficult to judge whether the support rate is robust.

This is aimed at groups building LLM agents for science in data-rich, multimodal domains. A reader already working on agentic discovery or ecological AI could extract the reflection mechanism and the benchmark construction as useful pieces even if the main result needs stronger controls. It is coherent enough on its own terms to deserve referee time, provided the authors can address the recall concern directly.
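The multiple-testing concern raised in the letter can be made concrete. The abstract does not say whether any correction is applied; the Benjamini-Hochberg procedure below is one standard choice, shown here as an illustration and not as the paper's method. The p-values are invented.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    under the Benjamini-Hochberg false-discovery-rate procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k/m * alpha
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

p = [0.001, 0.04, 0.03, 0.20]   # invented p-values
naive = [value < 0.05 for value in p]
corrected = benjamini_hochberg(p)
print(naive, corrected)
```

With these inputs an uncorrected 0.05 threshold accepts three hypotheses while the corrected procedure accepts one, which is why the choice of correction matters for how a support rate should be read.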

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DiscoPER, an LLM-powered autonomous discovery framework that performs open-ended research via dynamic code generation and execution on datasets (without pre-specified objectives), requires statistical testing for every proposed discovery, adds a second-order meta-reflection mechanism that periodically treats accumulated discoveries as data to identify structural patterns, confounds, and epistemic gaps, and incorporates multimodal tool use for hypotheses involving images and other non-structured inputs. It presents the new iNatDisco multimodal ecological benchmark whose pattern-level ground truth is drawn from peer-reviewed literature, and reports that DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate while outperforming classical causal discovery and LLM-guided baselines; ablations indicate scaling with data volume and benefit from the meta-reflection component.

Significance. If the reported recovery rate and outperformance prove robust after controls for pretraining effects and full methodological disclosure, the work would be significant for demonstrating a concrete mechanism (iterative second-order reflection) that expands search beyond isolated hypothesis generation and for releasing a new benchmark with literature-derived ground truth. The emphasis on statistical validation and multimodal integration addresses two recurring limitations in current autonomous-discovery systems.

major comments (2)

[Abstract] Abstract (evaluation paragraph): the central claim that DiscoPER recovers 8 of 9 literature-derived patterns through its data-driven mechanisms (dynamic code execution, statistical testing, and meta-reflection) is load-bearing, yet no controls are described to rule out parametric recall from pretraining on the same peer-reviewed sources that supplied the ground-truth patterns (e.g., knowledge-cutoff models, data-only baselines that disable LLM parametric knowledge, or leakage audits).
[Abstract] Abstract (framework and evaluation paragraphs): the 72.7% hypothesis support rate and outperformance statements rest on unspecified statistical procedures, data exclusion rules, error analysis, and exact implementation details; without these it is impossible to verify that the reported figures survive scrutiny or that the meta-reflection step does not introduce bias.

minor comments (2)

The phrase 'second-order reasoning mechanism' is introduced without a concise formal definition or pseudocode sketch in the abstract or early sections, making it harder for readers to distinguish it from standard iterative prompting.
The iNatDisco benchmark description would benefit from an explicit statement of how many images, metadata fields, and literature sources are included, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments highlight important issues around pretraining controls and methodological transparency in the abstract. We address each below with specific plans for revision.

Referee: [Abstract] Abstract (evaluation paragraph): the central claim that DiscoPER recovers 8 of 9 literature-derived patterns through its data-driven mechanisms (dynamic code execution, statistical testing, and meta-reflection) is load-bearing, yet no controls are described to rule out parametric recall from pretraining on the same peer-reviewed sources that supplied the ground-truth patterns (e.g., knowledge-cutoff models, data-only baselines that disable LLM parametric knowledge, or leakage audits).

Authors: We agree this is a substantive concern for claims of data-driven discovery. The iNatDisco patterns were drawn from peer-reviewed sources post-dating common training cutoffs where possible, but we did not include explicit controls such as leakage audits or non-LLM baselines. In revision we will add (1) a new ablation using a purely statistical baseline that disables LLM parametric knowledge and (2) a short discussion of potential leakage risks with the benchmark construction details. These additions will appear in Section 4 and a new appendix.
Revision: yes

Referee: [Abstract] Abstract (framework and evaluation paragraphs): the 72.7% hypothesis support rate and outperformance statements rest on unspecified statistical procedures, data exclusion rules, error analysis, and exact implementation details; without these it is impossible to verify that the reported figures survive scrutiny or that the meta-reflection step does not introduce bias.

Authors: The statistical procedures (hypothesis testing via permutation tests and bootstrap confidence intervals), data exclusion rules, error analysis, and implementation details are fully specified in Sections 3.2, 4.1, and Appendix B. However, the abstract is too terse. We will revise the abstract to include a one-sentence summary of the support-rate calculation and add a compact table in the main text summarizing per-pattern support rates, exclusion counts, and meta-reflection impact. This addresses transparency without altering the reported numbers.
Revision: partial
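For readers unfamiliar with the two procedures named in this response, a generic version looks like the sketch below. It assumes a two-group comparison of means; the function names, default resample counts, and the sample values are illustrative and are not taken from Sections 3.2, 4.1, or Appendix B of the paper.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)

def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Invented measurements for two groups
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.9]
group_b = [4.1, 4.4, 3.9, 4.6, 4.0, 4.3, 4.5, 4.2]
p_value = permutation_test(group_a, group_b)
lower, upper = bootstrap_ci(group_a, group_b)
print(p_value, lower, upper)
```

A hypothesis would count as supported under this sketch when the p-value falls below the chosen threshold and the interval excludes zero; the paper's actual acceptance rule may differ.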

Circularity Check

0 steps flagged

No circularity; empirical benchmark result with independent evaluation

The paper describes an LLM-based framework (DiscoPER) and reports an empirical recovery rate (8 of 9 patterns, 72.7% support) on the iNatDisco benchmark whose ground truth is drawn from external peer-reviewed literature. No equations, fitted parameters, or first-principles derivations are present that reduce the reported metric to a quantity defined by the same inputs. The evaluation relies on statistical testing of proposed discoveries and ablations, which are external to any self-referential construction. Self-citation is not invoked as a load-bearing uniqueness theorem or ansatz. The result is therefore self-contained against the benchmark and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; limited visibility into implementation details means the ledger captures only the high-level assumptions stated or implied in the summary.

axioms (2)

domain assumption: Large language models can reliably generate and execute code for statistical hypothesis testing on real datasets.
Required for the core loop of hypothesis generation and validation described in the abstract.
ad hoc to paper: Periodic second-order analysis of accumulated discoveries can identify confounds and epistemic gaps that productively redirect future exploration.
This is the load-bearing innovation of the meta-reflection component.

invented entities (2)

DiscoPER framework (no independent evidence)
purpose: Autonomous open-ended discovery via code execution and meta-reflection
The main system introduced by the paper.
iNatDisco benchmark (no independent evidence)
purpose: Multimodal ecological dataset with peer-reviewed pattern ground truth for evaluation
New resource created to test the system.

pith-pipeline@v0.9.1-grok · 5790 in / 1613 out tokens · 45973 ms · 2026-07-02T13:39:55.783408+00:00 · methodology

Reference graph

Works this paper leans on

63 extracted references · 7 canonical work pages · 3 internal anchors

[1] iNaturalist. https://www.inaturalist.org. Accessed on 2026-05-05.

[2] Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, et al. Autodiscovery: Open-ended scientific discovery via bayesian surprise. In NeurIPS, 2025.

[3] Roberto Ambrosini, Diego Rubolini, Anders Pape Møller, Luciano Bani, Jacquie Clark, Zsolt Karcza, Didier Vangeluwe, Chris du Feu, Fernando Spina, and Nicola Saino. Climate change and the long-term northward shift in the african wintering range of the barn swallow hirundo rustica. Climate Research, 2011.

[4] Jonathan J. Bennie, James P. Duffy, Richard Inger, and Kevin J. Gaston. Biogeography of time partitioning in mammals. PNAS, 2014.

[5] Lynne Boddy, Ulf Büntgen, Simon Egli, Alan C Gange, Einar Heegaard, Paul M Kirk, Aqilah Mohammad, and Håvard Kauserud. Climate variation effects on fungal fruiting. Fungal Ecology, 2014.

[6] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 2023.

[7] Lincoln P Brower. Monarch butterfly orientation: missing pieces of a magnificent puzzle. Journal of Experimental Biology, 1996.

[8] David Maxwell Chickering. Optimal structure identification with greedy search. JMLR, 2002.

[9] Frank-M. Chmielewski and Thomas Rötzer. Response of tree phenology to climate change across europe. Agricultural and Forest Meteorology, 2001.

[10] Natalie Cooper and Andy Purvis. Body size evolution in mammals: complexity in tempo and mode. The American Naturalist, 2010.

[11] Kevin J. Gaston. The Structure and Dynamics of Geographic Ranges. Oxford University Press, 2003.

[12] Valerius Geist. Deer of the World: Their Evolution, Behaviour, and Ecology. Stackpole Books, 1998.

[13] Alireza Ghafarollahi and Markus J Buehler. SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. Advanced Materials, 2025.

[14] Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J Szostkiewicz, Jon M Laurent, Muhammed T Razzak, Andrew D White, Michaela M Hinks, and Samuel G Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv:2505.13400, 2025.

[15] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, et al. Towards an AI co-scientist. arXiv:2502.18864, 2025.

[16] Elmer Gray, Eugene M. McGehee, and Don F. Carlisle. Seasonal variation in flowering of common dandelion. Weed Science, 1973.

[17] Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, et al. Blade: Benchmarking language model agents for data-driven science. In EMNLP (Findings), 2024.

[18] Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah, and Balaji Krishnamurthy. Accelerating social science research via agentic hypothesization and experimentation. arXiv:2602.07983, 2026.

[19] Megan L Head, Luke Holman, Rob Lanfear, Andrew T Kahn, and Michael D Jennions. The extent and consequences of p-hacking in science. PLoS Biology, 2015.

[20] Helmut Hillebrand. On the generality of the latitudinal diversity gradient. The American Naturalist, 2004.

[21] Kexin Huang, Ying Jin, Ryan Li, Michael Y Li, Emmanuel Candès, and Jure Leskovec. Automated hypothesis validation with agentic sequential falsifications. In ICML, 2025.

[22] Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation? In ICLR, 2024.

[23] Thomas Jiralerspong, Xiaoyin Chen, Yash More, Vedant Shah, and Yoshua Bengio. Efficient causal graph discovery using large language models. arXiv:2402.01207, 2024.

[24] Alison Johnston, Eleni Matechou, and Emily B Dennis. Outstanding challenges and future directions for biodiversity monitoring using citizen science data. Methods in Ecology and Evolution, 2023.

[25] Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. TMLR, 2024.

[26] Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggleton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 2004.

[27] Christian Körner. The use of ‘altitude’ in ecological research. Trends in Ecology and Evolution, 2007.

[28] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 1988.

[29] Mark V. Lomolino, Brett R. Riddle, Robert J. Whittaker, and James H. Brown. Biogeography. Sinauer Associates, 4th edition, 2010.

[30] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv:2408.06292, 2024.

[31] Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking ai scientists in omics data-driven biological research. arXiv:2505.08341, 2025.

[32] Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. In ICLR, 2025.

[33] Annette Menzel, Tim H. Sparks, Nicole Estrella, Elisabeth Koch, Anto Aasa, Rein Ahas, Kerstin Alm-Kübler, Peter Bissolli, Ol’ga Braslavska, Agrita Briede, Frank M. Chmielewski, Zalika Crepinsek, Yannick Curnel, Aslog Dahl, Claudio Defila, Alison Donnelly, Yolanda Filella, Katarzyna Jatczak, Finn Mage, Antonio Mestre, Oyvind Nordli, Josep Penuelas, Pentti ... 2006.

[34] Montague H. C. Neate-Clegg and Morgan W. Tingley. Adult male birds advance spring migratory phenology faster than females and juveniles across north america. Global Change Biology, 2023.

[35] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and dag constraints for learning linear dags. NeurIPS, 2020.

[36] Siba Smarak Panigrahi, Jovana Videnović, and Maria Brbić. Heurekabench: A benchmarking framework for ai co-scientist. In ICLR, 2026.

[37] Yusuf Roohani et al. BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments. In ICLR, 2025.

[38] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005.

[39] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

[40] Leho Tedersoo, Mohammad Bahram, Sergei Põlme, et al. Global diversity and geography of soil fungi. Science, 2014.

[41] N. Vanderhoff, P. Pyle, M. A. Patten, R. Sallabanks, and F. C. James. American robin (Turdus migratorius), version 1.0. In Birds of the World. Cornell Lab of Ornithology, 2020.

[42] Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E Jones, Oisin Mac Aodha, Sara Beery, and Grant Van Horn. Inquire: A natural world text-to-image retrieval benchmark. In NeurIPS - Datasets and Benchmarks, 2024.

[43] Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D Goodman. Hypothesis search: Inductive reasoning with language models. ICLR, 2024.

[44] Zifeng Wang, Benjamin Danek, and Jimeng Sun. Biodsa-1k: Benchmarking data science agents for biomedical research. arXiv:2505.16100, 2025.

[45] Kentwood D. Wells. The Ecology and Behavior of Amphibians. University of Chicago Press, 2007.

[46] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 2019.

[47] Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In NeurIPS, 2018.

[48] Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. In Workshop on NLP for Science (NLP4Science), 2024.

Appendix A Additional results

A.1 Additional ablations

Base LLM comparison. Table A1 (left) shows that DiscoPER is compatible with different backbone LLMs, but the choice of mo...

| # | Dataset | Pattern | Description | Source |
|---|---|---|---|---|
| … | … | … | … | … [33] |
| 7 | 800 | Dandelion early flowering | T. officinale peaks Mar–May | “Seasonal variation in flowering of common dandelion” (Gray et al., …) [16] |
| 8 | 800 | Red Deer northern habitat | C. elaphus concentrated 45–60°N | “Deer of the World: Their Evolution, Behaviour and Ecology” (Geist, 1998) [12] |
| 9 | 800, 50K | Hemisphere season inversion | Seasonal patterns invert between NH and SH | “Response of tree phenology to climate change across Europe” (Chmielewski & Rötzer, 2001) [9] |
| 10 | 50K | Latitudinal diversity gr... | … | … [20] |
| 11 | 50K | Bird latitudinal migration | Birds shift northward spring–summer | “Adult male birds advance spring migratory phenology faster than females and juveniles across North America” (Neate-Clegg et al., …) [34] |
| 12 | 50K | Amphibian spring emergence | Amphibians peak sharply Mar–May | “The Ecology and Behavior of Amphibians” (Wells, 2007) [45] |
| 13 | 50K | Lepidoptera wide latitude | Butterflies span wider range than other insects | “The Structure and Dynamics of Geographic Ranges” (Gaston, …) [11] |
| 14 | 50K | Fungi temperate concentration | Fungi concentrated 40–60°N | “Global diversity and geography of soil fungi” (Tedersoo et al., …) [40] |
| 15 | 50K | Mammal temporal uniformity | Mammals more uniform across months than birds | “Biogeography of time partitioning in mammals” (Bennie et al., …) [4] |
| 16 | 50K | Continental endemism | Certain families show continental endemism | “Biogeography,” 4th ed. (Lomolino et al., 2010) [29] |
| 17 | 50K | Elevation-latitude proxy | Alpine plants at higher latitudes in mid-lat bands | “The use of ‘altitude’ in ecological research” (Körner, 2007) [27] |

… data, even when the LLM’s prior knowledge strongly suggests it should be there....

[56]

Propose ONE specific, testable hypothesis that is DIFFERENT from previous attempts
[57]

If previous hypotheses about a topic were rejected, try a completely different angle
[58]

Focus on unexplored variable combinations
[59]

When images are available, the prompt is augmented with: Visual Hypothesis Augmentation {n_images} sample images from the dataset are attached

Look for interaction effects, threshold effects, or conditional relationships Output a JSON object with: statement, scope, variables, expected_direction, risk_flags. When images are available, the prompt is augmented with: Visual Hypothesis Augmentation {n_images} sample images from the dataset are attached. These are the ACTUAL ecology scene images. Look...
- The grouping variable MUST match what the hypothesis compares (species → species_name, class → class_name, kingdom → kingdom)
- The metric MUST measure what the claim describes (“seasonal shift” = difference in latitude between seasons, not raw latitude)
- The data slice MUST include exactly the populations the claim describes
- If the hypothesis is about a visual property, use visual_attribute_test

Available tools: corr_test, group_diff_test, visual_attribute_test, visual_group_comparison, predictive_test, stratified_retest.

Output a JSON object with: method, feature_spec, dataset_slice_spec.

C.5.3 Reflective accumulation (REFLECT)

The REFLECT agent receives all accumulated cla...
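The test-design constraints listed before the REFLECT heading can be illustrated with a short Python sketch: a plan check against the six named tools, and a "seasonal shift" metric computed as a difference in mean latitude between two seasons rather than as raw latitude. The tool names and plan fields come from the prompt text; the record keys `month` and `latitude`, the season definitions, and the function names are assumptions.

```python
from statistics import mean

# Tool names as listed in the prompt.
AVAILABLE_TOOLS = {
    "corr_test", "group_diff_test", "visual_attribute_test",
    "visual_group_comparison", "predictive_test", "stratified_retest",
}


def check_plan(plan: dict) -> None:
    """Raise ValueError if the plan lacks a field or names an unknown tool."""
    for key in ("method", "feature_spec", "dataset_slice_spec"):
        if key not in plan:
            raise ValueError(f"missing field: {key}")
    if plan["method"] not in AVAILABLE_TOOLS:
        raise ValueError(f"unknown tool: {plan['method']}")


def seasonal_shift(records, season_a=(3, 4, 5), season_b=(6, 7, 8)):
    """Mean latitude in season_b minus mean latitude in season_a.

    The month tuples (spring, summer) are an assumed season definition.
    """
    lat_a = [r["latitude"] for r in records if r["month"] in season_a]
    lat_b = [r["latitude"] for r in records if r["month"] in season_b]
    if not lat_a or not lat_b:
        raise ValueError("both seasons need at least one observation")
    return mean(lat_b) - mean(lat_a)


# Hypothetical observations; keys and values are illustrative only.
records = [
    {"month": 4, "latitude": 30.0},
    {"month": 5, "latitude": 32.0},
    {"month": 7, "latitude": 40.0},
    {"month": 8, "latitude": 42.0},
]

check_plan({"method": "group_diff_test",
            "feature_spec": "latitude by season",
            "dataset_slice_spec": "class_name == 'Aves'"})
print(seasonal_shift(records))
```

On these records the raw mean latitude is 36.0 while the shift is a difference of 10 degrees, which is the distinction the prompt's metric rule draws.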

[1]

iNaturalist

iNaturalist. https://www.inaturalist.org. Accessed on 2026-05-05

2026

[2]

Autodiscovery: Open-ended scientific discovery via Bayesian surprise

Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, et al. Autodiscovery: Open-ended scientific discovery via Bayesian surprise. In NeurIPS, 2025

2025

[3]

Climate change and the long-term northward shift in the African wintering range of the barn swallow Hirundo rustica

Roberto Ambrosini, Diego Rubolini, Anders Pape Møller, Luciano Bani, Jacquie Clark, Zsolt Karcza, Didier Vangeluwe, Chris du Feu, Fernando Spina, and Nicola Saino. Climate change and the long-term northward shift in the African wintering range of the barn swallow Hirundo rustica. Climate Research, 2011

2011

[4]

Biogeography of time partitioning in mammals

Jonathan J. Bennie, James P. Duffy, Richard Inger, and Kevin J. Gaston. Biogeography of time partitioning in mammals. PNAS, 2014

2014

[5]

Climate variation effects on fungal fruiting

Lynne Boddy, Ulf Büntgen, Simon Egli, Alan C Gange, Einar Heegaard, Paul M Kirk, Aqilah Mohammad, and Håvard Kauserud. Climate variation effects on fungal fruiting. Fungal Ecology, 2014

2014

[6]

Autonomous chemical research with large language models

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 2023

2023

[7]

Monarch butterfly orientation: missing pieces of a magnificent puzzle

Lincoln P Brower. Monarch butterfly orientation: missing pieces of a magnificent puzzle. Journal of Experimental Biology, 1996

1996

[8]

Optimal structure identification with greedy search

David Maxwell Chickering. Optimal structure identification with greedy search. JMLR, 2002

2002

[9]

Response of tree phenology to climate change across Europe

Frank-M. Chmielewski and Thomas Rötzer. Response of tree phenology to climate change across Europe. Agricultural and Forest Meteorology, 2001

2001

[10]

Body size evolution in mammals: complexity in tempo and mode

Natalie Cooper and Andy Purvis. Body size evolution in mammals: complexity in tempo and mode. The American Naturalist, 2010

2010

[11]

The Structure and Dynamics of Geographic Ranges

Kevin J. Gaston. The Structure and Dynamics of Geographic Ranges. Oxford University Press, 2003

2003

[12]

Deer of the World: Their Evolution, Behaviour, and Ecology

Valerius Geist. Deer of the World: Their Evolution, Behaviour, and Ecology. Stackpole Books, 1998

1998

[13]

SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning

Alireza Ghafarollahi and Markus J Buehler. SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. Advanced Materials, 2025

2025

[14]

Robin: A multi-agent system for automating scientific discovery

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J Szostkiewicz, Jon M Laurent, Muhammed T Razzak, Andrew D White, Michaela M Hinks, and Samuel G Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv:2505.13400, 2025

2025

[15]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, et al. Towards an AI co-scientist. arXiv:2502.18864, 2025

2025

[16]

Seasonal variation in flowering of common dandelion

Elmer Gray, Eugene M. McGehee, and Don F. Carlisle. Seasonal variation in flowering of common dandelion. Weed Science, 1973

1973

[17]

Blade: Benchmarking language model agents for data-driven science

Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, et al. Blade: Benchmarking language model agents for data-driven science. In EMNLP (Findings), 2024

2024

[18]

Accelerating social science research via agentic hypothesization and experimentation

Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah, and Balaji Krishnamurthy. Accelerating social science research via agentic hypothesization and experimentation. arXiv:2602.07983, 2026

2026

[19]

The extent and consequences of p-hacking in science

Megan L Head, Luke Holman, Rob Lanfear, Andrew T Kahn, and Michael D Jennions. The extent and consequences of p-hacking in science. PLoS Biology, 2015

2015

[20]

On the generality of the latitudinal diversity gradient

Helmut Hillebrand. On the generality of the latitudinal diversity gradient. The American Naturalist, 2004

2004

[21]

Automated hypothesis validation with agentic sequential falsifications

Kexin Huang, Ying Jin, Ryan Li, Michael Y Li, Emmanuel Candès, and Jure Leskovec. Automated hypothesis validation with agentic sequential falsifications. In ICML, 2025

2025

[22]

Can large language models infer causation from correlation?

Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation? In ICLR, 2024

2024

[23]

Efficient causal graph discovery using large language models

Thomas Jiralerspong, Xiaoyin Chen, Yash More, Vedant Shah, and Yoshua Bengio. Efficient causal graph discovery using large language models. arXiv:2402.01207, 2024

2024

[24]

Outstanding challenges and future directions for biodiversity monitoring using citizen science data

Alison Johnston, Eleni Matechou, and Emily B Dennis. Outstanding challenges and future directions for biodiversity monitoring using citizen science data. Methods in Ecology and Evolution, 2023

2023

[25]

Causal reasoning and large language models: Opening a new frontier for causality

Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. TMLR, 2024

2024

[26]

Functional genomic hypothesis generation and experimentation by a robot scientist

Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggleton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 2004

2004

[27]

The use of ‘altitude’ in ecological research

Christian Körner. The use of ‘altitude’ in ecological research. Trends in Ecology and Evolution, 2007

2007

[28]

Local computations with probabilities on graphical structures and their application to expert systems

Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 1988

1988

[29]

Biogeography

Mark V. Lomolino, Brett R. Riddle, Robert J. Whittaker, and James H. Brown. Biogeography. Sinauer Associates, 4th edition, 2010

2010

[30]

The AI Scientist: Towards fully automated open-ended scientific discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv:2408.06292, 2024

2024

[31]

Benchmarking AI scientists in omics data-driven biological research

Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking AI scientists in omics data-driven biological research. arXiv:2505.08341, 2025

2025

[32]

Discoverybench: Towards data-driven discovery with large language models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. In ICLR, 2025

2025

[33]

Sparks, Nicole Estrella, Elisabeth Koch, Anto Aasa, Rein Ahas, Kerstin Alm-Kübler, Peter Bissolli, Ol’ga Braslavska, Agrita Briede, Frank M

Annette Menzel, Tim H. Sparks, Nicole Estrella, Elisabeth Koch, Anto Aasa, Rein Ahas, Kerstin Alm-Kübler, Peter Bissolli, Ol’ga Braslavska, Agrita Briede, Frank M. Chmielewski, Zalika Crepinsek, Yannick Curnel, Aslog Dahl, Claudio Defila, Alison Donnelly, Yolanda Filella, Katarzyna Jatczak, Finn Mage, Antonio Mestre, Oyvind Nordli, Josep Penuelas, Pentti ...

2006

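[34]

Adult male birds advance spring migratory phenology faster than females and juveniles across North America

Montague H. C. Neate-Clegg and Morgan W. Tingley. Adult male birds advance spring migratory phenology faster than females and juveniles across North America. Global Change Biology, 2023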

2023

[35]

On the role of sparsity and DAG constraints for learning linear DAGs

Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. NeurIPS, 2020

2020

[36]

Heurekabench: A benchmarking framework for AI co-scientist

Siba Smarak Panigrahi, Jovana Videnović, and Maria Brbić. Heurekabench: A benchmarking framework for AI co-scientist. In ICLR, 2026

2026

[37]

BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments

Yusuf Roohani et al. BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments. In ICLR, 2025

2025

[38]

Causal protein-signaling networks derived from multiparameter single-cell data

Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005

2005

[39]

Causation, Prediction, and Search

Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000

2000

[40]

Global diversity and geography of soil fungi

Leho Tedersoo, Mohammad Bahram, Sergei Põlme, et al. Global diversity and geography of soil fungi. Science, 2014

2014

[41]

American robin (Turdus migratorius), version 1.0

N. Vanderhoff, P. Pyle, M. A. Patten, R. Sallabanks, and F. C. James. American robin (Turdus migratorius), version 1.0. In Birds of the World. Cornell Lab of Ornithology, 2020

2020

[42]

Inquire: A natural world text-to-image retrieval benchmark

Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E Jones, Oisin Mac Aodha, Sara Beery, and Grant Van Horn. Inquire: A natural world text-to-image retrieval benchmark. In NeurIPS - Datasets and Benchmarks, 2024

2024

[43]

Hypothesis search: Inductive reasoning with language models

Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D Goodman. Hypothesis search: Inductive reasoning with language models. ICLR, 2024

2024

[44]

Biodsa-1k: Benchmarking data science agents for biomedical research

Zifeng Wang, Benjamin Danek, and Jimeng Sun. Biodsa-1k: Benchmarking data science agents for biomedical research. arXiv:2505.16100, 2025

2025

[45]

The Ecology and Behavior of Amphibians

Kentwood D. Wells. The Ecology and Behavior of Amphibians. University of Chicago Press, 2007

2007

[46]

DAG-GNN: DAG structure learning with graph neural networks

Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 2019

2019

[47]

DAGs with NO TEARS: Continuous Optimization for Structure Learning

Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In NeurIPS, 2018

2018

[48]

Hypothesis generation with large language models

Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. In Workshop on NLP for Science (NLP4Science), 2024.

2024

Appendix A Additional results

A.1 Additional ablations

Base LLM comparison. Table A1 (left) shows that DiscoPER is compatible with different backbone LLMs, but the choice of mo...

[33]

7 | 800 | Dandelion early flowering | T. officinale peaks Mar–May | “Seasonal variation in flowering of common dandelion” (Gray et al.) [16]
8 | 800 | Red Deer northern habitat | C. elaphus concentrated 45–60°N | “Deer of the World: Their Evolution, Behaviour and Ecology” (Geist, 1998) [12]
9 | 800, 50K | Hemisphere season inversion | Seasonal patterns invert between NH and SH | “Response of tree phenology to climate change across Europe” (Chmielewski & Rötzer, 2001) [9]
10 | 50K | Latitudinal diversity gr...
