
arxiv: 2607.01131 · v1 · pith:WVZL2AUZnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI

Autonomous Scientific Discovery via Iterative Meta-Reflection

Pith reviewed 2026-07-02 13:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autonomous scientific discovery · large language models · meta-reflection · ecological patterns · hypothesis validation · multimodal data · iNatDisco · causal discovery

The pith

DiscoPER recovers 8 of 9 known ecological patterns by using meta-reflection on its own prior discoveries to guide open-ended hypothesis search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiscoPER, an LLM-based system that generates code to test hypotheses on datasets without any pre-set research questions. It requires every candidate discovery to pass statistical tests for validity. A second-order meta-reflection step periodically treats the system's own findings as data to detect patterns, confounds, and gaps, then steers further exploration away from covered areas. Tool use lets the system pull information from images and other multimodal inputs. On a new benchmark built from peer-reviewed ecological literature, the method recovers eight of nine documented patterns at a 72.7 percent support rate and exceeds classical causal-discovery and plain LLM baselines.

Core claim

DiscoPER performs open-ended research by dynamically generating and executing code to explore datasets. Every proposed discovery must pass statistical testing. A second-order reasoning mechanism periodically analyzes accumulated discoveries as empirical data to identify structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions. Tool use expands the search to multimodal sources such as images. Evaluated on the iNatDisco benchmark with pattern-level ground truth from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines.

What carries the argument

The second-order meta-reflection mechanism that treats prior discoveries as empirical data to detect structural patterns, confounds, and epistemic gaps and then redirects the search.
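One way to picture the gap-detection part of this mechanism is the minimal sketch below. It is an illustration under stated assumptions, not the paper's REFLECT implementation: the function name `reflect`, the record layout, and the rule "suggest the least-tested variable pairs" are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def reflect(discoveries, all_variables, max_suggestions=3):
    """Treat past discoveries as data: count how often each variable pair
    was tested and return the least-covered pairs as exploration guidance.
    (Hypothetical helper; the paper's REFLECT step is LLM-driven.)"""
    coverage = Counter()
    for d in discoveries:
        for pair in combinations(sorted(set(d["variables"])), 2):
            coverage[pair] += 1
    all_pairs = list(combinations(sorted(all_variables), 2))
    # sorted() is stable, so equally covered pairs keep alphabetical order.
    ranked = sorted(all_pairs, key=lambda p: coverage[p])
    return ranked[:max_suggestions]

# Illustrative records only; not results from the paper.
discoveries = [
    {"variables": ["latitude", "month"], "supported": True},
    {"variables": ["latitude", "month"], "supported": False},
    {"variables": ["class_name", "month"], "supported": True},
]
gaps = reflect(discoveries, ["latitude", "month", "class_name"], max_suggestions=1)
print(gaps)
```

Here the pair that has never been tested together is returned first, which is the sense in which accumulated findings steer the search toward uncovered regions.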

If this is right

  • The approach works without any pre-specified research objectives.
  • Second-order meta-reflection improves performance over standard iterative hypothesis generation.
  • Tool use for multimodal inputs enlarges the reachable search space.
  • The system scales with additional data volume.
  • Every discovery is required to pass statistical testing before acceptance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same meta-reflection loop could be applied to multimodal datasets outside ecology to test whether recovery rates remain high.
  • If epistemic-gap detection works as described, the method might systematically surface areas where existing literature is sparse.
  • Combining the framework with richer code-execution sandboxes could allow validation of more complex hypotheses than the current statistical tests cover.

Load-bearing premise

Statistical testing of each proposed discovery is sufficient to guarantee scientific validity and the meta-reflection step does not introduce biases that change the reported recovery rate.

What would settle it

Re-running the full DiscoPER evaluation on the iNatDisco benchmark and obtaining either fewer than eight of the nine known patterns recovered or a hypothesis support rate materially below 72.7%.
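As a rough illustration, the criterion can be written as a check on two numbers. The function name, the example counts, and the `tolerance` that stands in for "materially below" are assumptions made for this sketch; the source does not define them.

```python
def replication_check(n_recovered, n_supported, n_tested,
                      min_recovered=8, reported_rate=0.727, tolerance=0.05):
    """Return (support_rate, holds). `tolerance` is an assumed reading of
    'materially below'; the source gives no numeric margin."""
    support_rate = n_supported / n_tested
    holds = (n_recovered >= min_recovered
             and support_rate >= reported_rate - tolerance)
    return support_rate, holds

# Hypothetical re-run: 8 patterns recovered, 30 of 40 hypotheses supported.
rate, holds = replication_check(n_recovered=8, n_supported=30, n_tested=40)
print(rate, holds)
```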

Figures

Figures reproduced from arXiv: 2607.01131 by Bingchen Zhao, Oisin Mac Aodha, Sara Beery.

Figure 1. We introduce DiscoPER, an iterative approach for autonomous scientific discovery that …
Figure 2. DiscoPER is an iterative scientific discovery system consisting of three core modules: …
Figure 3. Scaling behavior on iNatDisco-50K. (a) Providing more data improves recall and yields more supported insights. (b) More model iterations increases recall but the support rate decreases as the model moves on from easy hypotheses and starts to propose more speculative ones. … iNatDisco-50K) and support rate, confirming that REFLECT not only broadens what the system investigates but also improves the quality of…
Figure 4. Experiments on our iNatDisco-800-CF counterfactual dataset.
Figure 5. Ablation of reflection. Left: distribution of generated hypothesis with and without REFLECT. Without reflection, hypotheses are dominated by simple pairwise comparisons, while REFLECT produces a broader set of seasonal, interaction, visual, and correlation-based hypotheses. Right: examples of guidance produced by REFLECT, including gap detection, compound hypothesis generation, and confound detection. Thes…
Figure 6. Examples of vision-grounded discoveries produced by DiscoPER. DiscoPER can use visual …
Original abstract

Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data, and confirms the benefits of second-order meta-reflection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DiscoPER, an LLM-powered autonomous discovery framework that performs open-ended research via dynamic code generation and execution on datasets (without pre-specified objectives), requires statistical testing for every proposed discovery, adds a second-order meta-reflection mechanism that periodically treats accumulated discoveries as data to identify structural patterns, confounds, and epistemic gaps, and incorporates multimodal tool use for hypotheses involving images and other non-structured inputs. It presents the new iNatDisco multimodal ecological benchmark whose pattern-level ground truth is drawn from peer-reviewed literature, and reports that DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate while outperforming classical causal discovery and LLM-guided baselines; ablations indicate scaling with data volume and benefit from the meta-reflection component.

Significance. If the reported recovery rate and outperformance prove robust after controls for pretraining effects and full methodological disclosure, the work would be significant for demonstrating a concrete mechanism (iterative second-order reflection) that expands search beyond isolated hypothesis generation and for releasing a new benchmark with literature-derived ground truth. The emphasis on statistical validation and multimodal integration addresses two recurring limitations in current autonomous-discovery systems.

major comments (2)
  1. [Abstract] Abstract (evaluation paragraph): the central claim that DiscoPER recovers 8 of 9 literature-derived patterns through its data-driven mechanisms (dynamic code execution, statistical testing, and meta-reflection) is load-bearing, yet no controls are described to rule out parametric recall from pretraining on the same peer-reviewed sources that supplied the ground-truth patterns (e.g., knowledge-cutoff models, data-only baselines that disable LLM parametric knowledge, or leakage audits).
  2. [Abstract] Abstract (framework and evaluation paragraphs): the 72.7% hypothesis support rate and outperformance statements rest on unspecified statistical procedures, data exclusion rules, error analysis, and exact implementation details; without these it is impossible to verify that the reported figures survive scrutiny or that the meta-reflection step does not introduce bias.
minor comments (2)
  1. The phrase 'second-order reasoning mechanism' is introduced without a concise formal definition or pseudocode sketch in the abstract or early sections, making it harder for readers to distinguish it from standard iterative prompting.
  2. The iNatDisco benchmark description would benefit from an explicit statement of how many images, metadata fields, and literature sources are included, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments highlight important issues around pretraining controls and methodological transparency in the abstract. We address each below with specific plans for revision.

Point-by-point responses
  1. Referee: [Abstract] Abstract (evaluation paragraph): the central claim that DiscoPER recovers 8 of 9 literature-derived patterns through its data-driven mechanisms (dynamic code execution, statistical testing, and meta-reflection) is load-bearing, yet no controls are described to rule out parametric recall from pretraining on the same peer-reviewed sources that supplied the ground-truth patterns (e.g., knowledge-cutoff models, data-only baselines that disable LLM parametric knowledge, or leakage audits).

    Authors: We agree this is a substantive concern for claims of data-driven discovery. The iNatDisco patterns were drawn from peer-reviewed sources post-dating common training cutoffs where possible, but we did not include explicit controls such as leakage audits or non-LLM baselines. In revision we will add (1) a new ablation using a purely statistical baseline that disables LLM parametric knowledge and (2) a short discussion of potential leakage risks with the benchmark construction details. These additions will appear in Section 4 and a new appendix.
    Revision: yes

  2. Referee: [Abstract] Abstract (framework and evaluation paragraphs): the 72.7% hypothesis support rate and outperformance statements rest on unspecified statistical procedures, data exclusion rules, error analysis, and exact implementation details; without these it is impossible to verify that the reported figures survive scrutiny or that the meta-reflection step does not introduce bias.

    Authors: The statistical procedures (hypothesis testing via permutation tests and bootstrap confidence intervals), data exclusion rules, error analysis, and implementation details are fully specified in Sections 3.2, 4.1, and Appendix B. However, the abstract is too terse. We will revise the abstract to include a one-sentence summary of the support-rate calculation and add a compact table in the main text summarizing per-pattern support rates, exclusion counts, and meta-reflection impact. This addresses transparency without altering the reported numbers.
    Revision: partial
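The rebuttal names permutation tests and bootstrap confidence intervals. The paper's exact procedure sits in sections not reproduced on this page, so the following is only a generic, standard-library sketch of both techniques for a difference in group means. The function names, iteration counts, seeds, and sample values are assumptions.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one permutation.
    return (hits + 1) / (n_perm + 1)

def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lo = diffs[round((1 - level) / 2 * n_boot)]
    hi = diffs[round((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Made-up latitudes for two seasons; illustration only.
spring = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
summer = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0]
p_value = permutation_test(spring, summer)
ci_low, ci_high = bootstrap_ci(spring, summer)
print(p_value, ci_low, ci_high)
```

A hypothesis such as "observations sit further north in summer" would count as supported here only if the p-value falls below the chosen threshold and the interval excludes zero.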

Circularity Check

0 steps flagged

No circularity; empirical benchmark result with independent evaluation

Full rationale

The paper describes an LLM-based framework (DiscoPER) and reports an empirical recovery rate (8 of 9 patterns, 72.7% support) on the iNatDisco benchmark whose ground truth is drawn from external peer-reviewed literature. No equations, fitted parameters, or first-principles derivations are present that reduce the reported metric to a quantity defined by the same inputs. The evaluation relies on statistical testing of proposed discoveries and ablations, which are external to any self-referential construction. Self-citation is not invoked as a load-bearing uniqueness theorem or ansatz. The result is therefore self-contained against the benchmark and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; limited visibility into implementation details means the ledger captures only the high-level assumptions stated or implied in the summary.

axioms (2)
  • [domain assumption] Large language models can reliably generate and execute code for statistical hypothesis testing on real datasets.
    Required for the core loop of hypothesis generation and validation described in the abstract.
  • [ad hoc to paper] Periodic second-order analysis of accumulated discoveries can identify confounds and epistemic gaps that productively redirect future exploration.
    This is the load-bearing innovation of the meta-reflection component.
invented entities (2)
  • DiscoPER framework (no independent evidence)
    purpose: Autonomous open-ended discovery via code execution and meta-reflection
    The main system introduced by the paper.
  • iNatDisco benchmark (no independent evidence)
    purpose: Multimodal ecological dataset with peer-reviewed pattern ground truth for evaluation
    New resource created to test the system.

pith-pipeline@v0.9.1-grok · 5790 in / 1613 out tokens · 45973 ms · 2026-07-02T13:39:55.783408+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 7 canonical work pages · 3 internal anchors

  [1] iNaturalist. https://www.inaturalist.org. Accessed on 2026-05-05.
  [2] Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, et al. Autodiscovery: Open-ended scientific discovery via bayesian surprise. In NeurIPS, 2025.
  [3] Roberto Ambrosini, Diego Rubolini, Anders Pape Møller, Luciano Bani, Jacquie Clark, Zsolt Karcza, Didier Vangeluwe, Chris du Feu, Fernando Spina, and Nicola Saino. Climate change and the long-term northward shift in the african wintering range of the barn swallow hirundo rustica. Climate Research, 2011.
  [4] Jonathan J. Bennie, James P. Duffy, Richard Inger, and Kevin J. Gaston. Biogeography of time partitioning in mammals. PNAS, 2014.
  [5] Lynne Boddy, Ulf Büntgen, Simon Egli, Alan C Gange, Einar Heegaard, Paul M Kirk, Aqilah Mohammad, and Håvard Kauserud. Climate variation effects on fungal fruiting. Fungal Ecology, 2014.
  [6] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 2023.
  [7] Lincoln P Brower. Monarch butterfly orientation: missing pieces of a magnificent puzzle. Journal of Experimental Biology, 1996.
  [8] David Maxwell Chickering. Optimal structure identification with greedy search. JMLR, 2002.
  [9] Frank-M. Chmielewski and Thomas Rötzer. Response of tree phenology to climate change across europe. Agricultural and Forest Meteorology, 2001.
  [10] Natalie Cooper and Andy Purvis. Body size evolution in mammals: complexity in tempo and mode. The American Naturalist, 2010.
  [11] Kevin J. Gaston. The Structure and Dynamics of Geographic Ranges. Oxford University Press, 2003.
  [12] Valerius Geist. Deer of the World: Their Evolution, Behaviour, and Ecology. Stackpole Books, 1998.
  [13] Alireza Ghafarollahi and Markus J Buehler. SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. Advanced Materials, 2025.
  [14] Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J Szostkiewicz, Jon M Laurent, Muhammed T Razzak, Andrew D White, Michaela M Hinks, and Samuel G Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv:2505.13400, 2025.
  [15] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, et al. Towards an AI co-scientist. arXiv:2502.18864, 2025.
  [16] Elmer Gray, Eugene M. McGehee, and Don F. Carlisle. Seasonal variation in flowering of common dandelion. Weed Science, 1973.
  [17] Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, et al. Blade: Benchmarking language model agents for data-driven science. In EMNLP (Findings), 2024.
  [18] Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah, and Balaji Krishnamurthy. Accelerating social science research via agentic hypothesization and experimentation. arXiv:2602.07983, 2026.
  [19] Megan L Head, Luke Holman, Rob Lanfear, Andrew T Kahn, and Michael D Jennions. The extent and consequences of p-hacking in science. PLoS Biology, 2015.
  [20] Helmut Hillebrand. On the generality of the latitudinal diversity gradient. The American Naturalist, 2004.
  [21] Kexin Huang, Ying Jin, Ryan Li, Michael Y Li, Emmanuel Candès, and Jure Leskovec. Automated hypothesis validation with agentic sequential falsifications. In ICML, 2025.
  [22] Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation? In ICLR, 2024.
  [23] Thomas Jiralerspong, Xiaoyin Chen, Yash More, Vedant Shah, and Yoshua Bengio. Efficient causal graph discovery using large language models. arXiv:2402.01207, 2024.
  [24] Alison Johnston, Eleni Matechou, and Emily B Dennis. Outstanding challenges and future directions for biodiversity monitoring using citizen science data. Methods in Ecology and Evolution, 2023.
  [25] Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. TMLR, 2024.
  [26] Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggleton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 2004.
  [27] Christian Körner. The use of ‘altitude’ in ecological research. Trends in Ecology and Evolution, 2007.
  [28] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 1988.
  [29] Mark V. Lomolino, Brett R. Riddle, Robert J. Whittaker, and James H. Brown. Biogeography. Sinauer Associates, 4th edition, 2010.
  [30] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv:2408.06292, 2024.
  [31] Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking ai scientists in omics data-driven biological research. arXiv:2505.08341, 2025.
  [32] Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models. In ICLR, 2025.
  [33] Annette Menzel, Tim H. Sparks, Nicole Estrella, Elisabeth Koch, Anto Aasa, Rein Ahas, Kerstin Alm-Kübler, Peter Bissolli, Ol’ga Braslavska, Agrita Briede, Frank M. Chmielewski, Zalika Crepinsek, Yannick Curnel, Aslog Dahl, Claudio Defila, Alison Donnelly, Yolanda Filella, Katarzyna Jatczak, Finn Mage, Antonio Mestre, Oyvind Nordli, Josep Penuelas, Pentti ...
  [34] Montague H. C. Neate-Clegg and Morgan W. Tingley. Adult male birds advance spring migratory phenology faster than females and juveniles across north america. Global Change Biology, 2023.
  [35] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and dag constraints for learning linear dags. NeurIPS, 2020.
  [36] Siba Smarak Panigrahi, Jovana Videnović, and Maria Brbić. Heurekabench: A benchmarking framework for ai co-scientist. In ICLR, 2026.
  [37] Yusuf Roohani et al. BioDiscoveryAgent: An AI agent for designing genetic perturbation experiments. In ICLR, 2025.
  [38] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005.
  [39] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
  [40] Leho Tedersoo, Mohammad Bahram, Sergei Põlme, et al. Global diversity and geography of soil fungi. Science, 2014.
  [41] N. Vanderhoff, P. Pyle, M. A. Patten, R. Sallabanks, and F. C. James. American robin (Turdus migratorius), version 1.0. In Birds of the World. Cornell Lab of Ornithology, 2020.
  [42] Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E Jones, Oisin Mac Aodha, Sara Beery, and Grant Van Horn. Inquire: A natural world text-to-image retrieval benchmark. In NeurIPS - Datasets and Benchmarks, 2024.
  [43] Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D Goodman. Hypothesis search: Inductive reasoning with language models. ICLR, 2024.
  [44] Zifeng Wang, Benjamin Danek, and Jimeng Sun. Biodsa-1k: Benchmarking data science agents for biomedical research. arXiv:2505.16100, 2025.
  [45] Kentwood D. Wells. The Ecology and Behavior of Amphibians. University of Chicago Press, 2007.
  [46] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In ICML, 2019.
  [47] Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In NeurIPS, 2018.

  [48] Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. Hypothesis generation with large language models. In Workshop on NLP for Science (NLP4Science), 2024.

Appendix text captured with the references (entries 48 to 63 as extracted)

Appendix A Additional results. A.1 Additional ablations. Base LLM comparison. Table A1 (left) shows that DiscoPER is compatible with different backbone LLMs, but the choice of mo...

Benchmark pattern rows (row number | dataset | pattern | description | source | reference):
  • … [33]
  • 7 | 800 | Dandelion early flowering | T. officinale peaks Mar–May | “Seasonal variation in flowering of common dandelion” (Gray et al., …) | [16]
  • 8 | 800 | Red Deer northern habitat | C. elaphus concentrated 45–60°N | “Deer of the World: Their Evolution, Behaviour and Ecology” (Geist, 1998) | [12]
  • 9 | 800, 50K | Hemisphere season inversion | Seasonal patterns invert between NH and SH | “Response of tree phenology to climate change across Europe” (Chmielewski & Rötzer, 2001) | [9]
  • 10 | 50K | Latitudinal diversity gr... | [20]
  • 11 | 50K | Bird latitudinal migration | Birds shift northward spring–summer | “Adult male birds advance spring migratory phenology faster than females and juveniles across North America” (Neate-Clegg et al., …) | [34]
  • 12 | 50K | Amphibian spring emergence | Amphibians peak sharply Mar–May | “The Ecology and Behavior of Amphibians” (Wells, 2007) | [45]
  • 13 | 50K | Lepidoptera wide latitude | Butterflies span wider range than other insects | “The Structure and Dynamics of Geographic Ranges” (Gaston, …) | [11]
  • 14 | 50K | Fungi temperate concentration | Fungi concentrated 40–60°N | “Global diversity and geography of soil fungi” (Tedersoo et al., …) | [40]
  • 15 | 50K | Mammal temporal uniformity | Mammals more uniform across months than birds | “Biogeography of time partitioning in mammals” (Bennie et al., …) | [4]
  • 16 | 50K | Continental endemism | Certain families show continental endemism | “Biogeography,” 4th ed. (Lomolino et al., 2010) | [29]
  • 17 | 50K | Elevation-latitude proxy | Alpine plants at higher latitudes in mid-lat bands | “The use of ‘altitude’ in ecological research” (Körner, 2007) | [27]

… data, even when the LLM’s prior knowledge strongly suggests it should be there....

Prompt excerpt (hypothesis proposal):
  • Propose ONE specific, testable hypothesis that is DIFFERENT from previous attempts
  • If previous hypotheses about a topic were rejected, try a completely different angle
  • Focus on unexplored variable combinations
  • Look for interaction effects, threshold effects, or conditional relationships
Output a JSON object with: statement, scope, variables, expected_direction, risk_flags.
When images are available, the prompt is augmented with: Visual Hypothesis Augmentation. {n_images} sample images from the dataset are attached. These are the ACTUAL ecology scene images. Look...

Prompt excerpt (test planning):
  • The grouping variable MUST match what the hypothesis compares (species → species_name, class → class_name, kingdom → kingdom)
  • The metric MUST measure what the claim describes (“seasonal shift” = difference in latitude between seasons, not raw latitude)
  • The data slice MUST include exactly the populations the claim describes
  • If the hypothesis is about a visual property, use visual_attribute_test
Available tools: corr_test, group_diff_test, visual_attribute_test, visual_group_comparison, predictive_test, stratified_retest.
Output a JSON object with: method, feature_spec, dataset_slice_spec.

C.5.3 Reflective accumulation (REFLECT)
The REFLECT agent receives all accumulated cla...
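The test-planning excerpt above names six tools and three required output fields. A consumer of that JSON could validate it along the following lines. This is a hedged sketch: the function `parse_test_plan` and the contents of `feature_spec` and `dataset_slice_spec` are hypothetical, and only the tool names and top-level keys come from the extracted prompt.

```python
import json

# Tool names and required keys as listed in the extracted prompt.
ALLOWED_METHODS = {
    "corr_test", "group_diff_test", "visual_attribute_test",
    "visual_group_comparison", "predictive_test", "stratified_retest",
}
REQUIRED_KEYS = {"method", "feature_spec", "dataset_slice_spec"}

def parse_test_plan(raw):
    """Parse a JSON test plan and reject missing keys or unknown tools."""
    plan = json.loads(raw)
    missing = REQUIRED_KEYS - plan.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if plan["method"] not in ALLOWED_METHODS:
        raise ValueError(f"unknown method: {plan['method']}")
    return plan

# The nested field contents below are invented for illustration.
raw_plan = json.dumps({
    "method": "group_diff_test",
    "feature_spec": {"metric": "latitude", "group_by": "class_name"},
    "dataset_slice_spec": {"kingdom": "Animalia"},
})
plan = parse_test_plan(raw_plan)
print(plan["method"])
```

Rejecting unknown tool names before execution is one simple way to keep generated plans inside the declared tool set.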