
arxiv: 2606.29182 · v1 · pith:BM5QN3WCnew · submitted 2026-06-28 · 💻 cs.AI · cs.CL · cs.LG

Evidence-Informed LLM Beliefs for Continual Scientific Discovery

Pith reviewed 2026-06-30 07:55 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords LLM, scientific discovery, Bayesian surprise, non-stationary beliefs, retrieval-augmented generation, hypothesis search, continual learning

The pith

LLMs compute better discovery rewards when priors update with evidence from past hypotheses instead of treating surprise as fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AutoDiscovery-style methods treat Bayesian surprise as a static property of each hypothesis, whereas real scientific reasoning updates beliefs continuously, so some early surprises become expected once prior findings are absorbed. Replacing static priors with evidence-informed beliefs, updated through embedding-based retrieval over earlier discoveries, reveals 37.5 percent of the original static surprisals as spurious. The authors then alter the search loop with belief-update filtering and diversity maximization so that only hypotheses that remain surprising under the updated beliefs receive reward. Across five domains this change raises total accumulated non-stationary surprisal by 30.62 percent on average.

Core claim

Evidence-informed LLM beliefs, formed by retrieval-augmented generation over prior discoveries, replace static surprisal with non-stationary surprisal; this correctly flags 37.5 percent of static values as spurious and, when paired with belief-update filtering plus diversity maximization, raises accumulated non-stationary surprisal by 30.62 percent relative to the unmodified search procedure.
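One hedged way to write the contrast between the two quantities is given below. The notation is assumed here for illustration and is not taken from the paper: $h_t$ is the hypothesis tested at step $t$, $P^{\text{post}}(h_t)$ is the LLM's belief after seeing evidence for $h_t$, $C_t$ is the context retrieved from discoveries made before step $t$, and $D$ is a generic belief-shift measure.

```latex
\begin{align}
S_{\text{static}}(h_t)   &= D\big(P^{\text{post}}(h_t) \,\big\|\, P(h_t)\big) \\
S_{\text{non-stat}}(h_t) &= D\big(P^{\text{post}}(h_t) \,\big\|\, P(h_t \mid C_t)\big)
\end{align}
```

Under this reading, a static surprisal is spurious when $S_{\text{static}}(h_t)$ counts as surprising but $S_{\text{non-stat}}(h_t)$ does not, because the retrieved context $C_t$ has already moved the prior close to the posterior.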

What carries the argument

Evidence-informed LLM beliefs produced by embedding-based retrieval-augmented generation over prior discoveries, which supplies updated priors for computing non-stationary surprisal and supplies the filter used in modified search.
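A minimal sketch of the retrieval step, assuming prior discoveries are stored as (text, embedding) pairs and ranked by cosine similarity to the new hypothesis. The function names, the prompt wording, and the default `k=5` are illustrative assumptions (the Figure 6 caption reports k = 5 as one of the best-performing settings); the paper's actual embedding model and prompt are not shown on this page.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_context(query: Sequence[float],
                  discoveries: List[Tuple[str, Sequence[float]]],
                  k: int = 5) -> List[str]:
    """Return the texts of the k prior discoveries most similar to the query."""
    ranked = sorted(discoveries, key=lambda d: cosine(query, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prior_prompt(hypothesis: str, context: List[str]) -> str:
    """Hypothetical prompt that elicits a prior belief given retrieved evidence."""
    evidence = "\n".join(f"- {c}" for c in context) or "- (none yet)"
    return (
        "Previously verified findings:\n"
        f"{evidence}\n\n"
        f"Hypothesis: {hypothesis}\n"
        "Before seeing any new evidence, how likely is this hypothesis to be true?"
    )
```

The prompt returned by `build_prior_prompt` would be sent to the LLM in place of the bare hypothesis, so the elicited prior reflects what earlier experiments already established.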

If this is right

  • Search must filter out hypotheses whose surprisal disappears once prior evidence is incorporated.
  • Diversity maximization is required in addition to belief filtering to sustain high non-stationary surprisal over long horizons.
  • Non-stationary surprisal becomes the operative reward signal once beliefs are allowed to evolve.
  • The same pattern holds across five distinct discovery domains.
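The two search changes named in the bullets above can be sketched as a single selection step. This is an illustration under stated assumptions, not the authors' procedure: the filter is assumed to drop candidates whose prior moves by more than a hypothetical `max_shift` once past evidence is retrieved (the 0.2 default echoes the shift threshold mentioned in the Figure 8 caption), and diversity maximization is approximated by a greedy rule that picks the candidate least similar to anything already explored. Embeddings are assumed to be unit-normalized so that a dot product equals cosine similarity.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    text: str
    static_prior: float        # belief elicited without any context
    informed_prior: float      # belief elicited with retrieved evidence
    embedding: Sequence[float]  # assumed unit-normalized


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def belief_update_filter(candidates: List[Candidate],
                         max_shift: float = 0.2) -> List[Candidate]:
    """Keep candidates whose prior barely moves when past evidence is added."""
    return [c for c in candidates
            if abs(c.informed_prior - c.static_prior) <= max_shift]


def redundancy(candidate: Candidate,
               explored: List[Sequence[float]]) -> float:
    """Highest similarity between the candidate and any explored hypothesis."""
    return max((dot(candidate.embedding, e) for e in explored), default=0.0)


def select_next(candidates: List[Candidate],
                explored: List[Sequence[float]],
                max_shift: float = 0.2) -> Optional[Candidate]:
    """Filter by belief shift, then pick the least redundant survivor."""
    kept = belief_update_filter(candidates, max_shift)
    if not kept:
        return None
    return min(kept, key=lambda c: redundancy(c, explored))
```

A candidate that past evidence already explains is removed by the filter, and among the survivors the one farthest from explored territory is chosen next.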

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Static surprise metrics may systematically over-count progress in any iterative LLM system that does not refresh its priors.
  • The same retrieval-based updating mechanism could be applied to sequential reasoning tasks outside scientific discovery.
  • The 30.62 percent gain suggests that redundancy avoidance, not merely better base models, is a first-order lever for continual discovery performance.

Load-bearing premise

Embedding-based retrieval over prior discoveries supplies an accurate forecast of the posterior the LLM would reach if it received direct evidence for the new hypothesis.

What would settle it

A direct test in which the LLM is given full evidence for a held-out set of hypotheses and the fraction of static surprisals still counted as spurious after the update is compared against the reported 37.5 percent.
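The comparison described above reduces to a simple count, sketched here. The record layout, the absolute-difference surprise rule, and the `threshold` value are all assumptions made for illustration; the paper's own surprisal criterion may differ.

```python
from typing import List, Tuple

# Each record: (static_prior, informed_prior, posterior), all in [0, 1].
Record = Tuple[float, float, float]


def is_surprising(prior: float, posterior: float, threshold: float = 0.3) -> bool:
    """Hypothetical rule: a belief shift of at least `threshold` is a surprise."""
    return abs(posterior - prior) >= threshold


def spurious_fraction(records: List[Record], threshold: float = 0.3) -> float:
    """Share of static surprisals that vanish under the evidence-informed prior."""
    static_hits = [r for r in records if is_surprising(r[0], r[2], threshold)]
    if not static_hits:
        return 0.0
    spurious = [r for r in static_hits if not is_surprising(r[1], r[2], threshold)]
    return len(spurious) / len(static_hits)
```

Running this once with retrieval-based priors and once with priors elicited after the LLM has seen full evidence would give the two fractions to compare against the reported 37.5 percent.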

Figures

Figures reproduced from arXiv: 2606.29182 by Andrew McCallum, Ashish Sabharwal, Bodhisattwa Prasad Majumder, Dhruv Agarwal, Peter Clark, Reece Adamson.

Figure 1: Static vs. non-stationary beliefs. Belief distributions for 7500 hypotheses found by AutoDiscovery across five discovery domains. AutoDiscovery uses static beliefs to score hypotheses using an unchanged LLM prior, which causes discoveries already implied by past evidence from search to spuriously appear novel. Evidence-informed LLM priors move the reference belief state towards the eventual posterior and a…
Figure 2: Evidence-informed priors reduce total surprisal. (a) Comparing context-construction methods, ICL with top-k retrieval yields the lowest total surprisal T, indicating that retrieved evidence helps align the prior with the eventual posterior. (b) Decomposing the effect of Ctop-k relative to the static prior shows the proportion of surprising hypotheses from static priors that were reduced (53.5%) or newly i…
Figure 3: Discovery with static vs. non-stationary beliefs. The trajectories are nearly identical, indicating that replacing the reward alone is insufficient to make the original search respond to non-stationary beliefs. To address this, we consider ways to make the search more sensitive to evidence-informed prior, specifically via two mechanisms that directly guide the search during the hypothesis selection step.…
Figure 4: Evidence-informed search improves non-stationary discovery. Search performance over 5 datasets and 3 repeat runs with n = 500 experiments. (a) Evidence-informed search, which combines belief-update filtering with diversity maximization, outperforms standard AutoDiscovery search across reward and deduplication variants. Our method with non-stationary beliefs and online de-duplication performs best overall, …
Figure 5: Evidence-informed search improves semantic diversity. Our method produces hypotheses with lower average pairwise cosine similarity than standard search, indicating broader semantic coverage. Switching from static to non-stationary beliefs alone does not change semantic diversity. Analyzing diversity. Next, we analyze the semantic diversity of hypotheses found by each search variant. First, we see that…
Figure 6: Effect of k in Ctop-k. We vary the number of retrieved prior discoveries used to construct the evidence-informed context and report both the total number of surprisals in the run and the average number of input+output tokens used. We find that increasing k does not monotonically improve performance, instead showing best performance at k = 5 and k = 25, while saving 14.34× and 8.45× fewer tokens, respective…
Figure 7: Performance-cost tradeoff at different levels of LLM reasoning. We vary the GPT-5-mini reasoning effort setting from minimal to high and plot the reduction in total surprisals as compared to a static beliefs run against the average number of input+output tokens used. Higher reasoning effort can improve surprisal reduction but increases inference cost, allowing us to select the low setting, which is at the …
Figure 8: Proportion of non-stationary surprisals as a function of prior belief shift. Across hypotheses generated from 5 datasets and 3 repeat runs using static search, we find that there is a decreasing trend, which emerges at a shift threshold of 0.2, where as belief shift increases, the proportion of non-stationary surprisals goes down. This indicates that when beliefs significantly update with evidence from pas…
Figure 9: Search improvements under non-stationary surprisal evaluation. Search variants by accumulated non-stationary surprisal count across 5 discovery domains and 3 repeat runs. Our proposed method combining belief-update filtering with diversity maximization (with online-deduplication) shows the best performance across all methods, yielding a 30.63% gain (≈ 41 surprisals) over the original static-reward search …
Figure 10: Evidence-informed search improves semantic diversity. We compare the diversity of hypotheses produced by standard search and Evidence-Informed Search across static and non-stationary belief variants, with and without online de-duplication. As expected, adding online de-duplication increases the number of unique hypotheses across variants. Uniqueness alone does not capture semantic diversity: hypotheses …
Original abstract

Open-ended scientific discovery with large language models (LLMs) increasingly operates as a long-horizon loop of hypothesis search and verification, where a reward signal guides which hypotheses to test next. A notable recent example is AutoDiscovery, which uses "Bayesian surprise" - the belief shift an LLM undergoes after observing evidence for a hypothesis - as both a discovery metric and a reward for search. We first observe that AutoDiscovery treats surprisal as a static quantity, while surprisal in human reasoning is non-stationary - it is defined relative to beliefs that evolve with experience, a prerequisite for continual scientific discovery. We address this mismatch with evidence-informed LLM beliefs: priors updated with evidence from previous hypotheses to compute non-stationary surprisal for new hypotheses. We compare in-context belief-updating mechanisms and find that embedding-based retrieval-augmented generation over prior discoveries best anticipates eventual posteriors, identifying 37.5% of static surprisals as spurious. We then modify search to avoid these spurious rewards and prioritize hypotheses that remain surprising under non-stationary beliefs. Concretely, we introduce two complementary changes to the original search procedure: belief-update filtering and diversity maximization. Across five discovery domains, our method increases accumulated non-stationary surprisal by 30.62% on average compared to the original search procedure, demonstrating that continual scientific discovery with LLMs requires not only better belief measurement but also search procedures that avoid redundancy and encourage diversity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that AutoDiscovery's use of static Bayesian surprise is mismatched to continual scientific discovery, where surprisal should be non-stationary relative to evolving beliefs. It introduces evidence-informed LLM beliefs via in-context updates over prior discoveries, finds that embedding-based RAG best anticipates eventual posteriors (flagging 37.5% of static surprisals as spurious), and modifies the search procedure with belief-update filtering plus diversity maximization. Across five domains this yields a 30.62% average increase in accumulated non-stationary surprisal relative to the original procedure.

Significance. If the quantitative result and the RAG-based justification hold under robustness checks, the work would usefully highlight that belief measurement alone is insufficient and that search must actively avoid redundancy; the explicit comparison of update mechanisms and the two concrete search modifications constitute a clear, testable contribution to LLM-driven discovery loops.

major comments (3)
  1. [Abstract] The central quantitative claim (30.62% increase in accumulated non-stationary surprisal) is reported without error bars, per-domain breakdowns, dataset sizes, or verification across random seeds; this directly undermines confidence in the reported superiority of the modified search.
  2. [Belief-updating comparison (likely §3)] The section comparing in-context belief-updating mechanisms: the claim that embedding-based RAG 'best anticipates eventual posteriors' (thereby identifying 37.5% spurious static surprisals) is load-bearing for both the non-stationary framing and the subsequent search modifications, yet no controls for prompt sensitivity, LLM dependence in posterior construction, or alternative retrieval methods are described.
  3. [Search modification experiments (likely §4)] The experimental results on modified search: the 30.62% gain is measured against the authors' own baseline procedure after the two changes (belief-update filtering and diversity maximization) have been introduced; without an ablation isolating each change or a comparison to an external non-stationary baseline, attribution of the gain remains unclear.
minor comments (2)
  1. [Introduction / Preliminaries] Notation for non-stationary surprisal versus static Bayesian surprise should be introduced with an explicit equation early in the paper to avoid ambiguity when the two are contrasted.
  2. [Experimental setup] The five discovery domains are referenced but not listed with their characteristics or citation; adding a short table would improve reproducibility.
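The per-domain breakdown and error bars requested in major comment 1 amount to a small calculation, sketched below. The input layout (a mapping from domain name to per-seed pairs of baseline and modified surprisal counts) is hypothetical, and the standard error of the mean is one reasonable choice of error bar among several.

```python
import math
import statistics
from typing import Dict, List, Tuple


def percent_gain(baseline: float, modified: float) -> float:
    """Relative gain of the modified search over the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (modified - baseline) / baseline


def summarize_gains(
    runs: Dict[str, List[Tuple[float, float]]]
) -> Dict[str, Tuple[float, float]]:
    """Map each domain to (mean gain, standard error) across its seeds."""
    summary = {}
    for domain, pairs in runs.items():
        gains = [percent_gain(b, m) for b, m in pairs]
        mean = statistics.mean(gains)
        # Sample standard deviation needs at least two seeds.
        sem = statistics.stdev(gains) / math.sqrt(len(gains)) if len(gains) > 1 else 0.0
        summary[domain] = (mean, sem)
    return summary
```

Reporting the tuple for each of the five domains alongside the pooled average would show whether the 30.62% figure is driven by one domain or holds broadly.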

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important aspects for improving the clarity and robustness of our results. We respond to each major comment below.

  1. Referee: [Abstract] The central quantitative claim (30.62% increase in accumulated non-stationary surprisal) is reported without error bars, per-domain breakdowns, dataset sizes, or verification across random seeds; this directly undermines confidence in the reported superiority of the modified search.

    Authors: We agree that the presentation of the quantitative results can be strengthened with additional details. In the revised version, we will report per-domain breakdowns, include the sizes of the datasets used in each of the five domains, and provide results from multiple random seeds with corresponding error bars to better support the 30.62% average increase. revision: yes

  2. Referee: [Belief-updating comparison (likely §3)] The section comparing in-context belief-updating mechanisms: the claim that embedding-based RAG 'best anticipates eventual posteriors' (thereby identifying 37.5% spurious static surprisals) is load-bearing for both the non-stationary framing and the subsequent search modifications, yet no controls for prompt sensitivity, LLM dependence in posterior construction, or alternative retrieval methods are described.

    Authors: The referee is correct that robustness to prompt variations and LLM choice would increase confidence in the finding. We will conduct additional experiments in the revision to test sensitivity to different prompt formulations and alternative LLMs for constructing the posteriors. We will also compare embedding-based RAG to other retrieval approaches such as BM25 to confirm its performance. revision: yes

  3. Referee: [Search modification experiments (likely §4)] The experimental results on modified search: the 30.62% gain is measured against the authors' own baseline procedure after the two changes (belief-update filtering and diversity maximization) have been introduced; without an ablation isolating each change or a comparison to an external non-stationary baseline, attribution of the gain remains unclear.

    Authors: We will add an ablation study in the revised manuscript to isolate the contribution of belief-update filtering versus diversity maximization to the overall gain. For an external baseline, the original AutoDiscovery procedure serves as the direct comparison point from the literature; we will clarify this and discuss why direct implementation of other non-stationary methods may not be straightforward, while acknowledging this as a limitation. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is empirically grounded

full rationale

The paper defines non-stationary surprisal via evidence-informed belief updates, performs an empirical comparison of in-context mechanisms against eventual posteriors to select embedding-based RAG (flagging 37.5% spurious), applies two search modifications, and reports a 30.62% average increase in the target metric versus the original AutoDiscovery procedure across five domains. No quoted step reduces a claimed prediction or result to a fitted parameter, self-citation, or definitional equivalence by construction; the central improvement is measured against an external baseline procedure and the belief-update choice rests on an observable anticipation metric rather than internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that human-like continual discovery requires non-stationary surprisal and on the empirical claim that embedding RAG best approximates posterior beliefs; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • Domain assumption: surprisal in human reasoning is non-stationary and defined relative to beliefs that evolve with experience.
    Explicitly invoked in the abstract as a prerequisite for continual scientific discovery.

pith-pipeline@v0.9.1-grok · 5808 in / 1257 out tokens · 24167 ms · 2026-06-30T07:55:26.509541+00:00 · methodology

