
arxiv: 2606.01975 · v1 · pith:O6M74BXGnew · submitted 2026-06-01 · 💻 cs.AI · cs.SE

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

Pith reviewed 2026-06-28 14:33 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords tensor networks · contraction order optimization · LLM · algorithm development · evolutionary coding agents · verifier-guided · OpenEvolve

The pith

Verifier-guided LLM agents show promise for developing better tensor contraction algorithms while human validation stays essential

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the application of large language models to algorithm development and improvement through a case study focused on contraction order optimization for tensor networks. It deploys verifier-guided evolutionary coding agents with OpenEvolve and tests the effects of different LLMs, evaluation metrics, and test instances on the generated solutions. The work establishes that these agents can produce algorithmic improvements for the contraction task. At the same time it shows that human scientists must continue to perform evaluation, validation, and interpretation of the outputs. The case study therefore illustrates both the capabilities and the current limits of LLM assistance in creating scientific algorithms.
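To make the optimization target concrete, the following minimal sketch counts the multiplications needed to contract a small network along a given pairwise order, and builds an order with a simple greedy rule. Everything in it is an illustrative assumption rather than the paper's code: the names `contraction_flops` and `greedy_path`, the cost model (product of the dimensions of all indices touched by a pairwise contraction), and the restriction that every index appears on at most two tensors.

```python
from itertools import combinations
from math import prod


def contraction_flops(tensors, dims, path):
    """Multiplication count for contracting `tensors` pairwise along `path`.

    tensors: list of index-label tuples, one per tensor (hypothetical format).
    dims:    dict mapping index label -> dimension.
    path:    list of (i, j) positions into the current tensor list; the two
             tensors are removed and their product is appended at the end.
    """
    current = [frozenset(t) for t in tensors]
    total = 0
    for i, j in path:
        a, b = current[i], current[j]
        total += prod(dims[k] for k in a | b)  # cost of this pairwise step
        for pos in sorted((i, j), reverse=True):
            current.pop(pos)
        current.append(a ^ b)  # shared indices are summed over
    return total


def greedy_path(tensors, dims):
    """Pick the cheapest pairwise contraction at every step (toy greedy rule)."""
    current = [frozenset(t) for t in tensors]
    path = []
    while len(current) > 1:
        i, j = min(
            combinations(range(len(current)), 2),
            key=lambda p: prod(dims[k] for k in current[p[0]] | current[p[1]]),
        )
        path.append((i, j))
        a, b = current[i], current[j]
        current.pop(j)  # j > i, so pop the larger position first
        current.pop(i)
        current.append(a ^ b)
    return path


# Matrix chain A[i,j] B[j,k] C[k,l] with two small and two large dimensions.
chain = [("i", "j"), ("j", "k"), ("k", "l")]
dims = {"i": 2, "j": 100, "k": 2, "l": 100}
print(contraction_flops(chain, dims, [(0, 1), (0, 1)]))  # (AB)C -> 800
print(contraction_flops(chain, dims, [(1, 2), (0, 1)]))  # A(BC) -> 40000
print(greedy_path(chain, dims))                          # [(0, 1), (0, 1)]
```

Even on this three-tensor chain the two orders differ by a factor of 50 in cost, which is why the choice of contraction order matters for larger networks.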

Core claim

Verifier-guided evolutionary coding agents that use LLMs can develop and improve algorithms for contraction order optimization in tensor networks, yet the process still requires human evaluation, validation, and interpretation to ensure the results are reliable and meaningful.

What carries the argument

Verifier-guided evolutionary coding agents that iteratively propose, test, and refine code for tensor-network contraction ordering
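The propose, test, refine cycle can be sketched as a small loop. This is a toy illustration under stated assumptions, not OpenEvolve's implementation: `evolve`, `toy_propose`, and `toy_verify` are hypothetical names, the random perturbation stands in for an LLM proposing a code change, and the verifier returns either a score (higher is better) or `None` for a rejected candidate.

```python
import random


def evolve(initial, propose, verify, iterations=200, population_size=10, seed=0):
    """Keep a small population of verified candidates and return the best one.

    propose(parent, rng) -> candidate   (stand-in for an LLM edit)
    verify(candidate)    -> float score, or None if the candidate is invalid
    """
    rng = random.Random(seed)
    base = verify(initial)
    if base is None:
        raise ValueError("initial candidate must pass the verifier")
    population = [(base, initial)]
    for _ in range(iterations):
        _, parent = rng.choice(population)   # pick a parent to refine
        child = propose(parent, rng)         # propose a variation
        score = verify(child)                # test it
        if score is None:                    # verifier rejected it
            continue
        population.append((score, child))
        population.sort(key=lambda entry: entry[0], reverse=True)
        del population[population_size:]     # keep only the best few
    return population[0]


def toy_verify(candidate):
    # Hypothetical verifier: reject out-of-range values, reward closeness to 3.
    if any(abs(x) > 10 for x in candidate):
        return None
    return -sum((x - 3.0) ** 2 for x in candidate)


def toy_propose(parent, rng):
    # Hypothetical proposer: small random perturbation of the parent.
    return [x + rng.gauss(0.0, 0.5) for x in parent]


best_score, best = evolve([0.0, 0.0], toy_propose, toy_verify)
print(best_score, best)
```

The loop only ever trusts what the verifier reports, so a badly chosen score is optimized just as readily as a good one; that is one reason the choice of metric and test instances matters.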

If this is right

  • Design choices for the evaluation metric and test instances directly influence the quality of algorithms generated by the agents
  • The same verifier-guided approach can be used to attempt algorithmic improvements on other tensor-network related tasks
  • Human oversight remains necessary to interpret agent outputs and confirm they solve the intended scientific problem
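The figure captions further down report an "average reduction of log10 FLOPs". One plausible reading of that metric is sketched below, as an assumption: the function name `avg_log10_flops_reduction` is hypothetical, and the paper's exact definition (including its combined score) may differ.

```python
from math import log10
from statistics import mean


def avg_log10_flops_reduction(baseline_flops, candidate_flops):
    """Mean over instances of log10(baseline) - log10(candidate).

    Positive values mean the candidate needs fewer FLOPs on average;
    a value of 1.0 corresponds to one order of magnitude.
    """
    if not baseline_flops or len(baseline_flops) != len(candidate_flops):
        raise ValueError("need two non-empty lists of equal length")
    return mean(
        log10(b) - log10(c) for b, c in zip(baseline_flops, candidate_flops)
    )


# Two instances: a tenfold saving on the first, no change on the second.
print(avg_log10_flops_reduction([1000, 100], [100, 100]))  # 0.5
```

Because the average is taken in log space, one instance with a very large saving can offset several small regressions, so the set of test instances shapes what the score rewards.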

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to contraction optimization in other scientific computing domains that rely on similar ordering problems
  • Results may change if different LLMs or verification procedures are substituted in the evolutionary loop
  • Longer-term use might require new benchmarks that better capture real-world tensor network performance beyond the study instances

Load-bearing premise

That conclusions drawn from this single tensor-network case study with its chosen metrics, test instances, and LLM will generalize to algorithmic development tasks in other domains

What would settle it

An experiment in which the contraction-order algorithms produced by the LLM agents are shown to be inferior to established human-written methods when measured on a broader collection of tensor networks outside the original test set

Figures

Figures reproduced from arXiv: 2606.01975 by Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges.

Figure 1
Figure 1: Examples for tensor networks.
Text following the figure (truncated in the source): Consequently, contraction orders for TNs are often determined using heuristics or metaheuristics, or by dedicated algorithms for restricted families of networks. We restrict ourselves to a very concise overview over some common approaches:
  • Greedy and random-greedy heuristics are fast and simple, and can be surprisingly effective on many instances [SG18, GK21]
  • Optimization-…
Figure 2
Figure 2: Evolution of the average reduction of log10 FLOPs for various LLMs on the “full small” TNs set.
Table captured with the figure (truncated in the source):
Name | by | size (active) | release | deployment
GPT-OSS-120B | OpenAI (US) | 117B (5.1B) | 08/2025 | ChatAI/Blablador
GPT-OSS-20B | OpenAI (US) | 21B (3.6B) | 08/2025 | self-hosted (vllm)
Qwen3-235B-A22B-Instruct | Alibaba (CN) | 235B (22B) | 04/2025 | ChatAI/Blablador
Qwen3-30B-A3B-Instruct | Alibaba (CN) | 30.5B (3.3B) | 04/2025 | ChatAI
Qwen3-30B-A3B-Thinking | Alibaba (CN) | 30.5B (3.3B) | …
Figure 3
Figure 3: Evolution of the average reduction of log10 FLOPs on the “reduced small” TNs set over 20 runs with different random seeds. The LLM is GPT-OSS-20B with different levels of reasoning effort (“low” to “high” from left to right).
Figure 4
Figure 4: Distribution of average reduction of log10 FLOPs on the “reduced small” TNs set over 20 runs with different random seeds after 1000, 500, and 250 iterations, respectively. The LLM is GPT-OSS-20B with different levels of reasoning effort (“low” to “high”). The bar left to the violin plot indicates mean (circle), standard deviation, and median (star).
Figure 5
Figure 5: Evolution of various metrics (“metrics measured”, columns) over the iterations for five experiments with different metric optimized (“metric optimized”, rows) with GPT-OSS-20B on the “reduced small” TNs set. Light blue dots indicate individual members of the population, whereas black crosses indicate members of the population that have just become a new best solution (w.r.t. the metric optimized). The num…
Figure 6
Figure 6: Distribution of log10 FLOPs for the final best solution of four experiments with GPT-OSS-20B (evolution on reduced “small”, “middle”, “large” and “all” TNs sets; columns), evaluated on the full “small” to “large” TNs sets (rows). The gray and blue histograms show the distribution of FLOPs of the initial and the final best code, respectively. The dotted green line indicates the distribution observed for th…
Figure 7
Figure 7: Number of lines of code, percentage of comments, and relative code complexity for the codes leading at times in Experiment 1; cf. …
Figure 8
Figure 8: Number of lines of code, percentage of comments, and relative code complexity for the codes from Experiment 2; cf. …
Figure 9
Figure 9: Evolution of various metrics for the best run (GPT-OSS-20B on the “full small” TNs set with reasoning level “high”). Light blue dots indicate individual members of the population, whereas black crosses indicate members of the population that have just become a new best solution. The gray dotted lines indicate the level of the initial code, whereas the green dotted line indicates the threshold for an impro…
Figure 10
Figure 10: Distribution of log10 FLOPs for the final code of the best run (blue) vs different baselines (“cotengra cheap”, “cotengra+cmaes”, and “cotengra+optuna” in green, red, and purple, respectively) on different TNs sets (columns). The top row uses the “full” TNs sets, whereas the bottom row uses only the last 100 of the TNs from the respective sets. The numbers in the upper left corners give the average impro…
Figure 11
Figure 11: Distribution of runtimes for computing the contraction order with the final code of the best run (blue) vs different baselines (“cotengra cheap”, “cotengra+cmaes”, and “cotengra+optuna” in green, red, and purple, respectively) on different TNs sets (columns). The top row uses the “full” TNs sets, whereas the bottom row uses only the last 100 of the TNs from the respective sets. The numbers in the upper l…
Original abstract

We consider LLM-based algorithm development through a case study on contraction order optimisation for tensor networks with OpenEvolve. We pay particular attention to the choice of the LLM as well as design choices such as evaluation metric and test instances. Our results highlight both the promise of verifier-guided evolutionary coding agents for algorithm development/improvement and the continuing importance of evaluation, validation, and interpretation -- and corresponding challenges -- by the human scientist.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents a case study on LLM-based algorithm development using verifier-guided evolutionary coding agents (OpenEvolve) for contraction-order optimization in tensor networks. It examines the impact of LLM choice, evaluation metrics, and test instances, concluding that such agents show promise for algorithmic improvement while underscoring the essential role of human evaluation, validation, and interpretation.

Significance. If the case study demonstrates measurable improvements over baselines with the described setup, the work would illustrate a concrete application of LLMs to a combinatorial optimization task in tensor networks and reinforce the value of hybrid human-AI workflows. However, the single narrow domain limits broader significance for algorithmic development in general without evidence of transfer.

major comments (2)
  1. [Abstract] Abstract: the central claim that verifier-guided evolutionary coding agents show promise for algorithm development/improvement rests on results from one tensor-network case study using one LLM, one set of test instances, and one evaluation metric. No evidence is supplied that the observed improvements or necessity of human interpretation would hold for other problems.
  2. [Abstract] Abstract: no quantitative results, error bars, baseline comparisons, or description of how improvements were measured are supplied, so it is impossible to judge whether the central claim is supported by data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments. We address the major comments point by point below, agreeing where revisions to the abstract are warranted to better reflect the case-study nature of the work and to include quantitative details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that verifier-guided evolutionary coding agents show promise for algorithm development/improvement rests on results from one tensor-network case study using one LLM, one set of test instances, and one evaluation metric. No evidence is supplied that the observed improvements or necessity of human interpretation would hold for other problems.

    Authors: The manuscript is explicitly framed as a case study on contraction-order optimization in tensor networks, as stated in the abstract and introduction. The central claim concerns the observed promise and the essential role of human validation within this specific setting; we do not claim or provide evidence that the results transfer to other algorithmic problems. We will revise the abstract to more clearly emphasize the case-study scope and the absence of broader transfer evidence. revision: yes

  2. Referee: [Abstract] Abstract: no quantitative results, error bars, baseline comparisons, or description of how improvements were measured are supplied, so it is impossible to judge whether the central claim is supported by data.

    Authors: The full manuscript contains quantitative results, baseline comparisons (including standard contraction-order heuristics), error bars from multiple runs, and explicit descriptions of the evaluation metric and test instances. However, the abstract does not summarize these elements. We will revise the abstract to incorporate key quantitative highlights and measurement details. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical case study with no derivation chain

full rationale

The paper is a case study reporting experimental results from applying an LLM-based evolutionary coding agent (OpenEvolve) to one specific task: contraction-order optimization for tensor networks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim that such agents 'show promise' rests on direct empirical observations rather than any self-referential reduction to inputs by construction. Generalization concerns are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or postulated entities; ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5605 in / 937 out tokens · 22831 ms · 2026-06-28T14:33:15.377721+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1]

    Agrawal, O

    doi:10.18653/v1/2025.eval4nlp-1.12 [AAGa24] E. Agrawal, O. Alam, C. Goenka, et al. Code Compass: A Study on the Challenges of Navigating Unfamiliar Codebases. CoRR abs/2405.06271,

  2. [2]

    Assumpção, D

    doi:10.48550/ARXIV.2405.06271 [AFCa25] H. Assumpção, D. Ferreira, L. Campos, et al. CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

  3. [3]

    CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

    doi:10.48550/arXiv.2510.14150 [AKSa24] V. Aglietti, I. Ktena, J. Schrouff, et al. FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch

  4. [4]

    https://arxiv.org/abs/2406.04824 [ATSa26] L. A. Agrawal, S. Tan, D. Soylu, et al. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

  5. [5]

    https://arxiv.org/abs/2507.19457 [BB17] J. Biamonte, V. Bergholm. Tensor Networks in a Nutshell [Footnote 8 of the paper, captured at this position: In fact, a tiny and hidden approach to this has already happened when we chose 2 f in the definition of combined score based on a few experiments as indicated at the beginning of Sect. 3]

  6. [6]

    Bäuerle, A

    https://arxiv.org/abs/1708.00006 [BCNa26] A. Bäuerle, A. Connors, A. Novikov, et al. Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

  7. [7]

    Brown, J

    https://arxiv.org/abs/2605.05921 [BHJa25] D. Brown, J. He, H. Jenne, et al. Even with AI, Bijection Discovery is Still Hard: The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction

  8. [8]

    Ballard, T

    doi:10.48550/arXiv.2511.20987 [BK25] G. Ballard, T. G. Kolda. Tensor Decompositions for Data Science. Cambridge University Press, June

  9. [9]

    Beel, M.-Y

    doi:10.1017/9781009471664 [BKB25] J. Beel, M.-Y. Kan, M. Baumgart. Evaluating Sakana’s AI Scientist: Bold Claims, Mixed Results, and a Promising Future? SIGIR Forum 59(1):1–20, Oct

  10. [10]

    doi:10.1145/3769733.3769747 [BNL26] J. Bhan, N. Nobili, P. Langer. New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search

  11. [11]

    Caravaca, Ángel Cuevas, R

    https://arxiv.org/abs/2605.01120 [CCC25] F. Caravaca, Ángel Cuevas, R. Cuevas. From Prompts to Power: Measuring the Energy Footprint of LLM Inference

  12. [12]

    Cheng, S

    https://arxiv.org/abs/2511.05597 [CLPa25] A. Cheng, S. Liu, M. Pan, et al. Let the Barbarians In: How AI Can Accelerate Systems Performance Research

  13. [13]

    Cheng, L

    doi:10.48550/arXiv.2512.14806 [CZH26] A. Cheng, L. Zhang, G. He. Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision

  14. [14]

    https://arxiv.org/abs/2508.20729 [Dec99] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence 113(1–2):41–85,

  15. [15]

    doi:10.1016/S0004-3702(99)00059-4 [DFGa18] E. F. Dumitrescu, A. L. Fisher, T. D. Goodrich, et al. Benchmarking treewidth as a practical component of tensor network simulations. PLOS ONE 13(12),

  16. [16]

    Fernando, D

    doi:10.1371/journal.pone.0207827 [FBMa24] C. Fernando, D. Banarse, H. Michalewski, et al. Promptbreeder: self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning. ICML’24. JMLR.org,

  17. [17]

    https://dl.acm.org/doi/10.5555/3692070.3692611 [Fei22] D. G. Feitelson. Considerations and Pitfalls for Reducing Threats to the Validity of Controlled Experiments on Code Comprehension. Empirical Software Engineering 27(6):123, Jun

  18. [18]

    Felderer, M

    doi:10.1007/s10664-022-10160-3 [FGGa25] M. Felderer, M. Goedicke, L. Grunske, et al. Investigating Research Software Engineering: Toward RSE Research. Commun. ACM 68(2):20–23, Jan

  19. [19]

    Fisher, V

    doi:10.1145/3685265 [FKSa26] D. Fisher, V. Khrulkov, M. Saygin, et al. LLM-Guided Evolutionary Search for Algebraic T-Count Optimization

  20. [20]

    Feldt, P

    https://arxiv.org/abs/2603.29894 [FLFa26] R. Feldt, P. Lenberg, J. Frattini, et al. The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE

  21. [21]

    https://arxiv.org/abs/2604.15468 [FLXa26] R. Fu, Y. Liu, Q. Xu, et al. MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

  22. [22]

    Goodfellow, Y

    https://arxiv.org/abs/2604.26591 [GBC16] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. [GD04] V. Gogate, R. Dechter. A complete anytime algorithm for treewidth. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. UAI ’04, p....

  23. [23]

    Georgiev, J

    https://dl.acm.org/doi/10.5555/1036843.1036868 [GGTa25] B. Georgiev, J. Gómez-Serrano, T. Tao, et al. Mathematical exploration and discovery at scale

  24. [24]

    https://arxiv.org/abs/2511.02864 [GK21] J. Gray, S. Kourtis. Hyper-optimized tensor network contraction. Quantum 5:410,

  25. [25]

    doi:10.22331/q-2021-03-15-410 [GRSa25] P. W. Goncalves, P. Rani, M.-A. Storey, et al. Code Review Comprehension: Reviewing Strategies Seen Through Code Comprehension Theories. In 2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC), pp. 589–601. IEEE Computer Society, Los Alamitos, CA, USA, Apr

  26. [26]

    Gottweis, W.-H

    doi:10.1109/ICPC66645.2025.00068 [GWDa25] J. Gottweis, W.-H. Weng, A. Daryin, et al. Towards an AI co-scientist

  27. [27]

    Imajuku, K

    https://arxiv.org/abs/2502.18864 [IHIa] Y. Imajuku, K. Horie, Y. Iwata, et al. ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering. NeurIPS

  28. [28]

    Ibrahim, D

    https://arxiv.org/abs/2506.09050 [ILHa22] C. Ibrahim, D. Lykov, Z. He, et al. Constructing Optimal Contraction Trees for Tensor Network Quantum Circuit Simulation

  29. [29]

    Iacovides, W

    https://arxiv.org/abs/2209.02895 [IZLa25] G. Iacovides, W. Zhou, C. Li, et al. Domain-Aware Tensor Network Structure Search

  30. [30]

    Jiang, F

    https://arxiv.org/abs/2505.23537 [JWSa26] J. Jiang, F. Wang, J. Shen, et al. A Survey on Large Language Models for Code Generation. ACM Trans. Softw. Eng. Methodol. 35(2), Jan

  31. [31]

    doi: 10.1145/3747588

    doi:10.1145/3747588 [KGBa25] V. Khrulkov, A. Galichin, D. Bashkirov, et al. GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms

  32. [32]

    https://arxiv.org/abs/2511.17592 [Kjæ90] U. B. Kjærulff. Triangulation of Graphs – Algorithms Giving Small Total State Space. Technical report R 90-09, Aalborg University,

  33. [33]

    Kumar, A

    https://cse.unl.edu/~choueiry/Documents/Kjaerulff-TR-1990.pdf [KSNa26] U. Kumar, A. Saito, H. Niranjani, et al. Evolving Interpretable Constitutions for Multi-Agent Coordination

  34. [34]

    Klowden, T

    https://arxiv.org/abs/2602.00755 [KT26] T. Klowden, T. Tao. Mathematical methods and human thought in the age of AI

  35. [35]

    https://arxiv.org/abs/2603.26524 [LGWa26] Z. Liu, X. Guo, X. Wei, et al. Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

  36. [36]

    https://arxiv.org/abs/2604.23472 [LIC25] R. T. Lange, Y. Imajuku, E. Cetin. ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

  37. [37]

    https://arxiv.org/abs/2509.19349 [LLLa26] C. Lu, C. Lu, R. T. Lange, et al. Towards end-to-end automation of AI research. Nature 651(8107):914–919, Mar

  38. [38]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    See also https://arxiv.org/abs/2408.06292. doi:10.1038/s41586-026-10265-5 [LMSa26] K.-A. Lie, O. Møyner, E. Svee, et al. Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction

  39. [39]

    https://arxiv.org/abs/2603.00214 [LTYa24] F. Liu, X. Tong, M. Yuan, et al. Evolution of heuristics: towards efficient automatic algorithm design using large language model. In Proceedings of the 41st International Conference on Machine Learning. ICML’24. JMLR.org,

  40. [40]

    Can Language Models Discover Scaling Laws?

    https://dl.acm.org/doi/abs/10.5555/3692070.3693374 [LYFa26] H. Lin, H. Ye, W. Feng, et al. Can Language Models Discover Scaling Laws?

  41. [41]

    https://arxiv.org/abs/2507.21184 [LZ] X.-Y. Liu, Z. Zhang. Classical Simulation of Quantum Circuits Using Reinforcement Learning: Parallel Environments and Benchmark. In NeurIPS

  42. [42]

    https://proceedings.neurips.cc/paper_files/paper/2023/file/d41b70011dd21ec3de5e019302279551-Paper-Datasets_and_Benchmarks.pdf [LZCa25] G. Liu, Y. Zhu, J. Chen, et al. Scientific Algorithm Discovery by Augmenting AlphaEvolve with Deep Research

  43. [43]

    https://arxiv.org/abs/2510.06056 [LZX+24] F. Liu, R. Zhang, Z. Xie, R. Sun, K. Li, X. Lin, Z. Wang, Z. Lu, Q. Zhang. LLM4AD: A Platform for Algorithm Design with Large Language Model

  44. [44]

    https://arxiv.org/abs/2412.17287 [MMMa] E. A. Meirom, H. Maron, S. Mannor, et al. Optimizing Tensor Network Contraction Using Reinforcement Learning. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022). https://proceedings.mlr.press/v162/meirom22a.html [MS08] I. L. Markov, Y. Shi. Simulating Quantum Computation by Contract...

  45. [45]

    Mitchener, A

    doi:10.1137/050644756 [MYCa25] L. Mitchener, A. Yiu, B. Chang, et al. Kosmos: An AI Scientist for Autonomous Discovery

  46. [46]

    https://arxiv.org/abs/2511.02824 [MZKa26] V. A. Mazin, M. A. Zorin, D. S. Korzh, et al. LLM-Guided Prompt Evolution for Password Guessing

  47. [47]

    Nagaitsev, L

    https://arxiv.org/abs/2604.12601 [NGWI25] K. Nagaitsev, L. Grbcic, S. Williams, C. Iancu. Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

  48. [48]

    Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

    https://arxiv.org/abs/2511.16964 [NRT] D. Neumüller, A. Raschke, M. Tichy. Providing Information About Implemented Algorithms Improves Program Comprehension: A Controlled Experiment. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering. EASE ’25. doi:10.1145/3756681.3756968 [NVEa25] A. Novikov, N. V˜ ...

  49. [49]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    https://arxiv.org/abs/2506.13131 [O’G] B. O’Gorman. Parameterization of Tensor Network Contraction. In 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2019). doi:10.4230/LIPIcs.TQC.2019.10 [Orú14] R. Orús. A Practical Introduction to Tensor Networks: Matrix Product States and Projected Entangled Pair States. A...

  50. [50]

    A practical introduction to tensor networks: Matrix product states and projected entangled pair states,

    doi:10.1016/j.aop.2014.06.013 [PA+25] O. Press, B. Amos et al. AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

  51. [51]

    Romera-Paredes, M

    https://arxiv.org/abs/2507.15887 [RBNa24] B. Romera-Paredes, M. Barekatain, A. Novikov, et al. Mathematical discoveries from program search with large language models. Nature 625:468–475,

  52. [52]

    Pawan Kumar, Emilien Dupont, Francisco J

    doi:10.1038/s41586-023-06924-6 [RBSa25] P. Rajput, A. A. Bonkoungou, Y. Song, et al. Dynamic Stability of LLM-Generated Code

  53. [53]

    https://arxiv.org/abs/2511.07463 [RTL76] D. J. Rose, R. E. Tarjan, G. S. Lueker. Algorithmic Aspects of Vertex Elimination on Graphs. SIAM Journal on Computing 5(2):266–283,

  54. [54]

    Staudt, M

    doi:10.1137/0205021 [SBK+] C. Staudt, M. Blacher, J. Klaus, F. Lippmann, J. Giesen. Improved Cut Strategy for Tensor Network Contraction Orders. In 22nd International Symposium on Experimental Algorithms (SEA 2024). doi:10.4230/LIPIcs.SEA.2024.27 [SG18] D. G. A. Smith, J. Gray. opt_einsum - A Python package ...

  55. [55]

    Schlag, T

    doi:10.21105/joss.00753 [SHGa23] S. Schlag, T. Heuer, L. Gottesbüren, et al. High-Quality Hypergraph Partitioning. ACM J. Exp. Algorithmics 27, Feb

  56. [56]

    doi:10.1145/3529090 [SISa25] M. L. Siddiq, A. Islam-Gomes, N. Sekerak, et al. Large Language Models for Software Engineering: A Reproducibility Crisis

  57. [57]

    Schindler, A

    https://arxiv.org/abs/2512.00651 [SJ20] F. Schindler, A. S. Jermyn. Algorithms for tensor network contraction ordering. Machine Learning: Science and Technology 1(3):035001,

  58. [58]

    Stoian, R

    doi:10.1088/2632-2153/ab94c5 [SMM24] M. Stoian, R. M. Milbradt, C. B. Mendl. On the Optimal Linear Contraction Order of Tree Tensor Networks, and Beyond. SIAM Journal on Scientific Computing 46(5):B647–B668,

  59. [59]

    Šurina, A

    doi:10.1137/23M161286X [ŠMQa25] A. Šurina, A. Mansouri, L. C. P. M. Quaedvlieg, et al. Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning

  60. [60]

    Siegmund, J

    https://arxiv.org/abs/2504.05108 [SS15] J. Siegmund, J. Schumann. Confounding parameters on program comprehension: a literature survey. Empirical Software Engineering 20(4):1159–1192, Aug

  61. [61]

    Strasser

    doi:10.1007/s10664-014-9318-8 [Str17] B. Strasser. Computing Tree Decompositions with FlowCutter: PACE 2017 Submission

  62. [62]

    https://arxiv.org/abs/1709.08949 [Tam] H. Tamaki. Positive-instance driven dynamic programming for treewidth. In 25th Annual European Symposium on Algorithms (ESA 2017). doi:10.4230/LIPIcs.ESA.2017.68 [TDPa26] A. Torri, P. Dominikowski, B. Pointal, et al. Near-Optimal Contraction Strategies for the Scalar Product in the Tensor-Train Format. In Nagel et a...

  63. [63]

    doi:10.1007/978-3-031-99872-0_5 [TGZa24] M. Tian, L. Gao, S. D. Zhang, et al. SciCode: A Research Coding Benchmark Curated by Scientists

  64. [64]

    Thach, A

    https://arxiv.org/abs/2407.13168 [TRHC25] N. Thach, A. Riahifar, N. Huynh, H. Chan. RedAHD: Reduction-Based End-to-End Automatic Heuristic Design with Large Language Models

  65. [65]

    Verstraete, D

    https://arxiv.org/abs/2505.20242 [VPC04] F. Verstraete, D. Porras, J. I. Cirac. Density Matrix Renormalization Group and Periodic Boundary Conditions: A Quantum Information Perspective. Phys. Rev. Lett. 93:227205, Nov

  66. [66]

    doi:10.1103/PhysRevLett.93.227205 [WQBa26] J. Wen, L. Qiu, J. Benton, et al. Automated Weak-to-Strong Researcher

  67. [67]

    https://alignment.anthropic.com/2026/automated-w2s-researcher/ [WSZa25] Y

    Anthropic Alignment Science Blog. https://alignment.anthropic.com/2026/automated-w2s-researcher/ [WSZa25] Y. Wang, S.-R. Su, Z. Zeng, et al. ThetaEvolve: Test-time Learning on Open Problems

  68. [68]

    https://arxiv.org/abs/2511.23473 [XZLa23] J. Xu, H. Zhang, L. Liang, et al. NP-Hardness of Tensor Network Contraction Ordering

  69. [69]

    https://arxiv.org/abs/2310.06140 [YLDa25] J. Yuan, H. Li, X. Ding, et al. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  70. [70]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    https://arxiv.org/abs/2506.09501 [YLLa25] Y. Yamada, R. T. Lange, C. Lu, et al. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

  71. [71]

    https://arxiv.org/abs/2504.08066 [YWCa24] H. Ye, J. Wang, Z. Cao, et al. ReEvo: large language models as hyper-heuristics with reflective evolution. In Proceedings of the 38th International Conference on Neural Information Processing Systems. NIPS ’24. Curran Associates Inc., Red Hook, NY, USA,

  72. [72]

    https://dl.acm.org/doi/10.5555/3737916.3739297 [YZLL24] J. Yang, K. Zhou, Y. Li, Z. Liu. Generalized Out-of-Distribution Detection: A Survey. International Journal of Computer Vision 132:5635–5662,

  73. [73]

    doi:10.1007/s11263-024-02117-4 [ZLSa24] J. Zeng, C. Li, Z. Sun, et al. tnGPS: Discovering Unknown Tensor Network Structure Search Algorithms via Large Language Models (LLMs). In Proceedings of the 41st International Conference on Machine Learning

  74. [74]

    EVOLVE-BLOCK-START

    https://arxiv.org/abs/2602.08253

    System Message for OpenEvolve: You are an expert programmer and expert in tensor networks. Your goal is to evolve and improve the code of the function `find_edge_path` in between the markers "EVOLVE-BLOCK-START" and "EVOLVE-BLOCK-END". CONTEXT: The function `find_edge_path` ge...