
arxiv: 2605.19633 · v1 · pith:B5OTAX5Lnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.LG · cs.NE · cs.SE

optimize_anything: A Universal API for Optimizing any Text Parameter

Pith reviewed 2026-05-20 05:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG cs.NE cs.SE
keywords text optimization, LLM-based search, universal API, agent architectures, CUDA kernels, scheduling algorithms, cross-task transfer, ARC-AGI
venue CAIS ’26, May 26–29, 2026, San Jose, CA, USA

The pith

A single LLM-based system for optimizing text artifacts achieves state-of-the-art results across six diverse tasks including agent design and code generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that optimization problems can be reframed as improving a text artifact evaluated by a scoring function, letting one AI system handle single-task search, multi-task transfer, and generalization to new inputs. This unifies tasks that normally need separate domain-specific algorithms under a shared text-optimization framework. A sympathetic reader would care because the approach reportedly nearly triples Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), cuts cloud scheduling costs by 40 percent, and generates CUDA kernels of which 87% match or beat PyTorch. Ablations show that actionable side information beyond raw scores speeds convergence and raises final scores, and that sharing search across related tasks boosts final performance. The work positions text-based LLM search as a general-purpose solver rather than a collection of narrow tools.

Core claim

When optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system—supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs—achieves state-of-the-art results across six diverse tasks.

What carries the argument

The optimize_anything API, which casts any parameter optimization as refinement of a text artifact guided by an external scoring function and solved via LLM-driven search.
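
As a rough illustration of this framing, the sketch below runs a greedy propose-and-evaluate loop over a string. Everything in it is an assumption made for illustration: `optimize_text`, `evaluate`, and `make_proposer` are hypothetical names rather than the optimize_anything API, the hidden target string is a toy scoring function, and a deterministic letter-cycling proposer stands in for the LLM.

```python
import itertools
import string
from typing import Callable, Tuple

# Hypothetical types for illustration; not the optimize_anything API.
Evaluator = Callable[[str], Tuple[float, str]]   # artifact -> (score, feedback)
Proposer = Callable[[str, float, str], str]      # (artifact, score, feedback) -> candidate

TARGET = "kernel"  # toy objective, known only to the evaluator


def evaluate(artifact: str) -> Tuple[float, str]:
    """Score = number of matching characters; feedback names the first mismatch."""
    matches = sum(a == b for a, b in zip(artifact, TARGET))
    wrong = [i for i, (a, b) in enumerate(zip(artifact, TARGET)) if a != b]
    feedback = f"first mismatch at index {wrong[0]}" if wrong else "all positions match"
    return float(matches), feedback


def make_proposer() -> Proposer:
    """Stand-in for an LLM: edit the position named in the feedback, cycling letters."""
    letters = itertools.cycle(string.ascii_lowercase)

    def propose(artifact: str, score: float, feedback: str) -> str:
        if not feedback.startswith("first mismatch"):
            return artifact  # nothing left to fix
        idx = int(feedback.rsplit(" ", 1)[1])
        return artifact[:idx] + next(letters) + artifact[idx + 1:]

    return propose


def optimize_text(seed: str, evaluate: Evaluator, propose: Proposer, budget: int) -> Tuple[str, float]:
    """Greedy search: keep a candidate only if it scores strictly higher."""
    best = seed
    best_score, feedback = evaluate(best)
    for _ in range(budget):
        candidate = propose(best, best_score, feedback)
        score, candidate_feedback = evaluate(candidate)
        if score > best_score:
            best, best_score, feedback = candidate, score, candidate_feedback
    return best, best_score


best, best_score = optimize_text("aaaaaa", evaluate, make_proposer(), budget=200)
print(best, best_score)
```

The loop only ever sees the artifact as text plus whatever the evaluator returns, which is the property the paper's framing relies on; a real proposer would be an LLM call, and the acceptance rule would likely be richer than strict greedy improvement.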

If this is right

  • Actionable side information produces faster convergence and higher final scores than score-only feedback.
  • Multi-task search outperforms independent per-task optimization given an equivalent per-problem budget, through cross-task transfer.
  • Benefits from multi-task search increase as the number of related tasks grows.
  • The system generalizes its discovered solutions to inputs not seen during optimization.
  • The same framework discovers agent architectures, scheduling policies, CUDA kernels, and geometric packings without task-specific redesign.
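
To make the first point concrete, here is a toy comparison of score-only feedback against feedback that also names the failing position. It is a sketch under stated assumptions: the hidden target string, the `first_mismatch` diagnostic, and the letter-cycling proposers are all hypothetical stand-ins, and the toy says nothing about the size of the effect reported in the paper.

```python
import itertools
import string

TARGET = "kernel"  # toy objective, hidden from both proposers
ALPHABET = string.ascii_lowercase


def score(artifact: str) -> int:
    """Scalar score: number of characters that match the target."""
    return sum(a == b for a, b in zip(artifact, TARGET))


def first_mismatch(artifact: str) -> int:
    """Side information: index of the first wrong character (-1 if none)."""
    return next((i for i, (a, b) in enumerate(zip(artifact, TARGET)) if a != b), -1)


def evals_to_solve(use_side_info: bool, seed: str = "aaaaaa", max_evals: int = 10_000) -> int:
    """Count evaluator calls until the greedy search reaches the target."""
    best, best_score = seed, score(seed)
    letters = itertools.cycle(ALPHABET)
    # Score-only search has no hint, so it sweeps every (position, letter) edit.
    blind_edits = itertools.cycle(itertools.product(range(len(seed)), ALPHABET))
    evals = 0
    while best_score < len(TARGET) and evals < max_evals:
        if use_side_info:
            idx, letter = first_mismatch(best), next(letters)
        else:
            idx, letter = next(blind_edits)
        candidate = best[:idx] + letter + best[idx + 1:]
        evals += 1
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return evals


guided = evals_to_solve(use_side_info=True)
blind = evals_to_solve(use_side_info=False)
print(f"with side information: {guided} evaluations; score-only: {blind}")
```

Both searches use the same scorer and the same acceptance rule; the only difference is whether the proposer is told where the artifact fails, which is the distinction the ablation isolates.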

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that representing candidate solutions as editable text may let LLMs serve as universal black-box optimizers across additional domains such as materials design or financial strategies.
  • If cross-task transfer continues to scale, future versions could maintain a shared library of successful text patterns that accelerate optimization on entirely new problems.
  • A practical test would measure how much the performance edge shrinks when the scoring functions are replaced by noisier or more expensive real-world evaluators.
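
The noisy-evaluator test in the last point could be set up along these lines. This is a minimal sketch, assuming zero-mean Gaussian noise as the noise model; `with_noise`, `averaged`, and `clean_score` are hypothetical helpers, not part of the paper or its code.

```python
import random
import statistics
from typing import Callable

Scorer = Callable[[str], float]


def with_noise(scorer: Scorer, sigma: float, rng: random.Random) -> Scorer:
    """Add zero-mean Gaussian noise to every score (hypothetical noise model)."""
    return lambda artifact: scorer(artifact) + rng.gauss(0.0, sigma)


def averaged(scorer: Scorer, repeats: int) -> Scorer:
    """Average several noisy evaluations; costs `repeats` scorer calls per artifact."""
    return lambda artifact: statistics.fmean(scorer(artifact) for _ in range(repeats))


def clean_score(artifact: str) -> float:
    """Toy scorer: longer text scores higher."""
    return float(len(artifact))


rng = random.Random(0)
noisy = with_noise(clean_score, sigma=1.0, rng=rng)
smoothed_scorer = averaged(noisy, repeats=16)

single = [noisy("abc") for _ in range(2000)]
smoothed = [smoothed_scorer("abc") for _ in range(2000)]
print(statistics.stdev(single), statistics.stdev(smoothed))
```

Averaging reduces the spread of the score but multiplies the evaluation cost by `repeats`, so a fair comparison would charge those extra calls against the same optimization budget.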

Load-bearing premise

The scoring functions supplied for each task are reliable proxies for real performance and need no further domain-specific engineering to deliver the reported gains.

What would settle it

The universality claim would be falsified if the system, applied to a new task outside the original six, failed to match or exceed the best existing specialized method for that task.

Figures

Figures reproduced from arXiv: 2605.19633 by Alexandros G. Dimakis, Dan Klein, Donghyun Lee, Ion Stoica, Joseph E. Gonzalez, Karim Elmaaroufi, Koushik Sen, Lakshya A Agrawal, Matei Zaharia, Omar Khattab, Rohit Sandadi, Sanjit A. Seshia, Shangyin Tan, Wenjie Ma.

Figure 1: The optimize_anything loop: a text artifact 𝑥 is passed to an evaluator 𝑓(𝑥) which returns a score plus diagnostic feedback (SI), which is consumed by an LLM proposer to produce an improved artifact. The same API instantiates across domains: code optimization, prompt tuning, agent architecture search, and policy discovery. We observe that a wide range of problems can be formulated as optimizing a text art…
Figure 2: Claude Code on the Bleve repository. Optimized …
Figure 3: Optimization trajectories for cloud scheduling. Both …
Figure 4: ARC-AGI agent architecture evolution with Gem…
Figure 5: AIME prompt optimization for GPT-4.1-mini. Val…
Figure 9: Ablation: prompt optimization with vs. without SI …
Figure 8: Single-task vs. multi-task mode on 10 selected Ker…
Figure 10: Architecture of the optimized ARC-AGI agent. The …
Figure 11: Qualitative comparison between zero-shot generations (left) and optimize_anything candidates (right) across four …
Original abstract

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
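
A quick arithmetic check of the "nearly triple" wording, using only the two accuracy figures quoted in the abstract (the variable names are illustrative):

```python
baseline, optimized = 32.5, 89.5  # ARC-AGI accuracy (%) as quoted in the abstract

ratio = optimized / baseline      # multiplicative improvement
gain = optimized - baseline       # absolute gain in percentage points
print(f"{ratio:.2f}x, +{gain:.1f} points")  # prints "2.75x, +57.0 points"
```

The ratio is about 2.75, so "nearly triple" is a fair description and "triple" would overstate it.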

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces optimize_anything, a universal API for optimizing any text parameter via an LLM-based system. It claims that formulating optimization problems as improving a text artifact evaluated by a scoring function enables a single system—supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs—to achieve state-of-the-art results across six diverse tasks. Specific results include nearly tripling Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), cutting cloud costs by 40% via improved scheduling, generating CUDA kernels where 87% match or beat PyTorch, and outperforming AlphaEvolve on circle packing (n=26). Ablations indicate that actionable side information outperforms score-only feedback and that multi-task search yields benefits through cross-task transfer that scale with the number of related tasks. The system is open-sourced as part of the GEPA project.

Significance. If the results hold after addressing methodological details, the work would be significant as a demonstration that LLM-based text optimization can serve as a general-purpose paradigm unifying tasks traditionally addressed by domain-specific algorithms. The empirical breadth across six tasks, the ablations on side information and multi-task transfer, and the open-source release with multiple backends are strengths that support reproducibility and potential adoption. The approach could reduce the need for specialized tools if the optimizer's contribution can be isolated from task-specific components.

major comments (3)
  1. [Abstract] The central claim that a single system achieves SOTA results across six tasks rests on reported quantitative gains (e.g., ARC-AGI 32.5% to 89.5%, 40% cost reduction, 87% CUDA kernels matching or beating PyTorch). However, the abstract supplies no explicit construction of the scoring functions, no baselines, no statistical tests, and no ablation controls, making it impossible to determine whether these gains are attributable to the optimizer or to unstated properties of the scorers.
  2. [Methods] Scoring function definitions (Methods section): The universality argument requires that scoring functions serve as minimal, fixed, neutral proxies. If the paper does not demonstrate via sensitivity analysis or comparison to generic accuracy/latency metrics that equivalent gains cannot be obtained without task-specific test suites or reward shaping inside the scorer, then the reported advantage may partly reflect scorer engineering rather than the search procedure itself.
  3. [Results] Ablations (Results section): The claim that multi-task search outperforms independent optimization 'given equivalent per-problem budget' is load-bearing for the cross-problem transfer result. Without a precise description of how the per-problem budget is defined and allocated in the independent baseline (including total LLM calls or wall-clock time), the comparison cannot be evaluated for fairness.
minor comments (2)
  1. [Results] The manuscript should include a table summarizing the six tasks, their scoring functions, and the exact baselines used for each SOTA comparison.
  2. [Ablations] Notation for 'actionable side information' versus 'score-only feedback' should be defined explicitly at first use to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, clarifying the manuscript's contributions while making targeted revisions to improve transparency and rigor.

Point-by-point responses
  1. Referee: [Abstract] The central claim that a single system achieves SOTA results across six tasks rests on reported quantitative gains (e.g., ARC-AGI 32.5% to 89.5%, 40% cost reduction, 87% CUDA kernels matching or beating PyTorch). However, the abstract supplies no explicit construction of the scoring functions, no baselines, no statistical tests, and no ablation controls, making it impossible to determine whether these gains are attributable to the optimizer or to unstated properties of the scorers.

    Authors: We acknowledge that the abstract's brevity limits inclusion of full methodological details. Scoring functions are defined in the Methods section as minimal, task-appropriate metrics (e.g., test-set accuracy for ARC-AGI without additional shaping, measured cost for scheduling). Baselines, statistical comparisons where applicable, and ablations appear in Results. To improve accessibility, we will revise the abstract to briefly note the scoring-function formulation and reference the main text for baselines and controls. The controlled single- vs. multi-task experiments and side-information ablations are designed to isolate the optimizer's contribution from scorer properties.
    revision: partial

  2. Referee: [Methods] Scoring function definitions (Methods section): The universality argument requires that scoring functions serve as minimal, fixed, neutral proxies. If the paper does not demonstrate via sensitivity analysis or comparison to generic accuracy/latency metrics that equivalent gains cannot be obtained without task-specific test suites or reward shaping inside the scorer, then the reported advantage may partly reflect scorer engineering rather than the search procedure itself.

    Authors: Scoring functions use standard, fixed metrics without reward shaping: accuracy on held-out ARC-AGI examples, runtime and correctness for CUDA kernels, and direct cost for scheduling. The same optimizer is applied across all tasks, with ablations showing gains from search strategy rather than scorer changes. While a dedicated sensitivity analysis to generic metrics was not present in the original submission, the cross-task transfer results and side-information comparisons provide evidence that the optimizer drives performance. We will add a subsection in Methods explicitly discussing scorer neutrality and comparing against generic metrics where feasible.
    revision: yes

  3. Referee: [Results] Ablations (Results section): The claim that multi-task search outperforms independent optimization 'given equivalent per-problem budget' is load-bearing for the cross-problem transfer result. Without a precise description of how the per-problem budget is defined and allocated in the independent baseline (including total LLM calls or wall-clock time), the comparison cannot be evaluated for fairness.

    Authors: The per-problem budget is defined as an equal allocation of total LLM calls (optimization steps) to each task in the independent baseline, with the multi-task setting using the same aggregate budget but allowing information sharing. This is stated in Results, but we agree a more precise specification would strengthen the claim. We will revise the section to include an explicit definition, a budget-allocation table, and pseudocode distinguishing the two regimes, ensuring the comparison accounts for total compute and wall-clock time.
    revision: yes
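
The budget definition in this response can be sketched as simple call accounting. This is not the authors' pseudocode: `independent_budgets` and `multi_task_schedule` are hypothetical helpers, the task names are invented, and round-robin task selection in the shared run is an assumption made only so the totals can be compared.

```python
from collections import Counter
from typing import Dict, List


def independent_budgets(total_calls: int, tasks: List[str]) -> Dict[str, int]:
    """Independent baseline: split the call budget evenly, one optimizer per task."""
    base, extra = divmod(total_calls, len(tasks))
    # Any remainder goes to the first `extra` tasks, one call each.
    return {task: base + (1 if i < extra else 0) for i, task in enumerate(tasks)}


def multi_task_schedule(total_calls: int, tasks: List[str]) -> List[str]:
    """Multi-task regime: one shared run; each step spends one call on one task.

    Round-robin task selection is an assumption made for this sketch.
    """
    return [tasks[step % len(tasks)] for step in range(total_calls)]


tasks = ["kernel_a", "kernel_b", "kernel_c"]  # hypothetical task names
total_calls = 10

independent = independent_budgets(total_calls, tasks)
shared = Counter(multi_task_schedule(total_calls, tasks))

# Both regimes must spend the same number of calls in aggregate.
assert sum(independent.values()) == sum(shared.values()) == total_calls
print(independent)
print(dict(shared))
```

Under this accounting the two regimes differ only in whether candidates and feedback are pooled across tasks, which is the comparison the referee asks to have pinned down; wall-clock time would need to be tracked separately.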

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent experimental outcomes

Full rationale

The paper formulates optimization as improving text artifacts under provided scoring functions and reports empirical gains across six tasks (e.g., ARC-AGI accuracy lift, cloud-cost reduction, CUDA kernel performance). No equations, derivations, or self-referential definitions appear. Claims of universality and cross-task transfer are supported by ablations and observed performance rather than reducing by construction to fitted inputs, self-citations, or renamed known results. Scoring functions are treated as external inputs whose quality is an assumption, not a load-bearing derivation step internal to the optimizer. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the approach implicitly assumes LLMs can perform effective black-box search over text spaces when given a scorer, with no explicit free parameters or new physical entities described.

axioms (1)
  • domain assumption: LLM-based iterative text editing guided by scalar scores can discover high-performing solutions across unrelated domains.
    This is the load-bearing premise enabling the universal API claim.



