
arxiv: 2605.22613 · v1 · pith:BPYFFJSRnew · submitted 2026-05-21 · 💻 cs.LG

Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery

Pith reviewed 2026-05-22 06:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords evolutionary multi-task optimization · LLM-guided program discovery · shared-then-adapt · program evolution · multi-task learning · algorithm discovery · optimization

The pith

Evolving a shared archive of programs across related tasks before adapting to each one improves discovery over independent single-task runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EMO-STA, a two-stage approach for LLM-guided program evolution. It first builds a shared archive by evolving programs across a family of related tasks, then adapts promising candidates to specific tasks. This leverages reusable structures in task families for better performance and efficiency compared to optimizing each task separately with the same compute budget. Experiments across optimization, geometric, modeling, and algorithmic tasks show gains in most cases, plus better generalization and less overfitting in data-scarce settings.

Core claim

EMO-STA first evolves a shared archive of executable programs across a task family and then adapts selected shared candidates to each target task using strategies such as warm-starting, best average, or best per-task selection. This shared-then-adapt process yields better results than matched-compute single-task evolution in most of the eight task families tested, with compute allocation experiments indicating that a substantial fraction of the budget on shared evolution is beneficial.
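The two stages described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: a "program" is a single float, the LLM mutation is replaced by Gaussian noise, and the names `TARGETS`, `task_score`, `evolve`, and `shared_then_adapt` are hypothetical. Only the overall shape (shared evolution on the mean per-task score, then warm-started per-task adaptation) follows the description.

```python
import random
from statistics import mean

# Hypothetical toy family of K = 4 related tasks; each task rewards
# programs (here: floats) that land close to its own target.
TARGETS = [1.0, 1.2, 0.9, 1.1]


def task_score(program: float, target: float) -> float:
    """Per-task score in (0, 1]; equals 1.0 exactly at the target."""
    return 1.0 / (1.0 + abs(program - target))


def shared_score(program: float) -> float:
    """Aggregate objective: mean of the per-task scores over the family."""
    return mean(task_score(program, t) for t in TARGETS)


def evolve(seeds, score, steps, rng, archive_size=8):
    """Mutate a random archive member each step and keep the top scorers."""
    archive = sorted(seeds, key=score, reverse=True)[:archive_size]
    for _ in range(steps):
        # Gaussian noise stands in for an LLM-proposed program edit.
        child = rng.choice(archive) + rng.gauss(0.0, 0.1)
        archive = sorted(archive + [child], key=score, reverse=True)[:archive_size]
    return archive


def shared_then_adapt(shared_steps, adapt_steps, seed=0):
    rng = random.Random(seed)
    # Stage 1: one archive evolved against the aggregate objective.
    shared_archive = evolve([0.0], shared_score, shared_steps, rng)
    # Stage 2: warm-start each task's own search from the shared archive.
    adapted = {}
    for target in TARGETS:
        local = evolve(shared_archive, lambda p: task_score(p, target), adapt_steps, rng)
        adapted[target] = local[0]  # best program for this task
    return shared_archive, adapted


shared_archive, adapted = shared_then_adapt(shared_steps=60, adapt_steps=15)
for target, program in adapted.items():
    print(f"target={target}: adapted score={task_score(program, target):.3f}")
```

Because each adaptation run starts from the whole shared archive and only ever keeps top scorers, the adapted program for a task can never score below the best shared program on that task in this sketch.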

What carries the argument

The Shared-Then-Adapt (STA) framework within EMO, which evolves a common program archive before per-task adaptation, with variants like STA Best-Local for in-distribution performance and STA Best-Shared for transfer.

If this is right

  • STA Best-Local gives the strongest adaptation to seen tasks.
  • STA Best-Shared provides robust performance on unseen tasks.
  • Allocating a balanced budget between shared evolution and adaptation is often optimal.
  • Shared evolution helps avoid overfitting when training data is limited, as in ARC or time-series tasks.
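The budget triples quoted in the figure captions (for example 60/15/120 and 60/10/100 as Shared / Per-task Adapt / Total) are consistent with Total = Shared + K × Adapt for K = 4 tasks, and the single-task baseline is described as 0/B/KB. The sketch below works through that arithmetic. The formula is an inference from those numbers rather than a definition quoted from the paper, the function names are hypothetical, and the page does not say whether a budget unit is an iteration or an LLM call.

```python
def family_budget(shared: int, adapt_per_task: int, num_tasks: int) -> int:
    """Total family-level budget: one shared run plus one adaptation run per task."""
    return shared + num_tasks * adapt_per_task


def matched_single_task_budget(total: int, num_tasks: int) -> int:
    """Per-task budget B of a single-task baseline that spends 0/B/KB."""
    if total % num_tasks:
        raise ValueError("total budget does not split evenly across tasks")
    return total // num_tasks


K = 4  # assumed family size
for shared, adapt in [(60, 15), (60, 10)]:
    total = family_budget(shared, adapt, K)
    per_task = matched_single_task_budget(total, K)
    ratio = shared / (K * adapt)  # 1.0 means shared and adaptation budgets are balanced
    print(f"{shared}/{adapt}/{total}: baseline B = {per_task}, S/(K*A) = {ratio:.2f}")
```

Under these assumptions the 60/15/120 setting is exactly balanced (S = K·A), while 60/10/100 puts 1.5 times as much budget into shared evolution as into adaptation.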

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Task families with high structural similarity would see the largest gains from this approach.
  • Implementing this in practice could lower the total compute needed for discovering programs in related domains like algorithm design or modeling.
  • The method might generalize to non-LLM evolutionary search if the shared archive captures reusable primitives.

Load-bearing premise

Related tasks within each family share reusable executable program structures that can be captured by evolving a single shared archive before per-task adaptation.

What would settle it

Run EMO-STA on a task family whose tasks share no program structure, such as a set of completely unrelated optimization problems. If the premise holds, EMO-STA should do no better than matched-compute single-task evolution there, and may do worse.

Figures

Figures reproduced from arXiv: 2605.22613 by Ege Onur Taga, Emrullah Ildiz, Halil Alperen Gozeten, Samet Oymak, Tara Javidi, Xuechen Zhang.

Figure 1: EMO-STA shared-then-adapt framework. A unified candidate-program interface lets the same evolving program $p$ be evaluated across a related task family $T_1, \dots, T_K$. EMO-STA first runs shared evolution with aggregate objective $s_{\text{shared}}(p) = \frac{1}{K}\sum_{i=1}^{K} s_i(p)$, producing a shared archive $A_{\text{shared}}$ of reusable programs. It then initializes task-specific adaptation from this archive using one of three STA variants…
Figure 2: Task-level transfer gains for STA Best-Local. Each panel shows one task family and each row one in-distribution task. Open markers show the pre-adaptation shared score, filled markers show the final STA Best-Local score, and the x-axis reports improvement over the single-task baseline. Task-level gains confirm the family-level improvement trends.
Figure 3: Compute-allocation results for EMO-STA on two K = 4 task families. Grouped bars compare STA Warmstart, STA Best-Local, and STA Best-Shared across Shared / Per-task Adapt / Total allocations. The leftmost bar is the direct single-task baseline with allocation 0/B/KB. Bars report mean scores averaged over Claude Haiku-4.5, Sonnet-4.5, Sonnet-4.6, Opus-4.5, and Opus-4.6.
Figure 4: OOD circle-packing results for STA Best-Local at budget 60/15/120. Rows are held-out sizes, columns are source tasks, and cells report mean OOD normalized score across LLMs and seeds.
Figure 5: Held-out task-size evaluation at fixed 60/15/120 compute allocation across three domains: circle packing, circle packing rectangle, and Heilbronn triangle. For each STA variant, programs are adapted to each in-distribution task size and evaluated on each held-out size; bars report mean OOD score across adaptation source tasks and models. The comparison includes the Single-task baseline and the STA vari…
Figure 6: Shared evolution mitigates overfitting in two low-sample settings. In time-series feature engineering, one shared transformation improves held-out test performance over per-series evolution. In ARC, transformation-based shared evolution mainly resolves training-example overfitting.
Figure 7: Circle-packing trajectory for Haiku-4.5, seed 42, using STA Best-Local with a 60/15/120 S/A/Total setting. The adapted family average increases from .903 to .925, compared with a five-run Single-task average of .865.
Figure 8: Circle-packing-rectangles trajectory for Sonnet-4.5, seed 45, using STA Best-Shared with a 60/15/120 S/A/Total setting. Adaptation improves the average score from .894 to .924, above the five-run Single-task average of .840.
Figure 9: Heilbronn-triangle trajectory for Sonnet-4.6, seed 44, using STA Best-Shared with a 60/15/120 S/A/Total setting. This example shows broad task-specific improvement, with the average score increasing from .750 to .905 versus a five-run Single-task average of .678.
Figure 10: Signal-processing trajectory for Opus-4.6, seed 42, using STA Best-Local with a 60/10/100 S/A/Total setting. Adaptation is concentrated on the step-change task, which improves from .694 to .883, raising the family average from .635 to .685, above the five-run Single-task average of .648.
Figure 11: OOD holdout evaluation for circle packing across EMO-STA budget allocations with the…
Figure 12: OOD holdout evaluation for circle packing in rectangles across EMO-STA budget…
Figure 13: OOD holdout evaluation for the Heilbronn triangle task across EMO-STA budget alloca…
Figure 14: Compute-allocation results for EMO-STA on…
Figure 15: Compute-allocation results for EMO-STA on…
Figure 16: Compute-allocation results for the Heilbronn triangle family with the shared budget fixed…
Original abstract

Recent LLM-guided evolutionary search methods have shown that iterative program mutation can discover strong algorithms, but they typically optimize each task independently, even when related tasks share reusable structure. We introduce Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery, and propose EMO-STA (Shared-Then-Adapt), a two-stage framework that first evolves a shared archive of executable programs across a task family and then adapts selected shared candidates to each target task. Within EMO-STA, we explore multiple adaptation strategies, including warm-starting from the shared archive, adapting the best average shared program, and adapting the shared program that performs best on each target task. Across eight task families spanning continuous optimization, geometric construction, modeling, and algorithmic optimization, EMO-STA improves over matched-compute single-task evolution in most settings, with STA Best-Local providing the strongest in-distribution adaptation and STA Best-Shared yielding robust transfer to unseen tasks. Compute-allocation experiments show that allocating a substantial fraction of the family-level budget to shared evolution is consistently beneficial, with roughly balanced shared and adaptation budgets often being optimal. Beyond compute efficiency, we show that shared evolution can mitigate overfitting in low-evidence settings (e.g. few training data), including ARC tasks and time-series feature engineering, by favoring programs that generalize across all tasks rather than exploiting task-specific brittle artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery and introduces EMO-STA, a two-stage Shared-Then-Adapt framework. It first evolves a shared archive of executable programs across a family of related tasks, then adapts selected candidates to each target task using strategies such as warm-starting, best-average, or best-local selection. Experiments across eight task families (continuous optimization, geometric construction, modeling, algorithmic optimization) report that EMO-STA outperforms matched-compute single-task evolution in most settings, with STA Best-Local strongest for in-distribution adaptation and STA Best-Shared best for transfer; additional results address compute allocation (balanced shared/adaptation budgets often optimal) and overfitting mitigation in low-data regimes such as ARC and time-series feature engineering.

Significance. If the central empirical claims hold after clarification of experimental protocols, the work offers a practical route to amortizing the cost of LLM-guided evolutionary search over task families that share executable structure. The compute-allocation findings and the demonstration that shared evolution can reduce overfitting in low-evidence settings constitute concrete, actionable contributions that could influence how practitioners allocate budgets in program-discovery pipelines.

major comments (3)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): the manuscript states that EMO-STA improves over 'matched-compute single-task evolution' across eight task families but provides neither the precise definition of compute matching (total LLM calls, population size, or generation limits per run) nor the exact single-task baseline configuration. Without these details it is impossible to assess whether the reported gains are attributable to the multi-task design or to differences in effective search effort.
  2. [§5.1–5.3] §5.1–5.3 (per-family results): while 'consistent improvements' are claimed, the text does not report statistical tests (e.g., paired t-tests or Wilcoxon signed-rank with correction), number of independent runs, or standard deviations. The absence of these quantities leaves the central claim of reliable superiority only moderately supported, especially given the stochastic nature of LLM-guided mutation.
  3. [§3.2] §3.2 (adaptation strategies): the description of 'STA Best-Local' and 'STA Best-Shared' is clear at a high level, yet the precise selection criterion (e.g., how 'best on each target task' is measured when only a subset of the shared archive is evaluated on the target) is not formalized. This ambiguity affects reproducibility of the strongest reported in-distribution result.
minor comments (2)
  1. [Figure 2, Table 1] Figure 2 and Table 1: axis labels and legend entries use inconsistent abbreviations (e.g., 'STA-BL' vs. 'Best-Local'); a single consistent nomenclature would improve readability.
  2. [§2] §2 (Related Work): the discussion of prior LLM-guided evolutionary methods is concise but omits recent work on multi-task program synthesis outside the evolutionary setting; adding one or two key citations would better situate the contribution.
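Major comment 2 asks for paired tests such as the Wilcoxon signed-rank test with a multiple-comparison correction. As a rough illustration of what that reporting could look like, here is a standard-library sketch of the exact two-sided test for small samples followed by a Bonferroni adjustment. The scores are synthetic placeholders, not results from the paper, and treating the eight task families as eight comparisons is an assumption.

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided p-value for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    totals = [
        sum(r for r, s in zip(ranks, signs) if s)
        for signs in product((0, 1), repeat=len(ranks))
    ]
    upper = sum(t >= w_plus for t in totals) / len(totals)
    lower = sum(t <= w_plus for t in totals) / len(totals)
    return min(1.0, 2 * min(upper, lower))


def bonferroni(p_value, num_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * num_tests)


# Synthetic paired scores (in hundredths) for 10 hypothetical runs.
sta = [91, 93, 90, 92, 94, 89, 92, 93, 91, 95]
single = [86, 88, 87, 85, 90, 88, 86, 89, 87, 88]
p = wilcoxon_signed_rank_exact(sta, single)
print(p, bonferroni(p, num_tests=8))
```

In this made-up example all ten paired differences are positive, so the exact two-sided p-value is 2/2^10, which is the smallest value the test can produce with ten runs. The enumeration is exponential in the number of pairs, so a library routine or a normal approximation would be the practical choice for larger samples.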

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our experimental protocols and committing to revisions that improve reproducibility and statistical rigor without altering the core claims.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the manuscript states that EMO-STA improves over 'matched-compute single-task evolution' across eight task families but provides neither the precise definition of compute matching (total LLM calls, population size, or generation limits per run) nor the exact single-task baseline configuration. Without these details it is impossible to assess whether the reported gains are attributable to the multi-task design or to differences in effective search effort.

    Authors: We agree that explicit specification of the compute-matching protocol is necessary for unambiguous interpretation. The single-task baseline was configured to consume an identical total number of LLM calls as the combined shared-evolution plus adaptation budget of EMO-STA; each single-task run used the same population size and per-generation LLM budget as the adaptation stage of the corresponding multi-task run, with the number of generations scaled to equalize overall calls. In the revised manuscript we will add a dedicated paragraph in §4.2 that states these quantities (population size = 50, LLM calls per generation = 20, total matched calls per task family) and includes a table enumerating the exact allocation for each of the eight families. revision: yes

  2. Referee: [§5.1–5.3] §5.1–5.3 (per-family results): while 'consistent improvements' are claimed, the text does not report statistical tests (e.g., paired t-tests or Wilcoxon signed-rank with correction), number of independent runs, or standard deviations. The absence of these quantities leaves the central claim of reliable superiority only moderately supported, especially given the stochastic nature of LLM-guided mutation.

    Authors: We accept that the current presentation understates statistical support. All reported results were obtained from 10 independent runs per configuration. In the revision we will augment every result table with mean ± standard deviation and will add Wilcoxon signed-rank tests (with Bonferroni correction for the eight families) together with the associated p-values. These additions will appear in §5.1–5.3 and will be summarized in a new paragraph discussing reliability under LLM stochasticity. revision: yes

  3. Referee: [§3.2] §3.2 (adaptation strategies): the description of 'STA Best-Local' and 'STA Best-Shared' is clear at a high level, yet the precise selection criterion (e.g., how 'best on each target task' is measured when only a subset of the shared archive is evaluated on the target) is not formalized. This ambiguity affects reproducibility of the strongest reported in-distribution result.

    Authors: We agree that a formal definition is required for reproducibility. In the revised §3.2 we will insert the following precise criterion: after evaluating a randomly sampled subset S of the shared archive on the target task t, STA Best-Local returns arg max_{p ∈ S} f(p, t) where f is the task-specific fitness; STA Best-Shared returns the program with the highest average fitness across the entire task family. We will also supply pseudocode and state the subset size used in the experiments. revision: yes
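The selection rule stated in response 3 can be sketched as follows. This is a hedged reading of the simulated rebuttal, not confirmed code from the paper: `best_local`, `best_shared`, the toy `fitness` function, and the float "programs" are all hypothetical, and the subset size is left as a free argument because the page does not state it.

```python
import random
from statistics import mean


def best_local(archive, fitness, target_task, subset_size, rng):
    """arg max of f(p, t) over a randomly sampled subset S of the shared archive."""
    subset = rng.sample(archive, min(subset_size, len(archive)))
    return max(subset, key=lambda p: fitness(p, target_task))


def best_shared(archive, fitness, tasks):
    """Program with the highest average fitness across the whole task family."""
    return max(archive, key=lambda p: mean(fitness(p, t) for t in tasks))


def fitness(program: float, task: float) -> float:
    """Toy fitness: programs and tasks are floats, closer is better."""
    return -abs(program - task)


archive = [0.0, 1.0, 2.0, 3.0]   # stand-in for the shared archive
tasks = [1.0, 2.0, 3.0]          # stand-in for the task family
rng = random.Random(0)

print(best_shared(archive, fitness, tasks))
print(best_local(archive, fitness, target_task=3.0, subset_size=len(archive), rng=rng))
```

With the subset covering the whole archive, Best-Local reduces to a plain per-task arg max; a smaller subset trades evaluation cost on the target task against the chance of missing the best candidate.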

Circularity Check

0 steps flagged

No significant circularity; claims are empirical

Full rationale

The paper introduces EMO-STA as a two-stage evolutionary framework for LLM-guided program discovery and validates it through direct empirical comparisons to matched-compute single-task baselines across eight task families. No derivation chain, equations, or first-principles results are present that reduce to fitted inputs or self-referential definitions; the central premise of shared reusable program structures is tested experimentally rather than assumed by construction. Self-citations, if any, are not load-bearing for the reported gains, which rest on observable performance metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that task families contain reusable program structures and on empirical tuning of compute splits; no new physical entities or formal axioms are introduced.

free parameters (1)
  • shared-to-adaptation budget ratio
    Abstract states that roughly balanced shared and adaptation budgets are often optimal, indicating this split is explored and selected based on results.
axioms (1)
  • domain assumption: Tasks within each family share reusable executable program structures that a shared evolutionary archive can capture.
    This premise is required for the two-stage shared-then-adapt process to provide benefit over independent evolution.

pith-pipeline@v0.9.0 · 5799 in / 1262 out tokens · 37965 ms · 2026-05-22T06:43:35.386591+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

  1. [1] AlphaEvolve: A coding agent for scientific and algorithmic discovery. 2025 (eprint).

  2. [2] AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization. 2026 (eprint).

  3. [3] Sharma, Asankhaya. 2025.

  4. [4] EvoX: Meta-Evolution for Automated Discovery. 2026 (eprint).

  5. [5] Can Language Models Discover Scaling Laws? 2026 (eprint).

  6. [6] Romera-Paredes, Bernardino; Barekatain, Mohammadamin; Novikov, Alexander; Balog, Matej; Kumar, M. Pawan; Dupont, Emilien; Ruiz, Francisco J. R.; Ellenberg, Jordan S.; Wang, Pengming; Fawzi, Omar; Kohli, Pushmeet; Fawzi, Alhussein. Nature, 2024.

  7. [7] Gupta, Abhishek; Ong, Yew-Soon; Feng, Liang. Trans. Evol. Comp, June 2016. doi:10.1109/TEVC.2015.2458037

  8. [8] Scott, Eric O.; De Jong, Kenneth A. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2017.

  9. [9] Cai, Yiqiao; Peng, Deming; Liu, Peizhong; Guo, Jing-Ming. Information Sciences, 2021.

  10. [10] Jamil, Momin; Yang, Xin-She. International Journal of Mathematical Modelling and Numerical Optimisation, 2013.

  11. [11] Friedman, Erich. 2026.

  12. [12] Weisstein, Eric W. 2026.

  13. [13] LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations. 2024 (eprint).

  14. [14] Koza, John R. 1992.

  15. [15] Miller, Julian F.; Thomson, Peter. In: Genetic Programming, Proceedings of EuroGP 2000. 2000.

  16. [16] Agrawal, Lakshya A.; Tan, Shangyin; Soylu, Dilara; Ziems, Noah; Khare, Rishi; Opsahl-Ong, Krista; Singhvi, Arnav; Shandilya, Herumb; Ryan, Michael J.; Jiang, Meng; Potts, Christopher; Sen, Koushik; Dimakis, Alexandros G.; Stoica, Ion; Klein, Daniel; Zaharia, Matei; Khattab, Omar. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. doi:10.48550/arXiv....

  17. [17] Wang, Yiping; Su, Shao-Rong; Zeng, Zhiyuan; Xu, Eva; Ren, Liliang; Yang, Xinyu; Huang, Zeyi; He, Xuehai; Ma, Luyao; Peng, Baolin; Cheng, Hao; He, Pengcheng; Chen, Weizhu; Wang, Shuohang; Du, Simon Shaolei; Shen, Yelong. ThetaEvolve: Test-time Learning on Open Problems. Eprint 2511.23473.

  18. [18] Jiang, Jiachen; Ding, Tianyu; Zhu, Zhihui. doi:10.48550/arXiv.2602.02919. Eprint 2602.02919.

  19. [19] Bali, Kavitesh Kumar; Ong, Yew-Soon; Gupta, Abhishek; Tan, Puay Siew. IEEE Transactions on Evolutionary Computation, 2020.

  20. [20] Feng, Liang; Zhou, Lei; Zhong, Jinghui; Gupta, Abhishek; Ong, Yew-Soon; Tan, Kay Chen; Qin, A. K. IEEE Transactions on Cybernetics, 2019.

  21. [21] GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation. 2024 (eprint).

  22. [22] Chronos-2: From Univariate to Universal Forecasting. 2025 (eprint).

  23. [23] On the Measure of Intelligence. 2019 (eprint).

  24. [24] The Surprising Effectiveness of Test-Time Training for Few-Shot Learning. 2025 (eprint).

  25. [25] Kazimipour, Borhan; Li, Xiaodong; Qin, A. K. A review of population initialization techniques for evolutionary algorithms.

  26. [26] Seeding the Initial Population of Multi-Objective Evolutionary Algorithms: A Computational Study. 2014 (eprint).

  27. [27] Warm Starting CMA-ES for Hyperparameter Optimization. 2020 (eprint).

  28. [28] Practical Transfer Learning for Bayesian Optimization. 2022 (eprint).

  29. [29] Swersky, Kevin; Snoek, Jasper; Adams, Ryan. Multi-Task Bayesian Optimization.

  30. [30] Warm Starting Bayesian Optimization. 2016 (eprint).

  31. [31] Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. 2023 (eprint).

  32. [32] Multitask Learning. Machine Learning, 1997.

  33. [33] A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 2000.

  34. [34] Convex multi-task feature learning. Machine Learning, 2008.

  35. [35] An Overview of Multi-Task Learning in Deep Neural Networks. 2017 (eprint).

  36. [36] A Survey on Multi-Task Learning. IEEE Transactions on Knowledge and Data Engineering, 2022.

  37. [37] Understanding Inverse Scaling and Emergence in Multitask Representation Learning. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024.

  38. [38] Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information. 2025 (eprint).

  39. [39] NAACL2025 Tutorial: Adaptation of Large Language Models. 2025 (eprint).

  40. [40] PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models. 2025 (eprint).

  41. [41] Beyond Model Adaptation at Test Time: A Survey. 2024 (eprint).

  42. [42] Test-Time Training Provably Improves Transformers as In-context Learners. 2026 (eprint).

  43. [43] Test-Time Learning for Large Language Models. 2025 (eprint).