arxiv: 2605.30148 · v1 · pith:RVVQGMDFnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

Pith reviewed 2026-06-29 08:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords evolution strategies · large language models · fine-tuning · performance drift · continual learning · weight decay · catastrophic forgetting

The pith

Anchored weight decay largely eliminates prior-task drift during evolution strategies fine-tuning of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that performance changes on prior tasks during ES fine-tuning of LLMs are mostly reversible drift rather than permanent loss, and that the same drift appears under reinforcement learning methods as well. It traces the drift to random movement in directions of the weight space that are weakly constrained by the new task objective. Anchored Weight Decay counters this by adding a penalty term that pulls parameters back toward their starting values. The result is stable prior-task performance alongside effective learning on the target task, achieved at lower computational cost than simply increasing the ES population size.

Core claim

Prior-task forgetting under ES is largely avoidable. Drift arises from random walk behavior in weakly constrained directions of the weight space. Anchored Weight Decay constrains optimization toward the initial model parameters, stabilizing prior-task performance while preserving target-task performance and delivering benefits comparable to large population sizes at reduced cost.
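
The random-walk picture can be illustrated with a toy model. The following is a minimal sketch, not the paper's analysis: it assumes a single weight-space direction in which the target task provides no signal, so each update along it is zero-mean noise. The function name `simulate_drift` and the parameters `lr`, `noise_std`, and `lam` are hypothetical, chosen only for illustration.

```python
import numpy as np

def simulate_drift(steps, lr, noise_std, lam, n_runs, seed=0):
    """Mean squared displacement from the start along one flat direction, per step."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_runs)      # displacement from the initial weight (anchor at 0)
    msd = np.empty(steps)
    for t in range(steps):
        # Zero-mean noise stands in for the gradient estimate in a direction
        # the target task does not constrain (an assumption of this toy model).
        noise = rng.normal(0.0, noise_std, size=n_runs)
        # lam = 0 gives a pure random walk; lam > 0 pulls back toward the start.
        x = x + lr * (noise - lam * x)
        msd[t] = np.mean(x ** 2)
    return msd

free = simulate_drift(steps=2000, lr=0.1, noise_std=1.0, lam=0.0, n_runs=2000)
anchored = simulate_drift(steps=2000, lr=0.1, noise_std=1.0, lam=0.5, n_runs=2000)
print(f"final mean squared displacement: free={free[-1]:.2f}, anchored={anchored[-1]:.3f}")
```

In this toy model, `lam=0` makes the mean squared displacement grow linearly with the number of steps (by `lr**2 * noise_std**2` per step), while `lam > 0` makes it saturate at `lr**2 * noise_std**2 / (1 - (1 - lr*lam)**2)`. Whether real ES fine-tuning behaves this way is the paper's empirical claim, not something this sketch establishes.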

What carries the argument

Anchored Weight Decay (AWD), a regularization term added to the ES objective that penalizes Euclidean distance between current and initial parameters.
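
A minimal sketch of how such an anchor could enter an ES update, assuming a standard antithetic-sampling ES gradient estimator and an ℓ2 penalty applied as a decoupled pull toward the initial weights. The names `es_awd_step`, `reward_fn`, `sigma`, `lr`, and `lam` are illustrative and do not come from the paper; the paper's exact update rule, penalty form (the Figure 5 caption mentions both ℓ1 and ℓ2 variants), and coefficient schedule may differ.

```python
import numpy as np

def es_awd_step(theta, theta0, reward_fn, rng, pop_size=30, sigma=0.1, lr=0.05, lam=0.1):
    """One ES update with an l2 anchor toward theta0 (illustrative, not the paper's code)."""
    half = pop_size // 2
    eps = rng.standard_normal((half, theta.size))
    # Antithetic pairs: evaluate the reward at theta + sigma*eps and theta - sigma*eps.
    r_plus = np.array([reward_fn(theta + sigma * e) for e in eps])
    r_minus = np.array([reward_fn(theta - sigma * e) for e in eps])
    # Monte Carlo estimate of the gradient of the Gaussian-smoothed reward.
    grad = ((r_plus - r_minus)[:, None] * eps).sum(axis=0) / (2 * half * sigma)
    # Anchor term: gradient of (lam / 2) * ||theta - theta0||^2.
    anchor = lam * (theta - theta0)
    # Ascend the reward estimate while being pulled back toward theta0.
    return theta + lr * (grad - anchor)

# Usage on a hypothetical toy reward that only depends on the first coordinate.
rng = np.random.default_rng(0)
theta0 = np.zeros(3)
theta = theta0.copy()
for _ in range(200):
    theta = es_awd_step(theta, theta0, lambda th: -(th[0] - 1.0) ** 2, rng)
print(theta)
```

Applying the penalty analytically, rather than folding it into each population member's reward, keeps the anchor term free of sampling noise; that is a design choice of this sketch, not a detail confirmed by the source.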

If this is right

  • ES fine-tuning becomes practical for sequential task learning without requiring replay of old data.
  • The same stabilization effect can be obtained at lower cost than scaling the ES population size.
  • Drift is a general optimization issue, not unique to ES, so similar anchoring may help other fine-tuning methods.
  • Prior-task performance can recover spontaneously even without regularization once the new-task signal strengthens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other black-box optimizers that also rely on population-based updates.
  • Combining AWD with modest population sizes might allow continual learning on models too large for full gradient methods.
  • If drift is mainly a consequence of under-constrained directions, then task-specific loss weighting might interact with AWD in predictable ways.

Load-bearing premise

The premise that penalizing deviation from the initial parameters will block harmful random walks without also blocking the parameter changes needed to learn the new task.

What would settle it

A controlled run in which AWD is applied, prior-task metrics remain flat, but target-task metrics fail to improve beyond the starting point.

Figures

Figures reproduced from arXiv: 2605.30148 by Conor F. Hayes, Kajetan Schweighofer, Risto Miikkulainen, Roberto Dailey, Xin Qiu.

Figure 1: Anchored Weight Decay (AWD) mitigates prior task forgetting. (a) Target task accuracy (Countdown) vs. average prior task accuracy. The dotted line denotes prior task accuracy for the original model with weights θ0. Standard ES shows unstable prior-task accuracy during training iterations (denoted by color), while AWD stabilizes it. (b) Prior tasks Pi and target task T occupy different high-performance regi…
Figure 2: Target task accuracy vs. average prior task accuracy throughout training. Colors denote the iteration within the training process. For ES, prior task accuracy exhibits noticeable drift, where accuracy decreases initially, but often recovers later in training. Moreover, forgetting does not arise under all settings. GRPO exhibits less drift in prior task accuracy, but can also lead to severe forgetting in so…
Figure 3: Individual prior task accuracies training on Countdown as target task. Colors denote the iteration within the training process. A performance drift rather than irreversible forgetting is observed across multiple tasks. For example, performance on HellaSwag drops by 8% accuracy over the first 300 iterations, but subsequently recovers to its original level by the final iteration. Countdown, GSM8K and ProofWr…
Figure 4: Change in final prior-task accuracy under different target tasks and base models. Boxplots for the change (∆) in prior task accuracy for ES and GRPO. Each boxplot aggregates results across all experimental dimensions (target-tasks, prior-tasks, model types) not defined via the x-axis. Target task, model family (restricting aggregation to 3B models) and model size (restricting aggregation to Qwen models) ar…
Figure 5: Effect of the AWD penalty factor λ on target- and prior-task accuracy for (a) ℓ1 and (b) ℓ2 penalty function. Target task is Countdown, dashed line shows target- and prior task accuracies for ES without AWD, dotted line shows prior-task accuracy of the base model. For both penalty functions, there is an intermediate range of λ that preserves target-task performance while mitigating forgetting.
Figure 6: Norm ||θt − θ0||2 of weight updates. Larger ES populations and AWD reduce update norms. ES + AWD (ℓ2) converges to levels similar to ES with population size 128. GRPO updates remain over an order of magnitude smaller.
Figure 7: KL-divergence KL(pθ0 || pθT) between base and final model trained on Countdown. ES with a small population size (30) causes larger shifts on prior tasks than GRPO; increasing population size or adding AWD mostly closes this gap. On the target task Countdown, GRPO shows very high divergence while it remains low for all ES variants, despite all having similar target task accuracy.
Figure 8: Individual prior task accuracies training on GSM8K as target task. There is no systematic forgetting observed for ES nor for GRPO. (Plot panels recovered from the page: Countdown, ProofWriter, GSM8K, HellaSwag, MMLU-Pro, ARC-Challenge, and PIQA accuracy; series: ES (30) and GRPO.)
Figure 9: Individual prior task accuracies training on ProofWriter as target task. Strong forgetting is observed for GRPO on GSM8K as a prior task, as well as moderate forgetting on Countdown and MMLU-Pro. For ES, mild forgetting is observed for HellaSwag, ARC-Challenge, and PIQA. The penalty factor λ was not tuned for individual configurations, showing that the values suggested by the ablation in the main paper, for…
Figure 10: Individual prior task accuracies training on Countdown as target task with and without AWD. While forgetting is observed for standard ES, there is essentially no forgetting in any task with ES + AWD (ℓ2), except a slight degradation on MMLU-Pro.
Figure 11: KL-divergence KL(pθ0 || pθT) for different target tasks across prior tasks for ES (population size 30) and GRPO. The y-axis shows training task and fine-tuning method, the x-axis shows the task used for evaluation. When evaluating on the target task (grey boxes), ES (30) attains lower KL-divergence than GRPO, though both have similar accuracies. For the prior-tasks, GRPO mostly has lower KL-Divergences th…
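
The two diagnostics named in the captions of Figures 6, 7, and 11 (the update norm ||θt − θ0||2 and the divergence KL(pθ0 || pθT)) can be sketched as follows. This is an illustration under assumptions: the captions do not say how the KL-divergence is aggregated, so the sketch averages over token positions of next-token distributions, given logits arrays of shape (positions, vocabulary). The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_kl(base_logits, tuned_logits):
    """Mean over positions of KL(p_base || p_tuned) between next-token distributions."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(tuned_logits)
    per_position = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(np.mean(per_position))

def update_norm(theta_t, theta_0):
    """Euclidean distance between current and initial (flattened) parameters."""
    return float(np.linalg.norm(theta_t - theta_0))

# Usage with made-up logits for 5 positions and a vocabulary of 8 tokens.
rng = np.random.default_rng(0)
base = rng.standard_normal((5, 8))
tuned = base + 0.1 * rng.standard_normal((5, 8))
print(mean_token_kl(base, tuned), update_norm(tuned.ravel(), base.ravel()))
```

The KL direction matters: with the base model as the first argument, the measure weights positions and tokens by what the base model considered likely, which matches the KL(pθ0 || pθT) notation in the captions.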
Original abstract

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that prior-task forgetting during ES fine-tuning of LLMs is better described as recoverable performance drift rather than irreversible loss, occurs under RL fine-tuning as well, arises from random-walk dynamics in weakly constrained weight-space directions, and can be mitigated by a new Anchored Weight Decay (AWD) regularizer that anchors parameters to their initial values. AWD is reported to stabilize prior-task performance while preserving target-task gains at lower cost than simply increasing ES population size.

Significance. If the empirical claims hold, the work would reposition ES as a practical method for continual LLM fine-tuning by showing that drift is avoidable via a lightweight parameter-space penalty, with potential advantages in simplicity and scalability over RL-based continual learning. The mechanistic analysis of drift in high-dimensional optimization could inform regularization choices more broadly.

major comments (2)
  1. [Abstract and the section presenting AWD results] The central claim that AWD 'stabilizes prior-task performance while preserving target-task performance' (abstract) is load-bearing and rests on the assumption that directions needed for the new task are either already well-constrained or that the AWD coefficient can be chosen so target progress is unaffected. If the target task optimum lies primarily along the weakly constrained directions identified in the random-walk analysis, the same L2 penalty to initial weights that halts prior-task drift will also impede target-task improvement; the manuscript must supply quantitative trade-off curves (e.g., target vs. prior performance vs. AWD strength) on tasks where initial and target optima are demonstrably distant.
  2. [Section comparing ES and RL forgetting] The assertion that drift 'is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods' requires direct side-by-side experiments under matched conditions (same model, tasks, and hyper-parameter regimes) to establish that the phenomenon is comparable in magnitude and mechanism; without such controls the generality claim remains under-supported.
minor comments (2)
  1. [Method section introducing AWD] Notation for the AWD term (e.g., the precise form of the anchor penalty and how its coefficient is scheduled) should be stated explicitly with an equation number rather than described only in prose.
  2. [Figures showing performance trajectories] Figure captions and axis labels for any recovery or trade-off plots should explicitly state the population size, number of generations, and AWD coefficient values used.
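
The trade-off raised in major comment 1 can be made concrete with a toy quadratic, offered here as a sketch under stated assumptions rather than a model of the paper's experiments. Assume the target loss is ||θ − θ*||² for a hypothetical target optimum `theta_star`, and the anchor penalty is λ||θ − θ0||². The penalized minimizer then has the closed form (θ* + λθ0) / (1 + λ), so the residual gap to the target optimum grows with λ in proportion to the distance between θ0 and θ*.

```python
import numpy as np

def penalized_optimum(theta_star, theta0, lam):
    """Minimizer of ||theta - theta_star||^2 + lam * ||theta - theta0||^2."""
    return (theta_star + lam * theta0) / (1.0 + lam)

theta0 = np.zeros(4)
theta_star = np.full(4, 2.0)  # hypothetical target optimum, far from theta0
for lam in [0.0, 0.1, 1.0, 10.0]:
    theta = penalized_optimum(theta_star, theta0, lam)
    target_gap = np.linalg.norm(theta - theta_star)   # how far from the target optimum
    anchor_dist = np.linalg.norm(theta - theta0)      # how far from the initial weights
    print(f"lam={lam:5.1f}  target_gap={target_gap:.3f}  anchor_dist={anchor_dist:.3f}")
```

In this toy setting there is no value of λ that reduces the distance from θ0 without also increasing the gap to θ*, which is why the referee asks for trade-off curves on tasks where the two optima are far apart. The paper's reported intermediate range of λ (Figure 5) would correspond to cases where the target task can be solved close to θ0.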

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, indicating where we agree that additional material will strengthen the manuscript.

  1. Referee: [Abstract and the section presenting AWD results] The central claim that AWD 'stabilizes prior-task performance while preserving target-task performance' (abstract) is load-bearing and rests on the assumption that directions needed for the new task are either already well-constrained or that the AWD coefficient can be chosen so target progress is unaffected. If the target task optimum lies primarily along the weakly constrained directions identified in the random-walk analysis, the same L2 penalty to initial weights that halts prior-task drift will also impede target-task improvement; the manuscript must supply quantitative trade-off curves (e.g., target vs. prior performance vs. AWD strength) on tasks where initial and target optima are demonstrably distant.

    Authors: We appreciate the referee's emphasis on potential trade-offs. Our task selection and random-walk analysis were chosen such that target-task progress occurs without requiring large movement along the weakly constrained directions that drive prior-task drift; the reported AWD results reflect coefficients that achieve this balance. To directly address the concern, we will add quantitative trade-off curves (prior-task and target-task performance versus AWD coefficient) for tasks where the initial and target optima are demonstrably distant. These will be included in the revised manuscript. revision: yes

  2. Referee: [Section comparing ES and RL forgetting] The assertion that drift 'is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods' requires direct side-by-side experiments under matched conditions (same model, tasks, and hyper-parameter regimes) to establish that the phenomenon is comparable in magnitude and mechanism; without such controls the generality claim remains under-supported.

    Authors: We agree that more tightly matched hyper-parameter regimes would provide stronger evidence. Our existing experiments already employ the same model and tasks for both ES and RL; hyper-parameters were tuned separately because the optimizers have fundamentally different requirements. We will add a controlled comparison under more closely aligned hyper-parameter settings and will clarify the mechanistic similarities (random-walk behavior in weakly constrained directions) in the text. This addition will be made in revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; central claims rest on empirical results and a new regularization term

The paper characterizes forgetting as performance drift via experiments, attributes it to random-walk dynamics in weakly constrained weight directions through analysis of ES training, and introduces Anchored Weight Decay (AWD) as a parameter-space L2 penalty to initial weights. No equations reduce any prediction to a fitted input by construction, no self-citations are load-bearing for the core argument, and the effectiveness of AWD is demonstrated empirically rather than derived tautologically. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that drift is caused by random walks in weakly constrained directions and that AWD can selectively constrain those directions without harming target-task learning.

axioms (1)
  • domain assumption: ES training dynamics involve random walk behavior in weakly constrained directions of the weight space that produces recoverable performance drift.
    Invoked to explain when and why drift arises.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Knowing in Advance When an Evolutionary Outer Loop Will Not Help: A Pre-Registered Cheap-Baseline Screening Rule

    cs.CL 2026-06 unverdicted novelty 5.0

    A screening rule skips evolutionary outer loops when the ratio of best single-shot gain to best cheap gain meets or exceeds 90%, validated on pre-registered lab cases where the gate fired and loops were abandoned.
