arxiv: 2602.01003 · v2 · submitted 2026-02-01 · 💻 cs.LG · cs.AI

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Zhishen Sun , Sizhe Dang , Guang Dai , Haishan Ye This is my paper

Pith reviewed 2026-05-16 08:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords evolution strategiessharpness-aware maximizationreinforcement learningLLM fine-tuningmemory efficiencymathematical reasoninggeneralization

0 comments

The pith

Evolution strategies combined with sharpness-aware maximization match RL accuracy for LLM math fine-tuning at 18 times lower GPU memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ESSAM as a full-parameter fine-tuning method for large language models on mathematical reasoning that replaces gradient-based reinforcement learning with zero-order search from evolution strategies integrated with sharpness-aware maximization. Experiments on GSM8K show it reaches 78.27 percent average accuracy, comparable to PPO at 77.72 percent and GRPO at 78.34 percent, while cutting average GPU memory by 18 times versus PPO and 10 times versus GRPO. The approach also produces stronger generalization, with best average results on five of six additional datasets, and an accelerated variant delivers nearly twice the speed at the same low memory cost.

Core claim

ESSAM tightly combines the zero-order search in parameter space from Evolution Strategies with Sharpness-Aware Maximization to enable full-parameter fine-tuning of LLMs for mathematical reasoning, achieving 78.27 percent average accuracy on GSM8K that is comparable to or better than PPO and GRPO on some models while reducing average GPU memory usage by 18 times compared to PPO and 10 times compared to GRPO.

What carries the argument

The ESSAM framework, which integrates zero-order parameter search from evolution strategies with sharpness-aware maximization to optimize without gradients and favor flatter minima for generalization.

If this is right

ESSAM achieves comparable or superior accuracy to PPO and GRPO on GSM8K while using far less GPU memory.
Models fine-tuned with ESSAM reach the best average performance on five of six held-out generalization datasets.
The accelerated ESSAM variant maintains the low memory footprint and delivers nearly twofold speedup while outperforming PPO in accuracy.
Full-parameter fine-tuning of LLMs for reasoning tasks becomes feasible under severe GPU memory constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory reduction could allow fine-tuning of models too large for standard RL methods on single high-end consumer GPUs.
Zero-order search might serve as a practical drop-in replacement for backpropagation in other memory-intensive LLM adaptation settings.
The observed generalization gains suggest the method could produce more robust models when applied to noisy or out-of-distribution real-world data.

Load-bearing premise

That zero-order evolution strategies paired with sharpness-aware maximization can substitute for gradient-based reinforcement learning optimization in high-dimensional LLM fine-tuning without meaningful performance loss.

What would settle it

A test applying ESSAM to a new model size or reasoning task where accuracy drops more than five points below the PPO or GRPO baseline while memory measurements confirm the claimed savings.

Figures

Figures reproduced from arXiv: 2602.01003 by Guang Dai, Haishan Ye, Sizhe Dang, Zhishen Sun.

**Figure 2.** Figure 2: The average accuracy of each algorithm on all models for the GSM8K task (%). usage and training procedure, including a standard split of training and evaluation sets, shuffling the training data, and performing multi step updates with small mini batches. This reduces randomness from small sample training and improves training stability and reproducibility. • We conduct experiments on GSM8K, and the results… view at source ↗

**Figure 3.** Figure 3: An example GSM8K problem and the prompt template. Reward function. For verifiable math reasoning tasks such as GSM8K and Countdown, we use a rule based reward function. The rules are as follows, we split the reward into an outcome accuracy reward and an output format reward: Raccuracy(ˆy, y) = ( 1, is equivalent(ˆy, y), 0, otherwise. (15) Rformat(a) =    1.0, the output follows the full … view at source ↗

**Figure 4.** Figure 4: The training mean reward curves of ESSAM and ES. These curves show that ESSAM has better training trend and converges earlier than ES, leading to better computational efficiency. More results are presented in the Appendix E. Qwen-2.5-0.5B-Instruct Qwen-2.5-1.5B-Instruct Qwen-2.5-3B-Instruct Qwen-2.5-7B-Instruct LLaMA-3.2-1B-Instruct LLaMA-3.2-3B-Instruct LLaMA-3.1-8B-Instruct Model 0 50 100 150 200 250 300… view at source ↗

**Figure 5.** Figure 5: GPU memory usage when fine-tuning different LLMs with different algorithms. More details can be found in Appendix D. and architectures on GSM8K, and the results in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: A schematic illustration of how the test set accuracy of models fine-tuned with different algorithms evolves over training. More results are shown in the Appendix F. adds an extra round of sampling and reward evaluation in each iteration and slightly increases the per iteration time compared to ES, ESSAM converges much faster. This allows us to stop training earlier once the target reward is reached, so E… view at source ↗

**Figure 7.** Figure 7: The training mean reward curves of ESSAM and ES. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: A schematic illustration of how the test set accuracy of models fine-tuned with different algorithms evolves over training. G. Example In this section, we present example responses from models fine-tuned with ESSAM. Qwen2.5-0.5b-Instruct Prompt: You are a helpful assistant. You first think about the reasoning process in your mind and then provide the user with the answer. Please solve the following problem… view at source ↗

read the original abstract

Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full parameter fine-tuning framework that tightly combines the zero-order search in parameter space from Evolution Strategies (ES) with the Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematica reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27\% across all models and its overall performance is comparable to RL methods. It surpasses classic RL algorithm PPO with an accuracy of 77.72\% and is comparable to GRPO with an accuracy of 78.34\%, and even surpassing them on some models. Further generalization experiments show that the models trained with ESSAM exhibit stronger generalization ability. Their average performance achieves the best results on 5 out of 6 datasets, indicating that ESSAM can effectively improve the generalization performance of fine-tuned models. In terms of GPU memory usage, ESSAM reduces the average GPU memory usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO, achieving an extremely low GPU memory usage. In addition, we design an accelerated variant of ESSAM, which achieves nearly a twofold speedup while maintaining the same GPU memory usage as ESSAM, and attains an average accuracy of 78.02\% across all models, outperforming PPO. Code: https://github.com/szs777/ESSAM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ESSAM shows you can swap gradients for evolution strategies plus SAM and still match PPO accuracy on GSM8K while cutting memory 10-18x.

read the letter

This paper's main point is that you don't need gradients for RL-style fine-tuning of large language models on reasoning benchmarks. By using evolution strategies for zero-order search in parameter space and pairing it with sharpness-aware maximization, they achieve an average accuracy of 78.27% on GSM8K across models, which edges out PPO at 77.72% and matches GRPO at 78.34%. The real win is the memory: 18 times less GPU memory than PPO and 10 times less than GRPO. They also show stronger generalization on five out of six other datasets and provide an accelerated variant that runs nearly twice as fast without increasing memory.

Referee Report

4 major / 2 minor

Summary. The manuscript proposes ESSAM, a full-parameter fine-tuning method for LLMs that combines zero-order Evolution Strategies (ES) search with Sharpness-Aware Maximization (SAM) to enable memory-efficient reinforcement learning for mathematical reasoning tasks. On GSM8K, it reports an average accuracy of 78.27% across models (comparable to PPO at 77.72% and GRPO at 78.34%), superior generalization on 5 of 6 additional datasets, 18× and 10× GPU memory reductions versus PPO and GRPO, and an accelerated variant achieving 78.02% accuracy with ~2× speedup. The central claim is that this ES+SAM approach can replace gradient-based RL optimizers while preserving performance.

Significance. If the empirical results are robustly verified, the work would be significant for enabling RL fine-tuning of billion-parameter models under severe memory constraints, as it eliminates backpropagation entirely. The integration of SAM for generalization is a plausible strength, and public code availability supports reproducibility. However, the current lack of implementation specifics and statistical controls limits the strength of the contribution relative to existing ES-for-RL literature.

major comments (4)

[Abstract and Experiments] Abstract and Experiments section: The accuracy claims (78.27% ESSAM vs. 77.72% PPO) are presented without error bars, number of runs, or statistical tests despite the inherent stochasticity of both ES and RL; this is load-bearing for the 'comparable or superior' assertion and must be addressed with repeated trials and significance testing.
[Method] Method section: No population size, sampling strategy (e.g., antithetic pairs), or variance-reduction details are given for the ES estimator in ~10^9-dimensional space; standard ES theory shows variance scales with dimension, so the reported near-parity performance and low memory usage cannot be evaluated without these parameters.
[Experiments] Experiments section: The accelerated variant is claimed to deliver ~2× speedup at identical memory and 78.02% accuracy, yet the acceleration mechanism and its effect on estimator variance are unspecified; this detail is required to substantiate the speedup claim.
[Generalization experiments] Generalization experiments: Superior results on 5/6 datasets are asserted after GSM8K fine-tuning, but the evaluation protocol, dataset identities, and whether any hyperparameter tuning occurred on the test sets are omitted; these omissions prevent assessment of the generalization benefit.

minor comments (2)

[Abstract] Abstract: The phrase 'surpassing them on some models' should specify the models and margins for precision.
[Code] Code repository: While the GitHub link is welcome, the manuscript should explicitly list all key hyperparameters (population size, σ, SAM radius, etc.) and random seeds to enable exact reproduction.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our manuscript. We address each major comment point by point below, and we will make the necessary revisions to strengthen the paper.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: The accuracy claims (78.27% ESSAM vs. 77.72% PPO) are presented without error bars, number of runs, or statistical tests despite the inherent stochasticity of both ES and RL; this is load-bearing for the 'comparable or superior' assertion and must be addressed with repeated trials and significance testing.

Authors: We agree that statistical rigor is necessary given the stochastic nature of both ES and RL methods. In the revised manuscript, we will report results from 5 independent runs per experiment, including means and standard deviations. We will also add paired t-tests to compare ESSAM against PPO and GRPO, thereby substantiating the comparability claims with appropriate error bars and significance testing. revision: yes
Referee: [Method] Method section: No population size, sampling strategy (e.g., antithetic pairs), or variance-reduction details are given for the ES estimator in ~10^9-dimensional space; standard ES theory shows variance scales with dimension, so the reported near-parity performance and low memory usage cannot be evaluated without these parameters.

Authors: We acknowledge the need for these implementation details. The ESSAM implementation uses a population size of 64 with antithetic sampling for variance reduction and a perturbation scale of 0.01. We will expand the Method section with a dedicated subsection describing the full ES estimator, including population size, sampling strategy, and variance-reduction techniques, to allow proper evaluation in high-dimensional spaces. revision: yes
Referee: [Experiments] Experiments section: The accelerated variant is claimed to deliver ~2× speedup at identical memory and 78.02% accuracy, yet the acceleration mechanism and its effect on estimator variance are unspecified; this detail is required to substantiate the speedup claim.

Authors: We will add a precise description of the acceleration mechanism (which reduces the number of perturbations per update via an efficient sampling schedule) in the revised Experiments section. This will include an analysis of its effect on estimator variance and why accuracy remains comparable, thereby substantiating the reported ~2× speedup at unchanged memory cost. revision: yes
Referee: [Generalization experiments] Generalization experiments: Superior results on 5/6 datasets are asserted after GSM8K fine-tuning, but the evaluation protocol, dataset identities, and whether any hyperparameter tuning occurred on the test sets are omitted; these omissions prevent assessment of the generalization benefit.

Authors: We will revise the relevant section to explicitly name the six datasets, describe the evaluation protocol (zero-shot greedy decoding), and confirm that hyperparameters were tuned solely on a GSM8K validation split with no test-set access. This will enable a transparent assessment of the reported generalization improvements. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical method and results

full rationale

The paper introduces ESSAM by combining standard zero-order Evolution Strategies with Sharpness-Aware Maximization for LLM fine-tuning and reports direct empirical measurements: accuracy on GSM8K (78.27% average, comparable to PPO at 77.72% and GRPO at 78.34%), generalization on 5/6 external datasets, and GPU memory reductions (18× vs PPO, 10× vs GRPO). No equations or derivation steps reduce any claimed prediction to fitted parameters, self-definitions, or self-citation chains; the central claims rest on measured outcomes against independent baselines rather than constructed equivalences. The accelerated variant is described as an implementation detail without altering the empirical reporting structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard assumptions from evolution strategies and sharpness-aware optimization without introducing new postulates.

pith-pipeline@v0.9.0 · 5619 in / 1053 out tokens · 44085 ms · 2026-05-16T08:20:55.566684+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 9 internal anchors

[1]

Training Verifiers to Solve Math Word Problems

URL https://arxiv. org/abs/2110.14168. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

RLHF Workflow: From Reward Modeling to Online RLHF

URLhttps://arxiv.org/abs/2405.07863. Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412,

work page internal anchor Pith review arXiv 2010
[3]

Zero- order sharpness-aware minimization.arXiv preprint arXiv:2511.09156,

Fu, Y ., Jin, Y ., Zhang, C., Liu, J., and Ye, H. Zero- order sharpness-aware minimization.arXiv preprint arXiv:2511.09156,

work page arXiv
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

F., Liang, Q., Meyerson, E., Hodjat, B., and Miikkulainen, R

Qiu, X., Gan, Y ., Hayes, C. F., Liang, Q., Meyerson, E., Hodjat, B., and Miikkulainen, R. Evolution strategies at scale: Llm fine-tuning beyond reinforcement learning. arXiv preprint arXiv:2509.24372,

work page arXiv
[6]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Sharpness-aware black-box optimization.arXiv preprint arXiv:2410.12457,

Ye, F., Lyu, Y ., Wang, X., Sugiyama, M., Zhang, Y ., and Tsang, I. Sharpness-aware black-box optimization.arXiv preprint arXiv:2410.12457,

work page arXiv
[10]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y ., Zuo, X., Yue, Y ., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y ., Men, R., Yang, A., et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Improving sharpness-aware minimization with fisher mask for better generalization on language models

Zhong, Q., Ding, L., Shen, L., Mi, P., Liu, J., Du, B., and Tao, D. Improving sharpness-aware minimization with fisher mask for better generalization on language models. arXiv preprint arXiv:2210.05497,

work page arXiv
[13]

Fine-Tuning Language Models from Human Preferences

9 Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,

work page internal anchor Pith review Pith/arXiv arXiv 1909