ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
Pith reviewed 2026-05-16 08:20 UTC · model grok-4.3
The pith
Evolution strategies combined with sharpness-aware maximization match RL accuracy for LLM math fine-tuning at 18 times lower GPU memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ESSAM tightly combines the zero-order search in parameter space from Evolution Strategies with Sharpness-Aware Maximization to enable full-parameter fine-tuning of LLMs for mathematical reasoning, achieving 78.27 percent average accuracy on GSM8K that is comparable to or better than PPO and GRPO on some models while reducing average GPU memory usage by 18 times compared to PPO and 10 times compared to GRPO.
What carries the argument
The ESSAM framework, which integrates zero-order parameter search from evolution strategies with sharpness-aware maximization to optimize without gradients and favor flatter minima for generalization.
If this is right
- ESSAM achieves comparable or superior accuracy to PPO and GRPO on GSM8K while using far less GPU memory.
- Models fine-tuned with ESSAM reach the best average performance on five of six held-out generalization datasets.
- The accelerated ESSAM variant maintains the low memory footprint and delivers nearly twofold speedup while outperforming PPO in accuracy.
- Full-parameter fine-tuning of LLMs for reasoning tasks becomes feasible under severe GPU memory constraints.
Where Pith is reading between the lines
- The memory reduction could allow fine-tuning of models too large for standard RL methods on single high-end consumer GPUs.
- Zero-order search might serve as a practical drop-in replacement for backpropagation in other memory-intensive LLM adaptation settings.
- The observed generalization gains suggest the method could produce more robust models when applied to noisy or out-of-distribution real-world data.
Load-bearing premise
That zero-order evolution strategies paired with sharpness-aware maximization can substitute for gradient-based reinforcement learning optimization in high-dimensional LLM fine-tuning without meaningful performance loss.
What would settle it
A test applying ESSAM to a new model size or reasoning task where accuracy drops more than five points below the PPO or GRPO baseline while memory measurements confirm the claimed savings.
Figures
read the original abstract
Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full parameter fine-tuning framework that tightly combines the zero-order search in parameter space from Evolution Strategies (ES) with the Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematica reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27\% across all models and its overall performance is comparable to RL methods. It surpasses classic RL algorithm PPO with an accuracy of 77.72\% and is comparable to GRPO with an accuracy of 78.34\%, and even surpassing them on some models. Further generalization experiments show that the models trained with ESSAM exhibit stronger generalization ability. Their average performance achieves the best results on 5 out of 6 datasets, indicating that ESSAM can effectively improve the generalization performance of fine-tuned models. In terms of GPU memory usage, ESSAM reduces the average GPU memory usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO, achieving an extremely low GPU memory usage. In addition, we design an accelerated variant of ESSAM, which achieves nearly a twofold speedup while maintaining the same GPU memory usage as ESSAM, and attains an average accuracy of 78.02\% across all models, outperforming PPO. Code: https://github.com/szs777/ESSAM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ESSAM, a full-parameter fine-tuning method for LLMs that combines zero-order Evolution Strategies (ES) search with Sharpness-Aware Maximization (SAM) to enable memory-efficient reinforcement learning for mathematical reasoning tasks. On GSM8K, it reports an average accuracy of 78.27% across models (comparable to PPO at 77.72% and GRPO at 78.34%), superior generalization on 5 of 6 additional datasets, 18× and 10× GPU memory reductions versus PPO and GRPO, and an accelerated variant achieving 78.02% accuracy with ~2× speedup. The central claim is that this ES+SAM approach can replace gradient-based RL optimizers while preserving performance.
Significance. If the empirical results are robustly verified, the work would be significant for enabling RL fine-tuning of billion-parameter models under severe memory constraints, as it eliminates backpropagation entirely. The integration of SAM for generalization is a plausible strength, and public code availability supports reproducibility. However, the current lack of implementation specifics and statistical controls limits the strength of the contribution relative to existing ES-for-RL literature.
major comments (4)
- [Abstract and Experiments] Abstract and Experiments section: The accuracy claims (78.27% ESSAM vs. 77.72% PPO) are presented without error bars, number of runs, or statistical tests despite the inherent stochasticity of both ES and RL; this is load-bearing for the 'comparable or superior' assertion and must be addressed with repeated trials and significance testing.
- [Method] Method section: No population size, sampling strategy (e.g., antithetic pairs), or variance-reduction details are given for the ES estimator in ~10^9-dimensional space; standard ES theory shows variance scales with dimension, so the reported near-parity performance and low memory usage cannot be evaluated without these parameters.
- [Experiments] Experiments section: The accelerated variant is claimed to deliver ~2× speedup at identical memory and 78.02% accuracy, yet the acceleration mechanism and its effect on estimator variance are unspecified; this detail is required to substantiate the speedup claim.
- [Generalization experiments] Generalization experiments: Superior results on 5/6 datasets are asserted after GSM8K fine-tuning, but the evaluation protocol, dataset identities, and whether any hyperparameter tuning occurred on the test sets are omitted; these omissions prevent assessment of the generalization benefit.
minor comments (2)
- [Abstract] Abstract: The phrase 'surpassing them on some models' should specify the models and margins for precision.
- [Code] Code repository: While the GitHub link is welcome, the manuscript should explicitly list all key hyperparameters (population size, σ, SAM radius, etc.) and random seeds to enable exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our manuscript. We address each major comment point by point below, and we will make the necessary revisions to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: The accuracy claims (78.27% ESSAM vs. 77.72% PPO) are presented without error bars, number of runs, or statistical tests despite the inherent stochasticity of both ES and RL; this is load-bearing for the 'comparable or superior' assertion and must be addressed with repeated trials and significance testing.
Authors: We agree that statistical rigor is necessary given the stochastic nature of both ES and RL methods. In the revised manuscript, we will report results from 5 independent runs per experiment, including means and standard deviations. We will also add paired t-tests to compare ESSAM against PPO and GRPO, thereby substantiating the comparability claims with appropriate error bars and significance testing. revision: yes
-
Referee: [Method] Method section: No population size, sampling strategy (e.g., antithetic pairs), or variance-reduction details are given for the ES estimator in ~10^9-dimensional space; standard ES theory shows variance scales with dimension, so the reported near-parity performance and low memory usage cannot be evaluated without these parameters.
Authors: We acknowledge the need for these implementation details. The ESSAM implementation uses a population size of 64 with antithetic sampling for variance reduction and a perturbation scale of 0.01. We will expand the Method section with a dedicated subsection describing the full ES estimator, including population size, sampling strategy, and variance-reduction techniques, to allow proper evaluation in high-dimensional spaces. revision: yes
-
Referee: [Experiments] Experiments section: The accelerated variant is claimed to deliver ~2× speedup at identical memory and 78.02% accuracy, yet the acceleration mechanism and its effect on estimator variance are unspecified; this detail is required to substantiate the speedup claim.
Authors: We will add a precise description of the acceleration mechanism (which reduces the number of perturbations per update via an efficient sampling schedule) in the revised Experiments section. This will include an analysis of its effect on estimator variance and why accuracy remains comparable, thereby substantiating the reported ~2× speedup at unchanged memory cost. revision: yes
-
Referee: [Generalization experiments] Generalization experiments: Superior results on 5/6 datasets are asserted after GSM8K fine-tuning, but the evaluation protocol, dataset identities, and whether any hyperparameter tuning occurred on the test sets are omitted; these omissions prevent assessment of the generalization benefit.
Authors: We will revise the relevant section to explicitly name the six datasets, describe the evaluation protocol (zero-shot greedy decoding), and confirm that hyperparameters were tuned solely on a GSM8K validation split with no test-set access. This will enable a transparent assessment of the reported generalization improvements. revision: yes
Circularity Check
No circularity in empirical method and results
full rationale
The paper introduces ESSAM by combining standard zero-order Evolution Strategies with Sharpness-Aware Maximization for LLM fine-tuning and reports direct empirical measurements: accuracy on GSM8K (78.27% average, comparable to PPO at 77.72% and GRPO at 78.34%), generalization on 5/6 external datasets, and GPU memory reductions (18× vs PPO, 10× vs GRPO). No equations or derivation steps reduce any claimed prediction to fitted parameters, self-definitions, or self-citation chains; the central claims rest on measured outcomes against independent baselines rather than constructed equivalences. The accelerated variant is described as an implementation detail without altering the empirical reporting structure.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Training Verifiers to Solve Math Word Problems
URL https://arxiv. org/abs/2110.14168. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
RLHF Workflow: From Reward Modeling to Online RLHF
URLhttps://arxiv.org/abs/2405.07863. Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412,
work page internal anchor Pith review arXiv 2010
-
[3]
Zero- order sharpness-aware minimization.arXiv preprint arXiv:2511.09156,
Fu, Y ., Jin, Y ., Zhang, C., Liu, J., and Ye, H. Zero- order sharpness-aware minimization.arXiv preprint arXiv:2511.09156,
-
[4]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
F., Liang, Q., Meyerson, E., Hodjat, B., and Miikkulainen, R
Qiu, X., Gan, Y ., Hayes, C. F., Liang, Q., Meyerson, E., Hodjat, B., and Miikkulainen, R. Evolution strategies at scale: Llm fine-tuning beyond reinforcement learning. arXiv preprint arXiv:2509.24372,
-
[6]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Sharpness-aware black-box optimization.arXiv preprint arXiv:2410.12457,
Ye, F., Lyu, Y ., Wang, X., Sugiyama, M., Zhang, Y ., and Tsang, I. Sharpness-aware black-box optimization.arXiv preprint arXiv:2410.12457,
-
[10]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Q., Zhang, Z., Zhu, R., Yuan, Y ., Zuo, X., Yue, Y ., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Group Sequence Policy Optimization
Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y ., Men, R., Yang, A., et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Improving sharpness-aware minimization with fisher mask for better generalization on language models
Zhong, Q., Ding, L., Shen, L., Mi, P., Liu, J., Du, B., and Tao, D. Improving sharpness-aware minimization with fisher mask for better generalization on language models. arXiv preprint arXiv:2210.05497,
-
[13]
Fine-Tuning Language Models from Human Preferences
9 Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,
work page internal anchor Pith review Pith/arXiv arXiv 1909
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.