SPRIG: Improving Large Language Model Performance by System Prompt Optimization
Pith reviewed 2026-05-23 18:33 UTC · model grok-4.3
The pith
A single optimized system prompt performs as well as task-specific prompts for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages.
What carries the argument
SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios.
Load-bearing premise
Performance measured on the chosen collection of 47 task types is a sufficient proxy for general scenarios and the genetic search does not overfit to the particular evaluation distribution or specific models used during optimization.
What would settle it
Running the optimized system prompt on a fresh collection of tasks or models not seen during the genetic search and measuring whether the performance parity with task-specific prompts still holds.
read the original abstract
Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SPRIG, an edit-based genetic algorithm that iteratively constructs system prompts from prespecified components to maximize LLM performance across general scenarios. It evaluates on a collection of 47 task types and claims that a single optimized system prompt performs on par with per-task optimized prompts, that combining system- and task-level optimizations yields further gains (showing complementarity), and that the resulting prompts generalize across model families, parameter sizes, and languages.
Significance. If the results hold after addressing evaluation concerns, the work would be significant for shifting emphasis from per-task prompt engineering to reusable system-prompt optimization. The reported complementarity and cross-model generalization, if verified with proper controls, would provide practical value for LLM deployment and new directions for prompt research.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The genetic algorithm optimizes the system prompt by maximizing aggregate performance on the identical collection of 47 task types used for all reported comparisons and generalization tests. No held-out tasks, cross-validation split, or separate validation distribution is described. This makes the central claims—that the system prompt performs 'on par' with per-task optimization and generalizes to 'general scenarios'—potentially circular, as the prompt is selected precisely to exploit regularities within the evaluation distribution.
- [Experiments] Experiments section: The paper reports that the optimized system prompt generalizes across model families, sizes, and languages, but provides no quantitative details on whether the genetic search itself was performed on a single model family or whether the fitness evaluations during search used the same models as the final generalization tests. This leaves open whether the reported cross-model results reflect true transfer or merely that the prompt was already tuned to the evaluation models.
minor comments (1)
- [Abstract] The abstract states that prompts are constructed 'from prespecified components' but does not list or characterize those components; a table or appendix enumerating them would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major concern below, indicating where revisions will be made to improve clarity and address potential limitations.
read point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: The genetic algorithm optimizes the system prompt by maximizing aggregate performance on the identical collection of 47 task types used for all reported comparisons and generalization tests. No held-out tasks, cross-validation split, or separate validation distribution is described. This makes the central claims—that the system prompt performs 'on par' with per-task optimization and generalizes to 'general scenarios'—potentially circular, as the prompt is selected precisely to exploit regularities within the evaluation distribution.
Authors: We acknowledge that the genetic search maximized performance on the same collection of 47 tasks used for all evaluations. These tasks were deliberately chosen to span a broad range of general scenarios, and the optimization objective was aggregate performance rather than per-task exploitation. The resulting system prompt matching the performance of individually optimized task prompts provides evidence of its broad utility. That said, the absence of a held-out set or cross-validation does represent a methodological limitation for claims of generalization to entirely new scenarios. We will revise the manuscript to explicitly describe the evaluation setup, acknowledge this limitation, and outline plans for future held-out validation experiments. revision: yes
-
Referee: [Experiments] Experiments section: The paper reports that the optimized system prompt generalizes across model families, sizes, and languages, but provides no quantitative details on whether the genetic search itself was performed on a single model family or whether the fitness evaluations during search used the same models as the final generalization tests. This leaves open whether the reported cross-model results reflect true transfer or merely that the prompt was already tuned to the evaluation models.
Authors: The genetic algorithm was executed with fitness evaluations on a single base model, after which the resulting prompt was evaluated on additional models from different families, sizes, and languages. We will expand the Experiments section to include explicit details on the model(s) used during the search process versus those used in the generalization tests, thereby clarifying the transfer aspect of the results. revision: yes
Circularity Check
No significant circularity; empirical evaluation on optimization distribution is standard but not a definitional reduction.
full rationale
The paper describes an empirical genetic-algorithm procedure that optimizes a system prompt to maximize aggregate performance across a fixed collection of 47 task types and then reports measured performance on that same collection. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claims rest on direct experimental comparisons rather than any derivation that reduces to its own inputs by construction; therefore the work is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
-
Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions
LLMs for smart contract security analysis show lexical bias from identifier names causing high false positives, with prompting creating precision-recall trade-offs, positioning them as complements rather than replacem...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.