SPRIG: Improving Large Language Model Performance by System Prompt Optimization

David Jurgens; Lajanugen Logeswaran; Lechen Zhang; Moontae Lee; Tolga Ergen

arxiv: 2410.14826 · v3 · submitted 2024-10-18 · 💻 cs.CL · cs.AI· cs.HC· cs.LG

SPRIG: Improving Large Language Model Performance by System Prompt Optimization

Lechen Zhang , Tolga Ergen , Lajanugen Logeswaran , Moontae Lee , David Jurgens This is my paper

Pith reviewed 2026-05-23 18:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HCcs.LG

keywords system prompt optimizationgenetic algorithmLLM promptingprompt engineeringgeneral instructions

0 comments

The pith

A single optimized system prompt performs as well as task-specific prompts for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SPRIG, an edit-based genetic algorithm that builds system prompts from fixed components to improve LLM performance across many scenarios. It demonstrates that one such general prompt can match the results of prompts optimized separately for each of 47 task types. Combining the system-level prompt with task-level ones yields further gains, showing the two approaches complement each other. The optimized system prompts also transfer across model families, sizes, and languages.

Core claim

A single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages.

What carries the argument

SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios.

Load-bearing premise

Performance measured on the chosen collection of 47 task types is a sufficient proxy for general scenarios and the genetic search does not overfit to the particular evaluation distribution or specific models used during optimization.

What would settle it

Running the optimized system prompt on a fresh collection of tasks or models not seen during the genetic search and measuring whether the performance parity with task-specific prompts still holds.

read the original abstract

Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SPRIG, an edit-based genetic algorithm that iteratively constructs system prompts from prespecified components to maximize LLM performance across general scenarios. It evaluates on a collection of 47 task types and claims that a single optimized system prompt performs on par with per-task optimized prompts, that combining system- and task-level optimizations yields further gains (showing complementarity), and that the resulting prompts generalize across model families, parameter sizes, and languages.

Significance. If the results hold after addressing evaluation concerns, the work would be significant for shifting emphasis from per-task prompt engineering to reusable system-prompt optimization. The reported complementarity and cross-model generalization, if verified with proper controls, would provide practical value for LLM deployment and new directions for prompt research.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: The genetic algorithm optimizes the system prompt by maximizing aggregate performance on the identical collection of 47 task types used for all reported comparisons and generalization tests. No held-out tasks, cross-validation split, or separate validation distribution is described. This makes the central claims—that the system prompt performs 'on par' with per-task optimization and generalizes to 'general scenarios'—potentially circular, as the prompt is selected precisely to exploit regularities within the evaluation distribution.
[Experiments] Experiments section: The paper reports that the optimized system prompt generalizes across model families, sizes, and languages, but provides no quantitative details on whether the genetic search itself was performed on a single model family or whether the fitness evaluations during search used the same models as the final generalization tests. This leaves open whether the reported cross-model results reflect true transfer or merely that the prompt was already tuned to the evaluation models.

minor comments (1)

[Abstract] The abstract states that prompts are constructed 'from prespecified components' but does not list or characterize those components; a table or appendix enumerating them would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major concern below, indicating where revisions will be made to improve clarity and address potential limitations.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: The genetic algorithm optimizes the system prompt by maximizing aggregate performance on the identical collection of 47 task types used for all reported comparisons and generalization tests. No held-out tasks, cross-validation split, or separate validation distribution is described. This makes the central claims—that the system prompt performs 'on par' with per-task optimization and generalizes to 'general scenarios'—potentially circular, as the prompt is selected precisely to exploit regularities within the evaluation distribution.

Authors: We acknowledge that the genetic search maximized performance on the same collection of 47 tasks used for all evaluations. These tasks were deliberately chosen to span a broad range of general scenarios, and the optimization objective was aggregate performance rather than per-task exploitation. The resulting system prompt matching the performance of individually optimized task prompts provides evidence of its broad utility. That said, the absence of a held-out set or cross-validation does represent a methodological limitation for claims of generalization to entirely new scenarios. We will revise the manuscript to explicitly describe the evaluation setup, acknowledge this limitation, and outline plans for future held-out validation experiments. revision: yes
Referee: [Experiments] Experiments section: The paper reports that the optimized system prompt generalizes across model families, sizes, and languages, but provides no quantitative details on whether the genetic search itself was performed on a single model family or whether the fitness evaluations during search used the same models as the final generalization tests. This leaves open whether the reported cross-model results reflect true transfer or merely that the prompt was already tuned to the evaluation models.

Authors: The genetic algorithm was executed with fitness evaluations on a single base model, after which the resulting prompt was evaluated on additional models from different families, sizes, and languages. We will expand the Experiments section to include explicit details on the model(s) used during the search process versus those used in the generalization tests, thereby clarifying the transfer aspect of the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on optimization distribution is standard but not a definitional reduction.

full rationale

The paper describes an empirical genetic-algorithm procedure that optimizes a system prompt to maximize aggregate performance across a fixed collection of 47 task types and then reports measured performance on that same collection. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claims rest on direct experimental comparisons rather than any derivation that reduces to its own inputs by construction; therefore the work is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard evolutionary-algorithm machinery and that the 47-task collection is representative.

pith-pipeline@v0.9.0 · 5720 in / 1133 out tokens · 40742 ms · 2026-05-23T18:33:22.008062+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
cs.LG 2026-04 unverdicted novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions
cs.CR 2026-05 unverdicted novelty 5.0

LLMs for smart contract security analysis show lexical bias from identifier names causing high false positives, with prompting creating precision-recall trade-offs, positioning them as complements rather than replacem...