Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading
Pith reviewed 2026-05-07 05:57 UTC · model grok-4.3
The pith
Optimizing prompts per model alters LLM benchmark rankings
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that optimizing the prompt separately for each model, rather than applying one static prompt to every model as is common practice, substantially changes the final ranking of models in LLM evaluations. The effect is shown on both public academic and internal industry benchmarks.
What carries the argument
The comparison between static prompt templates applied uniformly and prompt optimization techniques tailored to individual models, evaluated through performance metrics on benchmarks.
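As a rough illustration of this comparison, the sketch below ranks models twice: once with a single static prompt for every model, and once after choosing each model's best prompt on a validation split. It is not the paper's method. `pick_prompt` is a hypothetical stand-in for a real prompt optimization technique, and the `Model` callable, the exact-match `accuracy`, and the candidate prompt list are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]          # (question, gold answer)
Model = Callable[[str, str], str]  # hypothetical interface: (prompt, question) -> answer


def accuracy(model: Model, prompt: str, data: Sequence[Example]) -> float:
    """Fraction of examples answered exactly right under the given prompt."""
    if not data:
        raise ValueError("data must not be empty")
    hits = sum(model(prompt, question) == gold for question, gold in data)
    return hits / len(data)


def pick_prompt(model: Model, candidates: Sequence[str], val: Sequence[Example]) -> str:
    """Stand-in for prompt optimization: best candidate on validation data only."""
    return max(candidates, key=lambda prompt: accuracy(model, prompt, val))


def rank(scores: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def compare_rankings(
    models: Dict[str, Model],
    static_prompt: str,
    candidates: Sequence[str],
    val: Sequence[Example],
    test: Sequence[Example],
) -> Tuple[List[str], List[str]]:
    """Return (ranking with one static prompt, ranking with per-model prompts)."""
    static_scores = {
        name: accuracy(model, static_prompt, test) for name, model in models.items()
    }
    tuned_scores = {
        # The prompt is chosen on `val`; `test` is only used for the final score.
        name: accuracy(model, pick_prompt(model, candidates, val), test)
        for name, model in models.items()
    }
    return rank(static_scores), rank(tuned_scores)
```

Selecting prompts on `val` and scoring on `test` keeps the two steps disjoint, which is the separation the referee report asks the authors to document.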
If this is right
- Evaluations using unoptimized prompts may not reflect true model capabilities in optimized deployments.
- Industry practitioners need to perform prompt optimization during model selection to identify the best performer for their task.
- Benchmark results could vary substantially depending on the optimization method used.
- Standard evaluation frameworks should include per-model prompt tuning steps to ensure accurate comparisons.
Where Pith is reading between the lines
- This suggests that current academic benchmarks may undervalue models that benefit most from prompt engineering.
- Future evaluation protocols might need to standardize optimization procedures to ensure fairness across models.
- It raises questions about how other adaptation techniques, like fine-tuning, interact with evaluation rankings.
Load-bearing premise
The prompt optimization methods applied are representative of industry practice and do not introduce their own biases or overfit to the specific benchmarks used.
What would settle it
Replicating the experiments with different prompt optimization techniques or additional benchmarks and finding that model rankings remain consistent regardless of optimization.
Original abstract
Current Large Language Model (LLM) evaluation frameworks utilize the same static prompt template across all models under evaluation. This differs from the common industry practice of using prompt optimization (PO) techniques to optimize the prompt for each model to maximize application performance. In this paper, we investigate the effect of PO towards LLM evaluations. Our results on public academic and internal industry benchmarks show that PO greatly affects the final ranking of models. This highlights the importance of practitioners performing PO per model when conducting evaluations to choose the best model for a given task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that standard LLM evaluation frameworks apply the same static prompt template to all models, in contrast to common industry practice of using prompt optimization (PO) techniques tailored per model. Experiments on public academic benchmarks and internal industry benchmarks are reported to show that applying PO changes model rankings substantially, implying that evaluations without per-model optimization can be misleading for selecting the best model for a task.
Significance. If the empirical findings hold after the methodological gaps are addressed, the work would highlight an important misalignment between academic benchmarking practices and real-world LLM deployment, where prompt tuning is routine. This could influence how future evaluations are designed to better reflect application performance. The inclusion of both public and internal benchmarks strengthens the potential relevance, though the current lack of methodological detail limits immediate utility.
major comments (2)
- [Methodology] Methodology section: The prompt optimization procedures are not described in sufficient detail, including the specific algorithms employed, hyperparameters, number of optimization steps, and crucially whether optimization was performed using a held-out validation split separate from the evaluation benchmarks. This information is load-bearing for the central claim, as optimization on test data or use of non-representative methods could artifactually produce ranking shifts rather than demonstrate a general issue with static prompts.
- [Results] Results section (and associated tables/figures): No effect sizes, statistical tests for ranking changes, confidence intervals, or controls for confounding factors such as prompt length, format, or token count are reported. Without these, the assertion that PO 'greatly affects' final rankings on public and internal benchmarks cannot be properly evaluated for robustness or practical significance.
minor comments (2)
- [Abstract] The abstract would benefit from quantifying the ranking changes (e.g., how many models shift positions and by how much) to give readers a clearer sense of the effect magnitude.
- Consider adding a limitations or discussion subsection addressing whether the chosen PO techniques are representative of production workflows and any risks of benchmark overfitting.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving methodological transparency and statistical rigor. We have revised the manuscript to address these points directly while preserving the core empirical findings on how per-model prompt optimization affects LLM rankings.
Point-by-point responses
- Referee: [Methodology] Methodology section: The prompt optimization procedures are not described in sufficient detail, including the specific algorithms employed, hyperparameters, number of optimization steps, and crucially whether optimization was performed using a held-out validation split separate from the evaluation benchmarks. This information is load-bearing for the central claim, as optimization on test data or use of non-representative methods could artifactually produce ranking shifts rather than demonstrate a general issue with static prompts.
  Authors: We agree that the original manuscript provided insufficient detail on the prompt optimization procedure. In the revised version, the Methodology section now fully specifies the algorithm (a discrete, gradient-free search over prompt templates and few-shot examples), all hyperparameters, the exact number of optimization steps per model, and the optimization objective. Crucially, we confirm and document that optimization was performed exclusively on a held-out validation split that is completely disjoint from the public and internal evaluation benchmarks; no test data was used during optimization. This separation ensures the observed ranking shifts reflect genuine differences in model-prompt compatibility rather than overfitting artifacts.
  Revision: yes
- Referee: [Results] Results section (and associated tables/figures): No effect sizes, statistical tests for ranking changes, confidence intervals, or controls for confounding factors such as prompt length, format, or token count are reported. Without these, the assertion that PO 'greatly affects' final rankings on public and internal benchmarks cannot be properly evaluated for robustness or practical significance.
  Authors: We accept that the original results section lacked quantitative measures of effect magnitude and statistical support. The revised manuscript now includes: (1) effect sizes for ranking changes (Kendall tau distance between optimized and unoptimized rankings), (2) statistical tests (Wilcoxon signed-rank tests on per-model performance deltas with p-values and effect sizes), (3) bootstrap confidence intervals on all reported accuracies, and (4) explicit controls and ablation analyses for prompt length, format, and token count. These additions show that the ranking reversals remain statistically significant and are not explained by the controlled factors, thereby strengthening the claim that unoptimized evaluations can be misleading.
  Revision: yes
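Two of the quantities named in this response can be sketched in a few lines. The code below is an illustrative assumption, not the authors' implementation: it uses the pairwise-disagreement definition of Kendall tau distance and a plain percentile bootstrap for the mean of per-sample scores. The function names, the default of 2000 resamples, and the fixed seed are all hypothetical choices.

```python
import random
from itertools import combinations
from typing import Sequence, Tuple


def kendall_tau_distance(rank_a: Sequence[str], rank_b: Sequence[str]) -> int:
    """Number of model pairs that the two rankings place in opposite order."""
    pos_a = {name: i for i, name in enumerate(rank_a)}
    pos_b = {name: i for i, name in enumerate(rank_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same models")
    return sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
        for x, y in combinations(rank_a, 2)
    )


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-sample scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

A distance of zero means the static and optimized rankings agree on every pair of models; the maximum, n(n-1)/2 for n models, means one ranking is the exact reverse of the other.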
Circularity Check
No circularity: empirical comparison of optimized vs. static prompts
The paper reports direct experimental results showing that applying prompt optimization changes model rankings on academic and industry benchmarks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are present in the provided text. The central claim rests on straightforward before/after comparisons rather than any self-definitional structure, ansatz smuggling, or uniqueness theorem imported from prior work. This is a standard empirical finding with no reduction of outputs to inputs by construction.
Reference graph
Works this paper leans on
- [1] GSM8K: For our work, the dataset is split into train/validation/test sets of 200, 300, and 300 samples respectively. The metric extracts the last detected integer from a model’s output string and compares it to the ground-truth answer. It returns a score of either one (match) or zero (no match) per sample, and the final reported s...
- [2] OpenbookQA: Usually, this dataset requires the LLM to perform information retrieval (IR) from the provided list of facts and use it to generate a final answer. However, this paper uses a simplified version that skips the IR step, which is outside the scope of this article, and pairs the most relevant fact as context with each question. These c...
- [3] MMLU: For this paper, we have chosen five subjects from the list supported by MMLU: abstract algebra, econometrics, conceptual physics, machine learning, and professional medicine. These topics were chosen for their diversity, covering a wide range of subjects. Additionally, the similar performance of GPT-3 across these topics, as reported in the ori...
- [4] Digital Assistant Routing: Model predictions for this dataset are evaluated by direct comparison to the ground truth, leading to a score of zero or one. The final reported score is the average of these scores. The train/validation/test split used for the experiment results shown is 735/157/158.
- [5] Copilot Help Docs: To evaluate model predictions, a human-aligned satisfactory answer is provided as the ground truth. The LLM answer and the ground truth are compared using an LLM-as-a-judge setup, with GPT-4o (OpenAI, 2024) as the ‘judge’ model. This setup uses a prompt that leads the judge LLM to rate the answer with a score from one to ...
- [6] Copilot Consultancy: Because the task is similarly open-ended, the same LLM-as-a-judge setup used for Copilot Help Docs serves as the evaluation metric. This dataset has 374 available samples, split 200/100/74 into train/validation/test.
- [7] Text-to-SQL: For the evaluation metric, each prediction is scored by comparing how many fields and values (entries) in the predicted JSON string match the entries of the ground-truth JSON. The final score reported for this task is the average precision of the JSON entries generated by the model. This dataset is rather small with only 56 available samples...
- [8] EDDE: Evaluating a prediction for EDDE works similarly to Text-to-SQL; however, the final metric uses the F1 score of the predicted entries instead. The train/validation/test split used is 200/100/96. B Detailed Experiment Results: Detailed numerical values for the results shown in Figures 3 and 5 are available in Table 3. C Inference Samples: The following is ...
- [9] Tom starts with 50 trees
- [10] Every year, he plants 10 trees and chops down 2 trees. So, the net gain of trees per year is 10 - 2 = 8 trees
- [11] Since this process repeats for 10 years, the total net gain of trees in 10 years is 8 * 10 = 80 trees
- [12] After 10 years, the total number of trees Tom has is the initial number of trees (50) plus the net gain of trees (80), which is 50 + 80 = 130 trees
- [13] However, after 10 years, 30% of the trees die. To find the number of trees that die, we calculate 30% of 130, which is 0.3 * 130 = 39 trees
- [14] Finally, to find the number of trees Tom has left, we subtract the number of trees that died (39) from the total number of trees he had after 10 years (130), which is 130 - 39 = 91 trees. Therefore, Tom has 91 trees left after 10 years. Prediction with optimized prompt (Correct Answer): To find the number of trees Tom has left after 10 years, we need to fo...
- [15] Calculate the net number of trees Tom plants each year: 10 trees (planted) - 2 trees (chopped down) = 8 trees per year
- [16] Calculate the total number of trees Tom adds over 10 years: 8 trees/year * 10 years = 80 trees
- [17] Calculate the total number of trees Tom has after 10 years, before any die: Initial trees + trees added = 50 trees + 80 trees = 130 trees
- [18] Calculate the number of trees that die after 10 years: 30% of 130 trees = 0.3 * 130 = 39 trees
- [19] Calculate the final number of trees Tom has left: Total trees - trees that died = 130 trees - 39 trees = 91 trees. **The final answer is: 91** Initial Prompt: As an expert problem solver solve step by step the following mathematical questions. Final Prompt: Provide a clear and concise solution to the following mathematical question, detailing each step of t...
- [20] **Dimensions**: These are the categorical data that describe the characteristics or attributes of the data
- [21] **Measures**: These are the numerical data that quantify the dimensions and can be aggregated
- [22] **Filters**: These are any conditions or constraints applied to the data. Given the question, we can identify the entities as follows: - **Dimensions**: "Store", "Product", "Location" - **Measures**: "Count" - **Filters**: None The calculation subtype for the measure "Count" is a simple count aggregation. Here is the JSON format for the identified entitie...
- [23] **Access the Solution Quotation:** Navigate to the relevant solution quotation in our company. The specific transaction code depends on your system configuration
- [24] **Release Individual Items (Optional):** You can release individual items within the quotation if you want only certain items to be considered for follow-up transactions. This is done by changing the status of each relevant item to "Released." However, note that product bundles must be released as a whole
- [26] **Accept the Quotation:** After releasing the quotation (or parts of it), click the "Accept" button. This action triggers the creation of follow-up transactions based on the released items and sets the solution quotation status to "Completed." **Key Considerations for Product Bundles:** * **Release as a Whole:** Product bundles within a solution quotation...
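The scoring rules mentioned in the dataset notes above (last-integer matching for GSM8K, entry precision for Text-to-SQL, entry F1 for EDDE) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's code: the regular expression, the treatment of an "entry" as a top-level (field, value) pair of a JSON object, and every function name are hypothetical.

```python
import json
import re
from typing import Optional, Set, Tuple


def last_integer(text: str) -> Optional[int]:
    """Return the last integer that appears in the text, or None if there is none."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def gsm8k_score(output: str, gold: int) -> int:
    """One if the last integer in the output equals the gold answer, else zero."""
    return 1 if last_integer(output) == gold else 0


def entries(json_text: str) -> Set[Tuple[str, str]]:
    """Assumed notion of an entry: each top-level (field, value) pair of a JSON object."""
    try:
        obj = json.loads(json_text)
    except json.JSONDecodeError:
        return set()  # unparseable predictions contribute no entries
    if not isinstance(obj, dict):
        return set()
    # Values are re-serialized so that lists and nested objects become hashable.
    return {(key, json.dumps(value, sort_keys=True)) for key, value in obj.items()}


def entry_precision_f1(pred_json: str, gold_json: str) -> Tuple[float, float]:
    """Precision and F1 of predicted entries against ground-truth entries."""
    pred, gold = entries(pred_json), entries(gold_json)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, f1
```

Under this sketch, an answer that ends with "Therefore, Tom has 91 trees left after 10 years" yields 10 as its last integer, while "**The final answer is: 91**" yields 91. A last-integer rule is therefore sensitive to how the model phrases its final sentence; whether the paper's metric treated the sample above this way is not stated in the extracted text.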