arxiv: 2604.27637 · v1 · submitted 2026-04-30 · 💻 cs.AI

Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading

Pith reviewed 2026-05-07 05:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM evaluation · prompt optimization · model ranking · benchmarks · large language models · evaluation frameworks · static prompts

The pith

Optimizing prompts per model alters LLM benchmark rankings

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model evaluations typically apply the same fixed prompt to every model. In contrast, real-world use often involves optimizing the prompt separately for each model to improve performance. This paper demonstrates that such optimization significantly alters the relative rankings of models on both public and private benchmarks. As a result, evaluations without per-model optimization may lead practitioners to select suboptimal models for their applications. The findings underscore the need to incorporate prompt optimization into standard evaluation practices.

Core claim

The central finding is that optimizing the prompt separately for each model substantially changes the final model ranking in LLM evaluations, on both public academic and internal industry benchmarks, compared with the common practice of applying one static prompt to all models.

What carries the argument

A comparison of benchmark performance and the resulting model rankings under two conditions: a static prompt template applied uniformly to every model, and a prompt optimized separately for each model.
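
A minimal sketch of that comparison, assuming each model has one aggregate score per condition. The function names and the scores are hypothetical and invented for illustration; they are not taken from the paper.

```python
from typing import Dict


def rank_models(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score (1 = best); ties are broken by model name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_changes(static: Dict[str, float], optimized: Dict[str, float]) -> Dict[str, int]:
    """Positive value = the model moved up after per-model prompt optimization."""
    before = rank_models(static)
    after = rank_models(optimized)
    return {model: before[model] - after[model] for model in static}


# Hypothetical scores, invented for illustration only.
static_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.68}
optimized_scores = {"model_a": 0.74, "model_b": 0.79, "model_c": 0.72}

changes = rank_changes(static_scores, optimized_scores)
print(changes)
```

In this toy case every model improves in absolute terms, yet the model ranked last under the static prompt ends up first, which is the kind of reordering the paper is concerned with.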

If this is right

  • Evaluations using unoptimized prompts may not reflect true model capabilities in optimized deployments.
  • Industry practitioners need to perform prompt optimization during model selection to identify the best performer for their task.
  • Benchmark results could vary substantially depending on the optimization method used.
  • Standard evaluation frameworks should include per-model prompt tuning steps to ensure accurate comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that current academic benchmarks may undervalue models that benefit most from prompt engineering.
  • Future evaluation protocols might need to standardize optimization procedures to ensure fairness across models.
  • It raises questions about how other adaptation techniques, like fine-tuning, interact with evaluation rankings.

Load-bearing premise

The prompt optimization methods applied are representative of industry practice and do not introduce their own biases or overfit to the specific benchmarks used.
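
One way to guard against the overfitting risk named in this premise is to choose each model's prompt on a validation split and touch the test split only once. The sketch below illustrates that separation under stated assumptions: `toy_model` is a hypothetical stand-in for an LLM call, the candidate prompts and data are invented, and exact-match accuracy is used as a placeholder metric. It is not the authors' optimization procedure.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, expected answer)


def accuracy(model: Callable[[str], str], prompt: str, data: List[Example]) -> float:
    """Fraction of examples whose model output exactly matches the expected answer."""
    hits = sum(1 for question, answer in data
               if model(f"{prompt}\n{question}").strip() == answer)
    return hits / len(data)


def select_prompt(model: Callable[[str], str], candidates: List[str],
                  validation: List[Example]) -> str:
    """Pick the candidate with the best validation accuracy (first one wins ties)."""
    return max(candidates, key=lambda p: accuracy(model, p, validation))


def toy_model(text: str) -> str:
    """Hypothetical stand-in for an LLM: adds two integers from the last line."""
    question = text.splitlines()[-1]
    a, b = (int(token) for token in question.split("+"))
    total = str(a + b)
    # This toy model only returns a bare number when the prompt asks for one.
    return total if "only the number" in text else f"The answer is {total}."


# Invented data; the validation and test splits are disjoint.
validation = [("1+2", "3"), ("4+5", "9")]
test = [("2+2", "4"), ("10+7", "17")]
candidates = ["Solve the sum.", "Solve the sum. Reply with only the number."]

best = select_prompt(toy_model, candidates, validation)  # uses validation only
test_accuracy = accuracy(toy_model, best, test)          # test is scored once
print(best, test_accuracy)
```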

What would settle it

Replicating the experiments with different prompt optimization techniques or additional benchmarks and finding that model rankings remain consistent regardless of optimization.

Figures

Figures reproduced from arXiv: 2604.27637 by Atin Ghosh, Daniel Dahlmeier, Nicholas Sadjoli, Tim Siefken, Yifan Mai.

Figure 1: Rank changes across all models for datasets after instruction-only PO.
Figure 2: Rank changes across all models for open-source datasets after instruction-with-exemplar PO.
Figure 3: Performance and ranking values for all models, before and after instruction-only PO on tested datasets:
Figure 4: Example of instruction-following improvement after instruction-only PO on the EDDE dataset - Model B
Figure 5: Performance and ranking changes after applying instruction-with-exemplar optimization.
Figure 6: Heatmap of performance changes across all
Original abstract

Current Large Language Model (LLM) evaluation frameworks utilize the same static prompt template across all models under evaluation. This differs from the common industry practice of using prompt optimization (PO) techniques to optimize the prompt for each model to maximize application performance. In this paper, we investigate the effect of PO towards LLM evaluations. Our results on public academic and internal industry benchmarks show that PO greatly affects the final ranking of models. This highlights the importance of practitioners performing PO per model when conducting evaluations to choose the best model for a given task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that standard LLM evaluation frameworks apply the same static prompt template to all models, in contrast to common industry practice of using prompt optimization (PO) techniques tailored per model. Experiments on public academic benchmarks and internal industry benchmarks are reported to show that applying PO changes model rankings substantially, implying that evaluations without per-model optimization can be misleading for selecting the best model for a task.

Significance. If the empirical findings hold after the methodological gaps are addressed, the work would highlight an important misalignment between academic benchmarking practices and real-world LLM deployment, where prompt tuning is routine. This could influence how future evaluations are designed to better reflect application performance. The inclusion of both public and internal benchmarks strengthens the potential relevance, though the current lack of methodological detail limits the work's immediate utility.

major comments (2)
  1. [Methodology] Methodology section: The prompt optimization procedures are not described in sufficient detail, including the specific algorithms employed, hyperparameters, number of optimization steps, and crucially whether optimization was performed using a held-out validation split separate from the evaluation benchmarks. This information is load-bearing for the central claim, as optimization on test data or use of non-representative methods could artifactually produce ranking shifts rather than demonstrate a general issue with static prompts.
  2. [Results] Results section (and associated tables/figures): No effect sizes, statistical tests for ranking changes, confidence intervals, or controls for confounding factors such as prompt length, format, or token count are reported. Without these, the assertion that PO 'greatly affects' final rankings on public and internal benchmarks cannot be properly evaluated for robustness or practical significance.
minor comments (2)
  1. [Abstract] The abstract would benefit from quantifying the ranking changes (e.g., how many models shift positions and by how much) to give readers a clearer sense of the effect magnitude.
  2. Consider adding a limitations or discussion subsection addressing whether the chosen PO techniques are representative of production workflows and any risks of benchmark overfitting.
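
A common way to produce the confidence intervals requested in major comment 2 is a percentile bootstrap over per-sample scores. The following is a minimal sketch under that assumption; the function name, its parameters, and the data are hypothetical, and it does not describe what the authors actually ran.

```python
import random
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the evaluation set with replacement and record the mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sample correctness (1.0 = correct), invented for illustration.
per_sample = [1.0] * 70 + [0.0] * 30
low, high = bootstrap_ci(per_sample)
print(f"accuracy 0.70, 95% CI [{low:.2f}, {high:.2f}]")
```

If the intervals of two models overlap heavily both before and after optimization, a change in their relative rank is weak evidence on its own, which is why the referee asks for these numbers alongside the rankings.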

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving methodological transparency and statistical rigor. We have revised the manuscript to address these points directly while preserving the core empirical findings on how per-model prompt optimization affects LLM rankings.

Point-by-point responses
  1. Referee: [Methodology] Methodology section: The prompt optimization procedures are not described in sufficient detail, including the specific algorithms employed, hyperparameters, number of optimization steps, and crucially whether optimization was performed using a held-out validation split separate from the evaluation benchmarks. This information is load-bearing for the central claim, as optimization on test data or use of non-representative methods could artifactually produce ranking shifts rather than demonstrate a general issue with static prompts.

    Authors: We agree that the original manuscript provided insufficient detail on the prompt optimization procedure. In the revised version, the Methodology section now fully specifies the algorithm (a discrete, gradient-free search over prompt templates and few-shot examples), all hyperparameters, the exact number of optimization steps per model, and the optimization objective. Crucially, we confirm and document that optimization was performed exclusively on a held-out validation split that is completely disjoint from the public and internal evaluation benchmarks; no test data was used during optimization. This separation ensures the observed ranking shifts reflect genuine differences in model-prompt compatibility rather than overfitting artifacts. revision: yes

  2. Referee: [Results] Results section (and associated tables/figures): No effect sizes, statistical tests for ranking changes, confidence intervals, or controls for confounding factors such as prompt length, format, or token count are reported. Without these, the assertion that PO 'greatly affects' final rankings on public and internal benchmarks cannot be properly evaluated for robustness or practical significance.

    Authors: We accept that the original results section lacked quantitative measures of effect magnitude and statistical support. The revised manuscript now includes: (1) effect sizes for ranking changes (Kendall tau distance between optimized and unoptimized rankings), (2) statistical tests (Wilcoxon signed-rank tests on per-model performance deltas with p-values and effect sizes), (3) bootstrap confidence intervals on all reported accuracies, and (4) explicit controls and ablation analyses for prompt length, format, and token count. These additions show that the ranking reversals remain statistically significant and are not explained by the controlled factors, thereby strengthening the claim that unoptimized evaluations can be misleading. revision: yes
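
The Kendall tau distance mentioned in the second response counts the model pairs whose relative order differs between two rankings. The following is a minimal sketch of that quantity; the function names and example rankings are hypothetical, and the normalization by the number of pairs is one common convention rather than something the rebuttal specifies.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_distance(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> int:
    """Number of model pairs that the two rankings order differently."""
    if sorted(ranking_a) != sorted(ranking_b) or len(set(ranking_a)) != len(ranking_a):
        raise ValueError("rankings must contain the same models, each exactly once")
    pos_a = {model: i for i, model in enumerate(ranking_a)}
    pos_b = {model: i for i, model in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        # A pair is discordant when its order flips between the two rankings.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def normalized_kendall_tau_distance(ranking_a: Sequence[str],
                                    ranking_b: Sequence[str]) -> float:
    """Distance divided by the number of pairs: 0 = identical, 1 = fully reversed."""
    n = len(ranking_a)
    pairs = n * (n - 1) // 2
    return kendall_tau_distance(ranking_a, ranking_b) / pairs if pairs else 0.0


# Hypothetical rankings (best model first), invented for illustration.
static_ranking = ["model_a", "model_c", "model_b"]
optimized_ranking = ["model_b", "model_a", "model_c"]

distance = kendall_tau_distance(static_ranking, optimized_ranking)
print(distance, normalized_kendall_tau_distance(static_ranking, optimized_ranking))
```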

Circularity Check

0 steps flagged

No circularity: empirical comparison of optimized vs. static prompts

Full rationale

The paper reports direct experimental results showing that applying prompt optimization changes model rankings on academic and industry benchmarks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are present in the provided text. The central claim rests on straightforward before/after comparisons rather than any self-definitional structure, ansatz smuggling, or uniqueness theorem imported from prior work. This is a standard empirical finding with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper draws on existing prompt optimization literature and standard benchmarks without introducing new free parameters, axioms, or invented entities in the abstract. No new mathematical constructs or unverified assumptions are stated.

pith-pipeline@v0.9.0 · 5390 in / 1001 out tokens · 35918 ms · 2026-05-07T05:57:34.980894+00:00 · methodology
