PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 13:53 UTC · model grok-4.3
The pith
PEEM uses a nine-criteria rubric scored by an LLM to jointly evaluate prompts and responses, aligning with standard accuracy and enabling up to 11.7-point accuracy gains via feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a structured nine-axis rubric evaluated by an LLM produces scores and rationales that accurately reflect prompt and response quality, correlate strongly with conventional accuracy metrics, and can be used directly as feedback to optimize prompts in a zero-shot setting, achieving substantial improvements over baselines.
What carries the argument
The PEEM nine-axis rubric: three prompt criteria (clarity/structure, linguistic quality, fairness) and six response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), each scored 1-5 by an LLM evaluator that also produces criterion-specific rationales.
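A minimal sketch of how such a rubric and its outputs might be represented, assuming a generic judge callable that returns a score and rationale per criterion; the names, prompt wording, and interface below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a PEEM-style rubric and evaluator interface (not the paper's code).
from dataclasses import dataclass
from typing import Callable, Dict

PROMPT_AXES = ("clarity_structure", "linguistic_quality", "fairness")
RESPONSE_AXES = ("accuracy", "coherence", "relevance", "objectivity", "clarity", "conciseness")

@dataclass
class AxisResult:
    score: int       # 1-5 Likert score for one criterion
    rationale: str   # criterion-specific natural-language justification

def score_with_rubric(
    judge: Callable[[str], Dict[str, object]],  # any LLM call returning {"score", "rationale"}
    prompt: str,
    response: str,
) -> Dict[str, AxisResult]:
    """Rate every rubric axis; prompt axes see the prompt, response axes see the response."""
    results: Dict[str, AxisResult] = {}
    for axis in PROMPT_AXES + RESPONSE_AXES:
        target = prompt if axis in PROMPT_AXES else response
        instruction = (
            f"Rate the following text on the criterion '{axis}' from 1 (poor) to 5 (excellent).\n"
            f"Return JSON with fields 'score' and 'rationale'.\n\nText:\n{target}"
        )
        raw = judge(instruction)
        results[axis] = AxisResult(score=int(raw["score"]), rationale=str(raw["rationale"]))
    return results
```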
If this is right
- PEEM accuracy scores preserve model rankings across tasks with high correlation to ground truth accuracy.
- Prompts can be optimized iteratively using only PEEM feedback without any labeled data or training.
- The evaluation remains consistent when different LLMs serve as judges and stable under meaning-preserving changes to prompts.
- Semantic adversarial attacks cause measurable drops in PEEM scores while paraphrases do not.
Where Pith is reading between the lines
- This method could be applied to create more transparent prompt engineering tools for non-experts.
- Future work might test whether adding domain-specific criteria further improves optimization for specialized tasks.
- The reliance on LLM judges suggests potential for hybrid systems combining automated and human review.
Load-bearing premise
The LLM judge applies the rubric in a reliable, unbiased way that truly captures the intended quality aspects without hallucinations or systematic errors.
What would settle it
A replication study on new benchmarks where PEEM scores show low correlation with actual accuracy or where the feedback loop produces no accuracy improvement would falsify the central claims.
Figures
Original abstract
Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.
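A minimal sketch of how the reported alignment between the accuracy axis and conventional accuracy could be checked, assuming per-(benchmark, model) pairs of mean PEEM accuracy-axis scores and measured accuracies have already been collected; the data values below are placeholders, not figures from the paper.

```python
# Sketch: alignment between PEEM accuracy-axis scores and conventional accuracy
# (placeholder data; one entry per benchmark-model pair).
from scipy.stats import pearsonr, spearmanr

peem_accuracy_axis = [4.1, 3.2, 4.6, 2.8, 3.9]       # mean 1-5 rubric scores
benchmark_accuracy = [0.78, 0.55, 0.86, 0.41, 0.70]  # conventional accuracy on the same pairs

rho, rho_p = spearmanr(peem_accuracy_axis, benchmark_accuracy)
r, r_p = pearsonr(peem_accuracy_axis, benchmark_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3g}), Pearson r = {r:.2f} (p = {r_p:.3g})")
```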
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PEEM, a unified 9-axis rubric (3 prompt criteria: clarity/structure, linguistic quality, fairness; 6 response criteria: accuracy, coherence, relevance, objectivity, clarity, conciseness) evaluated by an LLM judge that outputs 1-5 Likert scores plus criterion-specific rationales. Across 7 benchmarks and 5 task models, the accuracy axis aligns with conventional accuracy (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001) while preserving rankings; a multi-evaluator study shows LLM-LLM consistency (pairwise ρ = 0.68-0.85); prompt perturbations and semantic attacks produce expected score changes; and a zero-shot rewriting loop using only PEEM scores/rationales as feedback improves downstream accuracy by up to 11.7 points, outperforming supervised and RL baselines.
Significance. If the LLM judge produces faithful, unbiased scores and rationales, PEEM supplies an interpretable, criterion-driven protocol that links prompt formulation to response behavior and supports systematic diagnosis plus optimization. The reported accuracy-axis alignment and the rewriting-loop gains constitute concrete, reproducible evidence of utility beyond black-box metrics. The multi-evaluator consistency further supports evaluator-agnostic use. These strengths would make the work a useful contribution to prompt-engineering methodology provided the human-grounding gap is addressed.
major comments (2)
- [Multi-Evaluator Study] The reported pairwise Spearman ρ = 0.68-0.85 establishes only LLM-LLM agreement. No human inter-annotator agreement, correlation with human judgments, or other validation is provided for the eight non-accuracy axes. Because the rewriting-loop claim (up to +11.7 accuracy points) rests on the rationales supplying genuine quality signals, the absence of human grounding is load-bearing and must be remedied before the optimization results can be accepted at face value.
- [Prompt Rewriting Experiment] The zero-shot loop that feeds PEEM scores and rationales back to the model is presented as outperforming supervised and RL baselines, yet the manuscript supplies no details on the exact evaluator prompt template, temperature settings, or controls for judge-model bias (e.g., whether the same model family is used for both evaluation and rewriting). Without these, it remains possible that observed gains are artifacts of self-reinforcement rather than external prompt-quality information.
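To make the self-reinforcement concern concrete, here is a minimal sketch of a zero-shot rewriting loop in which the judge and the rewriter are distinct callables; the function names, stopping rule, and round budget are assumptions rather than the paper's protocol.

```python
# Sketch of a zero-shot prompt-rewriting loop with an explicitly separate judge
# and rewriter (illustrative; not the paper's protocol).
from typing import Callable, Dict, Tuple

Scores = Dict[str, int]       # per-axis 1-5 scores
Rationales = Dict[str, str]   # per-axis natural-language feedback

def rewrite_loop(
    prompt: str,
    evaluate: Callable[[str], Tuple[Scores, Rationales]],  # judge model (different family)
    rewrite: Callable[[str, Scores, Rationales], str],     # rewriter model
    max_rounds: int = 5,
) -> str:
    candidate = prompt
    best_prompt, best_mean = prompt, float("-inf")
    for _ in range(max_rounds):
        scores, rationales = evaluate(candidate)
        mean_score = sum(scores.values()) / len(scores)
        if mean_score <= best_mean:   # stop once rubric feedback no longer improves
            break
        best_prompt, best_mean = candidate, mean_score
        candidate = rewrite(candidate, scores, rationales)
    return best_prompt
```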
minor comments (2)
- [Robustness Analysis] The robustness rate (76.7-80.6%) under meaning-preserving paraphrases is cited in the abstract, but its exact definition, formula, and per-axis breakdown are not stated in the main text; this should be added for reproducibility (one plausible form is sketched after this list).
- [Methods] The paper would benefit from an explicit table or appendix listing the precise system prompt and few-shot examples (if any) used by the PEEM evaluator LLM, together with the exact Likert-scale anchoring instructions.
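Because the paper does not state the robustness-rate formula, the sketch below encodes only one plausible reading: the fraction of meaning-preserving paraphrases whose per-axis PEEM scores stay within a small tolerance of the original prompt's scores. The definition and tolerance are hypothetical.

```python
# Hypothetical robustness-rate computation; the paper's exact formula is not stated.
# A paraphrase counts as stable if every axis score moves by at most `tol` points.
from typing import Dict, List

def robustness_rate(
    original_scores: Dict[str, int],
    paraphrase_scores: List[Dict[str, int]],
    tol: int = 1,
) -> float:
    if not paraphrase_scores:
        return 0.0
    stable = sum(
        1
        for para in paraphrase_scores
        if all(abs(para[axis] - original_scores[axis]) <= tol for axis in original_scores)
    )
    return stable / len(paraphrase_scores)
```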
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: Multi-Evaluator Study: The reported pairwise Spearman ρ = 0.68-0.85 establishes only LLM-LLM agreement. No human inter-annotator agreement, correlation with human judgments, or validation is provided for the eight non-accuracy axes. Because the rewriting-loop claim (up to +11.7 accuracy points) rests on the rationales supplying genuine quality signals, the absence of human grounding is load-bearing and must be remedied before the optimization results can be accepted at face value.
Authors: We agree that the current multi-evaluator study demonstrates only LLM-LLM consistency and that direct human validation for the non-accuracy axes is absent. While the accuracy axis is grounded via strong alignment with conventional (human-annotated) accuracy, the other axes rely on rubric-based LLM judgments without explicit human correlation. In the revised manuscript we will add a human annotation study on a representative subset of examples across the benchmarks, reporting inter-annotator agreement and Pearson/Spearman correlations between PEEM scores and human judgments for all nine axes. This will also provide direct evidence that the rationales used in the rewriting loop convey genuine quality signals. Revision: yes.
Referee: Prompt Rewriting Experiment: The zero-shot loop that feeds PEEM scores and rationales back to the model is presented as outperforming supervised and RL baselines, yet the manuscript supplies no details on the exact evaluator prompt template, temperature settings, or controls for judge-model bias (e.g., whether the same model family is used for both evaluation and rewriting). Without these, it remains possible that observed gains are artifacts of self-reinforcement rather than external prompt-quality information.
Authors: We acknowledge that the original manuscript omitted key implementation details. In the revision we will include the complete evaluator prompt template in the appendix, state that all evaluations used temperature 0 for determinism, and clarify the model separation: the judge model belongs to a different family from both the task models and the rewriting model. We will also add an ablation that repeats the rewriting loop with an alternative judge model to confirm that gains are not due to self-reinforcement. Revision: yes.
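As a rough sketch of how the human study promised in the first response could be scored, assuming two annotators rate the same items on the 1-5 scale for a given axis: quadratic-weighted kappa for inter-annotator agreement and Spearman correlation against PEEM scores are one reasonable choice, not the authors' stated plan.

```python
# Sketch: inter-annotator agreement and PEEM-human correlation for a single axis
# (placeholder ratings; not results from the paper).
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

annotator_a = [4, 3, 5, 2, 4, 3]   # 1-5 ratings from annotator A
annotator_b = [4, 2, 5, 2, 3, 3]   # 1-5 ratings from annotator B
peem_scores = [4, 3, 5, 1, 4, 3]   # PEEM scores for the same items

kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
human_mean = [(a + b) / 2 for a, b in zip(annotator_a, annotator_b)]
rho, p = spearmanr(peem_scores, human_mean)
print(f"Weighted kappa = {kappa:.2f}; Spearman rho vs. human mean = {rho:.2f} (p = {p:.3g})")
```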
Circularity Check
No significant circularity; claims rest on external benchmarks
Full rationale
The paper defines the 9-axis rubric independently and validates the accuracy axis through direct correlation with conventional accuracy metrics across 7 benchmarks (Spearman rho ~0.97). The zero-shot rewriting loop uses PEEM scores/rationales only as input feedback; success is measured by independent downstream accuracy gains (up to +11.7 points) on the same benchmarks, not by re-evaluating with PEEM itself. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. LLM-LLM consistency (rho 0.68-0.85) is reported as supporting evidence but is not the foundation of the accuracy alignment or rewriting results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based evaluators can produce reliable, rubric-grounded scores and rationales for both prompts and responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output scalar scores on a 1-5 Likert scale and criterion-specific natural-language rationales."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.