
arxiv: 2606.17634 · v1 · pith:4AME2ISUnew · submitted 2026-06-16 · 💻 cs.CL

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

Pith reviewed 2026-06-27 01:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · pairwise comparison · prompt perturbation · comparison graphs · ranking consistency · intransitivity · LLM ranking · structural consistency

The pith

Perturbing each prompt yields multiple comparison graphs; checking structural consistency across those graphs filters out cyclic inconsistencies before the LLMs are ranked.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets intransitivity in pairwise LLM evaluations, where judgments on responses can produce cycles or tie inconsistencies that block any coherent global ranking. It generates perturbed variants of each prompt, builds comparison graphs from the resulting judgments, and removes patterns that fail to align structurally across those graphs. Standard ranking methods are then applied only to the retained comparisons. A sympathetic reader would care because the filtered comparisons produce more stable leaderboards whose order is less sensitive to small prompt changes. The framework inserts the consistency check explicitly before any aggregation step.
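The pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter used here (keep a pairwise outcome only if a fixed share of the perturbed graphs agree on it) and the win-count ranking are stand-ins, and the names `filter_by_agreement`, `rank_by_wins`, and `min_agreement` are hypothetical.

```python
from collections import Counter, defaultdict

def filter_by_agreement(graphs, min_agreement=0.75):
    """Keep a pair's outcome only if enough perturbed graphs agree on it.

    graphs: list of dicts mapping (model_x, model_y) -> winning model,
            one dict per perturbed prompt variant (assumed data layout).
    """
    votes = defaultdict(Counter)
    for graph in graphs:
        for pair, winner in graph.items():
            votes[pair][winner] += 1
    kept = {}
    for pair, counter in votes.items():
        winner, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_agreement:
            kept[pair] = winner
    return kept

def rank_by_wins(kept, models):
    """Rank models by number of retained wins (ties broken by name)."""
    wins = {m: 0 for m in models}
    for winner in kept.values():
        wins[winner] += 1
    return sorted(models, key=lambda m: (-wins[m], m))

# Four perturbed variants of one prompt, three models.
graphs = [
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "C"},
    {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"},
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
]
kept = filter_by_agreement(graphs)
ranking = rank_by_wins(kept, ["A", "B", "C"])
```

In this toy input the B-versus-C outcome flips in half of the variants, so it is dropped, while the two comparisons involving A survive the filter.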

Core claim

By generating perturbed variants of each prompt, constructing comparison graphs from the resulting judgments, and filtering out structurally inconsistent comparison patterns across those graphs, the approach reduces cyclic inconsistencies and yields more reliable LLM rankings when standard aggregation methods are applied to the filtered set.

What carries the argument

Prompt perturbation framework that builds multiple comparison graphs per original prompt and applies graph-level structural consistency checks to filter comparisons before ranking.
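As one concrete example of a graph-level check, the sketch below counts directed 3-cycles (A beats B, B beats C, C beats A) in a single comparison graph. The edge representation and the function name `count_3_cycles` are assumptions made for illustration; the paper's own consistency criterion is not specified on this page.

```python
from itertools import combinations

def count_3_cycles(models, beats):
    """Count triples of models whose pairwise outcomes form a cycle.

    beats: set of (winner, loser) edges (assumed representation).
    """
    cycles = 0
    for a, b, c in combinations(models, 3):
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            cycles += 1
    return cycles

models = ["A", "B", "C"]
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}       # A > B > C > A
transitive = {("A", "B"), ("B", "C"), ("A", "C")}   # A > B > C

n_cyclic = count_3_cycles(models, cyclic)          # 1
n_transitive = count_3_cycles(models, transitive)  # 0
```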

If this is right

  • Leaderboards become less sensitive to minor prompt wording changes.
  • Standard ranking algorithms can be used without first solving the full intransitivity problem.
  • The same filtering step applies to any pairwise comparison setup that produces directed graphs.
  • Inconsistencies involving ties are also reduced when they fail the cross-graph consistency test.
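The tie case in the last bullet can be made concrete with a small check for triples where two ties sit next to a strict preference (A ties B, B ties C, yet A and C are not tied). This is a hedged sketch: the representation of ties as unordered pairs and the name `tie_violations` are assumptions, not taken from the paper.

```python
from itertools import permutations

def tie_violations(models, ties, beats):
    """Return triples (a, b, c) with a~b, b~c but a strict result between a and c.

    ties:  set of frozenset({x, y}) for tied pairs (assumed representation).
    beats: set of (winner, loser) for strict preferences.
    """
    found = []
    for a, b, c in permutations(models, 3):
        if a >= c:
            continue  # visit each (middle, endpoints) combination once
        tied_ab = frozenset((a, b)) in ties
        tied_bc = frozenset((b, c)) in ties
        strict_ac = (a, c) in beats or (c, a) in beats
        if tied_ab and tied_bc and strict_ac:
            found.append((a, b, c))
    return found

models = ["A", "B", "C"]
ties = {frozenset(("A", "B")), frozenset(("B", "C"))}
beats = {("A", "C")}
violations = tie_violations(models, ties, beats)  # [("A", "B", "C")]
```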

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested by holding out a subset of prompts and checking whether filtered rankings better predict held-out human judgments than unfiltered ones.
  • If perturbation strength is varied, one could measure the trade-off between the amount of data removed and the drop in observed cycles.
  • The approach suggests that consistency across prompt variants may serve as a proxy for evaluation reliability in other open-ended generation tasks.
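The trade-off mentioned in the second bullet could be tabulated roughly as follows. This is a sketch under several assumptions: each graph is a complete tournament without ties, the filter keeps the `keep` graphs with the fewest 3-cycles, and the names `cyclic_triples`, `tradeoff_by_strength`, and the notion of a numeric "strength" are hypothetical. For a complete tournament on n models, the number of cyclic triples equals C(n, 3) minus the sum over models of C(out-degree, 2).

```python
from math import comb

def cyclic_triples(models, beats):
    """3-cycle count of a complete tournament, computed from out-degrees."""
    outdeg = {m: 0 for m in models}
    for winner, _loser in beats:
        outdeg[winner] += 1
    return comb(len(models), 3) - sum(comb(d, 2) for d in outdeg.values())

def tradeoff_by_strength(models, graphs_by_strength, keep):
    """For each perturbation strength, report data removed vs. cycles left."""
    rows = []
    for strength, graphs in sorted(graphs_by_strength.items()):
        scores = sorted(cyclic_triples(models, g) for g in graphs)
        kept = scores[:keep]  # graphs with the fewest cycles
        rows.append({
            "strength": strength,
            "removed_fraction": 1 - len(kept) / len(scores),
            "cycles_before": sum(scores),
            "cycles_after": sum(kept),
        })
    return rows

models = ["A", "B", "C"]
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
graphs_by_strength = {
    0.1: [transitive, transitive, cyclic, transitive],
    0.5: [cyclic, transitive, cyclic, cyclic],
}
rows = tradeoff_by_strength(models, graphs_by_strength, keep=2)
```

With this toy data, both strengths discard half of the graphs, but the stronger perturbation still leaves one cycle after filtering while the weaker one leaves none.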

Load-bearing premise

Structural consistency across graphs from perturbed prompt variants reliably flags and removes only noisy or invalid comparisons without discarding valid preference data or creating new selection bias.

What would settle it

Run the method on a dataset with known human-validated preferences, then check two things: whether the share of removed comparisons that humans judge correct exceeds the corresponding share among retained comparisons, and whether intransitivity rates remain high after filtering. Either outcome would count against the method.
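A minimal sketch of the first measurement, assuming judge outcomes and human labels are both available as dictionaries keyed by model pair. The function names and data layout are hypothetical and chosen only for illustration.

```python
def accuracy(comparisons, human):
    """Share of judge outcomes that match the human label; None if empty."""
    if not comparisons:
        return None
    correct = sum(1 for pair, winner in comparisons.items()
                  if human[pair] == winner)
    return correct / len(comparisons)

def retained_vs_removed(judgments, kept_pairs, human):
    """Split judge outcomes by filter decision and score each side."""
    retained = {p: w for p, w in judgments.items() if p in kept_pairs}
    removed = {p: w for p, w in judgments.items() if p not in kept_pairs}
    return accuracy(retained, human), accuracy(removed, human)

judgments = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "C", ("B", "D"): "B"}
human     = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B", ("B", "D"): "B"}
kept_pairs = {("A", "B"), ("A", "C"), ("B", "D")}

retained_acc, removed_acc = retained_vs_removed(judgments, kept_pairs, human)
# The filter looks harmful if removed_acc is higher than retained_acc.
```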

Figures

Figures reproduced from arXiv: 2606.17634 by Dong Huang, Jianbo Sun, Pengkun Yang.

Figure 1: Comparison between the traditional pipeline and our prompt-perturbation-based aggregation
Figure 4: Effect of the keep-…
Figure 5: Sensitivity to the relative weighting of bad 3-cycles and bad 4-cycles in the truncation score
Figure 6: Descriptive statistics of the comparison graphs by category. Prompt perturbation usually increases…
Figure 7 shows that the same qualitative pattern holds under all four distances and for both judges. Each curve drops quickly when K grows from very small values, reaches its best region at an intermediate budget, and then flattens or slightly rebounds as additional graphs are retained. The absolute scales differ, as expected. The Chebyshev distance is the noisiest because it depends only on the single worst-displa…
Figure 8: Semantic prompt perturbation versus a sampling-only control under cycle-aware truncation. Lower…
Original abstract

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
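The abstract leaves "standard ranking methods" unspecified. As one common example, and purely as an assumption about what such a method could be, the sketch below fits a Bradley-Terry model to win counts with the classic fixed-point iteration. The function name and input layout are hypothetical, and the sketch assumes every model has at least one win.

```python
def bradley_terry(models, wins, iters=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] = number of times model i beat model j.
    Assumes each model wins at least once, so no strength collapses to zero.
    """
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else p[i]
        norm = sum(updated.values())
        p = {m: v / norm for m, v in updated.items()}  # strengths sum to 1
    return p

models = ["A", "B", "C"]
wins = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("B", "C"): 3, ("C", "B"): 1,
    ("A", "C"): 3, ("C", "A"): 1,
}
strengths = bradley_terry(models, wins)
ranking = sorted(models, key=lambda m: -strengths[m])
```

In the framework described above, the input to such a model would be the filtered comparisons only.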

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a prompt perturbation framework for pairwise LLM evaluation over comparison graphs. It generates perturbed variants of each prompt, constructs comparison graphs from the resulting judgments, identifies and filters out structurally inconsistent comparison patterns, and applies standard ranking methods to the filtered comparisons, with the goal of reducing cyclic inconsistencies such as A ≻ B ≻ C ≻ A and improving the reliability of LLM rankings.

Significance. If the filtering step based on graph-level structural consistency across perturbations can be shown to remove noise-induced inconsistencies while preserving underlying preference signal, the framework could offer a practical improvement to the stability of LLM leaderboards. The approach explicitly incorporates consistency checks before aggregation, which is a clear conceptual contribution, but the absence of any reported experiments, datasets, or analysis leaves the practical impact unassessed.

major comments (2)
  1. [Abstract] Abstract (framework description paragraph): the central claim that the method 'provides a simple and principled way to reduce cyclic inconsistencies' rests on the unstated assumption that perturbations preserve preferences while exposing only noise; no formal definition of the perturbation operator or the precise criterion for 'structurally inconsistent comparison patterns' is supplied, making it impossible to verify whether the filter introduces selection bias.
  2. [Abstract] Abstract: no experimental results, error analysis, validation datasets, or comparison against baselines are presented, so the claim that the filtered comparisons improve reliability cannot be evaluated; this is load-bearing because the soundness of the pipeline cannot be determined from the method description alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below and commit to revisions that strengthen the formalization and add empirical validation.

  1. Referee: [Abstract] Abstract (framework description paragraph): the central claim that the method 'provides a simple and principled way to reduce cyclic inconsistencies' rests on the unstated assumption that perturbations preserve preferences while exposing only noise; no formal definition of the perturbation operator or the precise criterion for 'structurally inconsistent comparison patterns' is supplied, making it impossible to verify whether the filter introduces selection bias.

    Authors: We agree that the abstract does not supply formal definitions or explicitly state the core assumption. In the revised manuscript we will add a precise definition of the perturbation operator (as a distribution over prompt variants) and a graph-theoretic criterion for structural inconsistency (e.g., violation of transitivity or tie consistency across the perturbation ensemble). We will also include a short discussion of the modeling assumption and a brief analysis of possible selection bias induced by the filter. revision: yes

  2. Referee: [Abstract] Abstract: no experimental results, error analysis, validation datasets, or comparison against baselines are presented, so the claim that the filtered comparisons improve reliability cannot be evaluated; this is load-bearing because the soundness of the pipeline cannot be determined from the method description alone.

    Authors: The submitted manuscript presents only the conceptual framework. We acknowledge that the reliability claim cannot be assessed without experiments. In the revision we will add (i) experiments on standard pairwise LLM evaluation benchmarks, (ii) error analysis of filtered vs. unfiltered graphs, (iii) comparison against baseline aggregation methods, and (iv) ablation studies on the perturbation and filtering steps. revision: yes
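The definitions promised in response 1 could take a form like the following. This is an illustrative sketch only and not the authors' formalization; every symbol ($\mathcal{P}$, $G_k$, $E_k$, $T_k$, $\tau$) and the ensemble threshold rule are assumptions introduced here.

```latex
% Perturbation operator: a distribution over variants of the prompt x.
x_1, \dots, x_K \sim \mathcal{P}(\cdot \mid x)

% Comparison graph for variant x_k over the model set V:
% (i, j) \in E_k means the judge prefers i to j; \{i, j\} \in T_k means a tie.
G_k = (V, E_k, T_k)

% Transitivity violation (3-cycle) on the triple (i, j, l) in G_k:
(i, j) \in E_k, \quad (j, l) \in E_k, \quad (l, i) \in E_k

% Tie inconsistency on the triple (i, j, l) in G_k:
\{i, j\} \in T_k, \quad \{j, l\} \in T_k, \quad \{i, l\} \notin T_k

% Hypothetical ensemble criterion: flag the triple if it is violated
% in more than a fraction \tau of the K perturbed graphs.
\frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\{(i, j, l) \text{ is violated in } G_k\} > \tau
```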

Circularity Check

0 steps flagged

No significant circularity


The paper describes a procedural pipeline (generate prompt perturbations, build comparison graphs, filter inconsistent patterns, apply ranking) without equations, fitted parameters, predictions, or derivations. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is a methodological proposal whose validity rests on external validation rather than internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the text.

pith-pipeline@v0.9.1-grok · 5733 in / 1041 out tokens · 36487 ms · 2026-06-27T01:00:34.650982+00:00 · methodology

