pith. machine review for the scientific record.

arxiv: 2604.03742 · v1 · submitted 2026-04-04 · 💻 cs.AI

Recognition: no theorem link

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM evaluation · Analytic Hierarchy Process · Fuzzy AHP · DualJudge · multi-criteria decision making · epistemic uncertainty · JudgeBench · hybrid evaluation

The pith

Fuzzy AHP with confidence-modulated uncertainty and a DualJudge hybrid produce more consistent LLM evaluations than direct scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the Analytic Hierarchy Process to break LLM judgments into explicit criteria such as coherence and relevance, then extends it to a fuzzy version that encodes epistemic uncertainty through triangular fuzzy numbers weighted by the model's own confidence scores. On JudgeBench this structured approach, both crisp and fuzzy, yields better calibration and stability than simple direct scoring across model sizes and data splits. The authors then introduce DualJudge, which adaptively blends the structured AHP output with holistic direct scores using consistency-aware weighting. The hybrid reaches state-of-the-art agreement with human preferences, showing that intuitive and deliberative evaluation routes are complementary rather than competing.
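For readers unfamiliar with AHP, the crisp step described above can be sketched as follows. This is a minimal illustration of textbook Saaty-style AHP (row geometric mean priorities plus a consistency ratio), not the authors' code. The function names, the third criterion ("accuracy"), and the matrix values are hypothetical; only coherence and relevance are named in the summary.

```python
import math

# Saaty's random consistency index by matrix size (commonly cited values).
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}

def ahp_weights(matrix):
    """Priority vector via the row geometric mean method."""
    n = len(matrix)
    geo = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo)
    return [g / total for g in geo]

def consistency_ratio(matrix, weights):
    """CR = CI / RI, with lambda_max estimated from (A w)_i / w_i."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]

# Hypothetical pairwise comparisons: coherence vs relevance vs accuracy.
# Entry [i][j] says how strongly criterion i is preferred over criterion j.
A = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
w = ahp_weights(A)
cr = consistency_ratio(A, w)
```

The example matrix is perfectly consistent by construction, so the consistency ratio comes out at essentially zero. By the usual convention, a ratio below 0.1 is treated as acceptable.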

Core claim

By representing pairwise comparisons as triangular fuzzy numbers whose spread is scaled by LLM-generated confidence, the Fuzzy AHP decomposes evaluation into explicit criteria, checks consistency, and aggregates via uncertainty-aware operators; when these outputs are fused with direct holistic scores inside DualJudge through a consistency-weighted mechanism, the resulting judgments outperform both pure direct scoring and standalone AHP on JudgeBench.
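One way to picture the confidence-scaled triangular fuzzy numbers is the sketch below. It assumes a linear rule in which the spread shrinks to zero at full confidence; the review does not state the paper's actual modulation formula, so `confidence_tfn`, `max_spread`, and the clipping to the 1..9 scale are all hypothetical choices made for illustration.

```python
from typing import NamedTuple

class TFN(NamedTuple):
    """Triangular fuzzy number (lower, modal, upper)."""
    l: float
    m: float
    u: float

def confidence_tfn(judgment: float, confidence: float, max_spread: float = 2.0) -> TFN:
    """Build a TFN around a Saaty-scale judgment >= 1.

    ASSUMPTION: the spread shrinks linearly with confidence. The paper's
    real modulation rule is not given in this review.
    """
    if not 1.0 <= judgment <= 9.0:
        raise ValueError("judgment must lie on the 1..9 Saaty scale")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    spread = max_spread * (1.0 - confidence)
    return TFN(max(1.0, judgment - spread), judgment, min(9.0, judgment + spread))

def reciprocal(t: TFN) -> TFN:
    """Reciprocal TFN for the mirrored entry of the comparison matrix."""
    return TFN(1.0 / t.u, 1.0 / t.m, 1.0 / t.l)

sure = confidence_tfn(3.0, confidence=1.0)    # collapses to a crisp 3
unsure = confidence_tfn(3.0, confidence=0.5)  # one scale step on each side
mirrored = reciprocal(unsure)
```

Under this rule a fully confident judgment behaves exactly like crisp AHP, while lower confidence widens the interval that later aggregation has to carry.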

What carries the argument

Confidence-aware Fuzzy Analytic Hierarchy Process (FAHP) that modulates triangular fuzzy numbers with LLM-generated confidence scores to aggregate multi-criteria judgments while tracking epistemic uncertainty.
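A rough sketch of how fuzzy pairwise judgments can be aggregated into criterion weights is given below. It uses the fuzzy geometric mean in the style of Buckley's fuzzy hierarchical analysis followed by centroid defuzzification. Whether the paper uses this exact operator is an assumption; the function names and the 2x2 example matrix are hypothetical.

```python
import math

def fuzzy_weights(matrix):
    """Buckley-style fuzzy weights from a matrix of (l, m, u) triples."""
    n = len(matrix)
    # Fuzzy geometric mean of each row, computed component-wise.
    geo = [
        tuple(math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    total = [sum(g[k] for g in geo) for k in range(3)]
    # Divide by the totals in reversed order so that l <= m <= u still holds.
    return [(g[0] / total[2], g[1] / total[1], g[2] / total[0]) for g in geo]

def defuzzify(weights):
    """Centroid defuzzification followed by renormalisation to sum to 1."""
    crisp = [(l + m + u) / 3.0 for l, m, u in weights]
    s = sum(crisp)
    return [c / s for c in crisp]

one = (1.0, 1.0, 1.0)
# Hypothetical fuzzy judgment: criterion 0 is "about 3 times" as important
# as criterion 1, with some uncertainty on both sides.
M = [
    [one, (2.0, 3.0, 4.0)],
    [(0.25, 1.0 / 3.0, 0.5), one],
]
fw = fuzzy_weights(M)
w = defuzzify(fw)
```

The fuzzy weights keep the lower and upper bounds visible until the final step, which is what makes the uncertainty traceable per criterion.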

If this is right

  • Evaluations remain more stable when comparison pairs contain high uncertainty.
  • Explicit criteria make the sources of disagreement between judges traceable.
  • Hybrid fusion improves performance on both easy and hard items in JudgeBench splits.
  • The same decomposition can be applied to new criteria without retraining the underlying LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be ported to non-LLM decision tasks where raters have varying confidence.
  • If criteria weights were learned from data rather than set by experts, calibration might improve further.
  • Extending the fuzzy aggregation to larger hierarchies might reveal which criteria drive most of the uncertainty.

Load-bearing premise

That LLM confidence scores reliably reflect epistemic uncertainty, and that decomposing evaluations into AHP criteria adds accuracy without introducing new systematic biases.

What would settle it

A human-labeled test set in which direct scoring matches human preferences more closely than either AHP or DualJudge, or in which FAHP judgments become less stable precisely when confidence scores are low.

Figures

Figures reproduced from arXiv: 2604.03742 by Dmitry Fedrushkov, Ilya Revin, Ivan Smirnov, Sergey Kovalchuk, Yulong He.

Figure 1: Overview of the proposed hybrid AHP-based evaluation framework. The pipeline integrates direct scoring, crisp AHP, and fuzzy AHP through consistency checking, weight derivation, and result aggregation.
Figure 2: Despite a tie in direct scoring, the structured AHP-based evaluations (crisp and fuzzy) consistently favor Response A.
Claude Split: Response pairs generated by Claude-3-class models (270 samples total). These splits ensure that observed improvements are not artifacts of a specific response style or generator bias.
Original abstract

Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript adapts the Analytic Hierarchy Process (AHP) and its fuzzy extension (FAHP) to LLM evaluation by decomposing judgments into explicit criteria and modeling uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. It introduces DualJudge, a hybrid framework that fuses direct holistic scores with structured AHP outputs through consistency-aware weighting, and reports that both crisp and fuzzy variants outperform direct scoring on JudgeBench, with DualJudge achieving state-of-the-art results.

Significance. If the results hold, the work supplies a structured, uncertainty-aware alternative to direct scoring that could improve consistency and interpretability in LLM assessment. The open-source code at the provided GitHub link is a clear strength that aids reproducibility.

major comments (2)
  1. Methods section on FAHP: the central claim that modulating triangular fuzzy numbers with LLM confidence scores improves calibration and stability rests on the untested assumption that these scores track epistemic uncertainty rather than overconfidence or bias; no Brier score, expected calibration error, or correlation with judgment accuracy on JudgeBench is reported to support this modulation step.
  2. Experiments section: the abstract states consistent outperformance across model scales and dataset splits with FAHP showing superior stability, yet provides no details on statistical significance tests, exact train/test splits, or controls for post-hoc hyperparameter choices, undermining verification of the SOTA claim for DualJudge.
minor comments (1)
  1. Abstract: the description of 'consistency-aware fusion weights' should explicitly state whether these are derived solely from the AHP consistency ratio or involve additional fitting.
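The calibration checks requested in major comment 1 can be sketched in a few lines. The snippet below computes a Brier score and an equal-width-bin expected calibration error from a judge's stated confidence and a 0/1 indicator of agreement with the human label. The data are hypothetical and the function names are illustrative; nothing here comes from the paper's code.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(correct)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Hypothetical judge outputs: confidence in the chosen response, and whether
# the choice matched the human label (1) or not (0).
conf = [0.9, 0.9, 0.6, 0.6]
hit = [1, 0, 1, 0]
bs = brier_score(conf, hit)
ece = expected_calibration_error(conf, hit)
```

In this toy case the judge is right half the time at both confidence levels, so the stated confidence carries no information about accuracy; that is the failure mode the referee is asking the authors to rule out.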

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, and we plan to incorporate revisions to strengthen the paper.

Point-by-point responses
  1. Referee: Methods section on FAHP: the central claim that modulating triangular fuzzy numbers with LLM confidence scores improves calibration and stability rests on the untested assumption that these scores track epistemic uncertainty rather than overconfidence or bias; no Brier score, expected calibration error, or correlation with judgment accuracy on JudgeBench is reported to support this modulation step.

    Authors: We agree that direct validation of the confidence scores as measures of epistemic uncertainty would strengthen the methodological justification. Although the original manuscript did not include Brier scores or expected calibration error, the superior performance of FAHP on JudgeBench provides indirect support for the approach. In the revision, we will add a correlation analysis between the LLM confidence scores and judgment accuracy on the benchmark to better substantiate the modulation step. revision: yes

  2. Referee: Experiments section: the abstract states consistent outperformance across model scales and dataset splits with FAHP showing superior stability, yet provides no details on statistical significance tests, exact train/test splits, or controls for post-hoc hyperparameter choices, undermining verification of the SOTA claim for DualJudge.

    Authors: We acknowledge the need for more rigorous experimental details to support the claims. The experiments utilized fixed dataset splits from JudgeBench without post-hoc hyperparameter optimization. We will include statistical significance testing (such as Wilcoxon signed-rank tests) and precise descriptions of the train/test splits and hyperparameter settings in the revised manuscript to allow for better verification of the results and the SOTA performance of DualJudge. revision: yes
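For orientation, the paired test mentioned in the response can be sketched without external dependencies. The function below is a plain-Python Wilcoxon signed-rank test using the normal approximation, with no tie or continuity correction, so its p-value is only approximate for small samples. The per-split accuracies are hypothetical numbers, not results from the paper. In practice a library routine such as `scipy.stats.wilcoxon` would be the more robust choice.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Zero differences are dropped; tied |differences| get average ranks.
    No tie or continuity correction is applied, so treat p as approximate.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return w_plus, p

# Hypothetical accuracies (in percent) on six evaluation splits.
dualjudge = [71, 68, 74, 70, 69, 73]
direct = [66, 65, 70, 71, 64, 69]
w_plus, p_value = wilcoxon_signed_rank(dualjudge, direct)
```

Integer percentages are used in the example so that tied differences are detected exactly; with floating-point accuracies the equality check on absolute differences would need a tolerance.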

Circularity Check

0 steps flagged

Minor risk from AHP weight derivation but no equations reduce to fitted parameters by construction

full rationale

The paper's central claims rest on empirical validation against the external JudgeBench dataset rather than any self-referential derivation. Standard AHP pairwise comparison matrices and triangular fuzzy number modulation are applied without the modulation step being fitted to the target judgments or renamed as a prediction. No self-citation chain is load-bearing for the uniqueness of the DualJudge fusion rule, and the consistency-aware weighting follows directly from the stated Dual-Process inspiration without reducing to prior author results by definition. The only minor elevation above zero arises from the conventional derivation of AHP criterion weights, which is not shown to be circular in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Based on abstract only; main additions are the fuzzy modulation and hybrid weighting. No new physical entities. AHP pairwise consistency is treated as a domain assumption.

free parameters (2)
  • AHP criteria weights
    Derived from pairwise comparisons, potentially fitted to LLM responses.
  • consistency-aware fusion weights
    Adaptive weights in DualJudge that depend on comparison consistency.
axioms (1)
  • domain assumption Standard AHP consistency ratio assumptions hold when applied to LLM judgments
    Paper relies on AHP properties without re-deriving them for LLM context.
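To make the "consistency-aware fusion weights" entry concrete, here is one possible form such a weight could take. It is purely an assumption for illustration: the AHP score is trusted fully at a consistency ratio of zero and not at all once the ratio reaches a threshold. The function `fuse`, the linear ramp, and the default threshold of 0.1 are hypothetical and are not taken from the paper.

```python
def fuse(direct_score, ahp_score, consistency_ratio, cr_threshold=0.1):
    """Blend a holistic direct score with a structured AHP score.

    ASSUMPTION: the AHP weight falls linearly from 1 at CR = 0 to 0 at
    CR >= cr_threshold. The paper's real fusion rule may differ.
    """
    if consistency_ratio < 0:
        raise ValueError("consistency ratio cannot be negative")
    alpha = max(0.0, 1.0 - consistency_ratio / cr_threshold)
    return alpha * ahp_score + (1.0 - alpha) * direct_score

consistent = fuse(direct_score=0.5, ahp_score=0.8, consistency_ratio=0.0)
inconsistent = fuse(direct_score=0.5, ahp_score=0.8, consistency_ratio=0.2)
halfway = fuse(direct_score=0.5, ahp_score=0.8, consistency_ratio=0.05)
```

A rule of this shape has a single tunable constant, which is why the ledger counts the fusion weights as a free parameter: if the threshold were chosen after seeing benchmark results, it would need to be reported as such.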

pith-pipeline@v0.9.0 · 5520 in / 1144 out tokens · 37673 ms · 2026-05-13T17:08:47.622507+00:00 · methodology
