Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop
Pith reviewed 2026-05-08 03:27 UTC · model grok-4.3
The pith
LLMs used as judges for sustainable city trip lists display model-specific biases and high dimension-level variance even when overall rankings align.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across two recommendation settings, multiple LLMs acting as judges exhibit model-specific biases and high variance when scoring trip lists on relevance, diversity, sustainability, and popularity balance, even in cases where the models agree on the overall ranking. The three-phase calibration framework—baseline multi-LLM judging, expert detection of misalignments, and targeted calibration through rules plus few-shot examples—makes per-dimension reasoning more explicit yet reveals divergent interpretations of sustainability, thereby demonstrating the necessity of transparent, bias-aware methods when LLMs evaluate nuanced, multi-objective recommendations.
What carries the argument
Three-phase calibration framework that sequences baseline multi-LLM judging, expert human review to locate systematic misalignments, and dimension-specific adjustments using rules and few-shot examples.
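The third phase can be pictured as prompt assembly: the same judging request, extended with whatever rules and few-shot examples the expert review produced for that dimension. The sketch below is only an illustration under stated assumptions. The `Calibration` class, the `build_judge_prompt` function, the 1-to-5 scale, and the example rule are all hypothetical and are not taken from the paper's released prompts.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Calibration:
    """Phase 3 material for one dimension (hypothetical structure)."""
    rules: list[str] = field(default_factory=list)
    # Each example: (trip list text, expert score, short rationale).
    examples: list[tuple[str, int, str]] = field(default_factory=list)


def build_judge_prompt(dimension: str, trip_list: str,
                       calibration: Optional[Calibration] = None) -> str:
    """Return a baseline prompt, or a calibrated one if calibration is given."""
    lines = [f"Rate the following city-trip list on {dimension} from 1 to 5."]
    if calibration is not None:
        if calibration.rules:
            lines.append("Rules:")
            lines.extend(f"- {rule}" for rule in calibration.rules)
        for text, score, reason in calibration.examples:
            lines.append(f"Example: {text}\nScore: {score}\nReason: {reason}")
    lines.append(f"Trip list: {trip_list}")
    lines.append("Answer with a score and a one-sentence reason.")
    return "\n".join(lines)


# Phase 1: baseline prompt, no expert input.
baseline_prompt = build_judge_prompt("sustainability", "City A, City B, City C")

# Phase 3: same request plus an invented rule and an invented few-shot example.
calibrated_prompt = build_judge_prompt(
    "sustainability",
    "City A, City B, City C",
    Calibration(
        rules=["Score lower when every city in the list is a heavily visited one."],
        examples=[("City X, City Y", 2, "Both cities are heavily visited.")],
    ),
)
print(calibrated_prompt)
```

Keeping rules and examples in a per-dimension object makes it easy to calibrate one dimension, such as sustainability, while leaving the other three on their baseline prompts.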
If this is right
- Overall ranking agreement between LLM judges can conceal large disagreements on individual dimensions such as sustainability.
- Dimension-specific calibration increases transparency of LLM reasoning but does not eliminate divergent interpretations of criteria.
- Model selection or ensemble methods become necessary because each LLM carries its own bias pattern.
- Releasing prompts and code enables direct testing of the calibration steps on other recommendation domains.
- Multi-dimensional evaluation requires explicit handling of stakeholder goals rather than reliance on aggregate scores alone.
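The first implication can be made concrete with a toy calculation. The sketch below uses invented 1-to-5 scores from two hypothetical judges (`judge_a`, `judge_b`) and an unweighted mean as the overall score; none of these numbers or choices come from the paper. It computes Kendall's tau (the tau-a variant, without tie correction) for the overall ranking and for sustainability alone, plus the between-judge variance per dimension.

```python
from itertools import combinations
from statistics import mean, pvariance

DIMENSIONS = ["relevance", "diversity", "sustainability", "popularity_balance"]

# Invented scores: judge -> trip list -> scores in DIMENSIONS order.
scores = {
    "judge_a": {"list_1": [5, 4, 5, 4], "list_2": [4, 4, 3, 4], "list_3": [3, 3, 2, 3]},
    "judge_b": {"list_1": [5, 5, 2, 5], "list_2": [3, 4, 3, 4], "list_3": [2, 3, 4, 2]},
}
LISTS = ["list_1", "list_2", "list_3"]


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def dimension_scores(judge, dim):
    k = DIMENSIONS.index(dim)
    return [scores[judge][name][k] for name in LISTS]


def overall_scores(judge):
    # Assumption: overall score is the unweighted mean of the four dimensions.
    return [mean(scores[judge][name]) for name in LISTS]


overall_tau = kendall_tau(overall_scores("judge_a"), overall_scores("judge_b"))
sustainability_tau = kendall_tau(
    dimension_scores("judge_a", "sustainability"),
    dimension_scores("judge_b", "sustainability"),
)

# Mean between-judge variance per dimension, averaged over trip lists.
spread = {
    dim: mean(
        pvariance([scores[j][name][DIMENSIONS.index(dim)] for j in scores])
        for name in LISTS
    )
    for dim in DIMENSIONS
}

print(overall_tau, sustainability_tau)  # 1.0 -1.0
print(spread)
```

In this constructed case the two judges rank the three lists identically overall while ordering them in exactly opposite ways on sustainability, which is the pattern an aggregate score alone would hide.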
Where Pith is reading between the lines
- The same calibration structure could be tested on non-travel domains such as health or policy recommendations where multiple stakeholder criteria must be balanced.
- If variance persists after calibration, an iterative loop that feeds expert corrections back into prompt refinement might further reduce inconsistencies.
- Standardized operational definitions for abstract dimensions like sustainability would be needed before LLM judges can achieve high inter-rater reliability.
- A controlled study comparing the calibrated LLM outputs against large-scale user preference data could quantify whether the remaining divergences affect real-world choice.
Load-bearing premise
Expert humans in the second phase can reliably spot and correct systematic LLM misalignments without introducing their own inconsistent standards or new subjective biases across the four dimensions.
What would settle it
An independent replication in which separate expert panels apply the same calibration steps and observe low dimension-level variance and no detectable model-specific biases would falsify the reported misalignments.
Original abstract
Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions -- relevance, diversity, sustainability, and popularity balance, and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates LLMs as judges for sustainable city-trip recommendations across four dimensions (relevance, diversity, sustainability, popularity balance) and proposes a three-phase human-in-the-loop calibration framework: (1) baseline multi-LLM judging, (2) expert evaluation to detect systematic misalignments, and (3) dimension-specific calibration with rules and few-shot examples. Across two recommendation settings, it reports model-specific biases and high dimension-level variance even when overall rankings align, with calibration clarifying per-dimension reasoning but exposing divergent sustainability interpretations. Prompts and code are released for reproducibility.
Significance. If the results hold, the work is significant for AI evaluation of nuanced, multi-stakeholder recommendations. It provides concrete evidence of LLM judge limitations in subjective domains like sustainability and offers a practical calibration pipeline. The open release of prompts and code is a clear strength that supports reproducibility and follow-on research.
major comments (1)
- [Phase 2 (expert evaluation)] No inter-annotator agreement metrics (Cohen’s kappa, Fleiss’ kappa, or pairwise agreement rates) or explicit dimension rubrics are reported. The central claims, that expert labels reliably surface LLM misalignments and that calibration is effective, depend on these judgments being stable and consistent. Without them, the observed model biases and dimension variances could reflect shifting human standards rather than LLM properties, especially given the paper’s own observation of divergent sustainability interpretations.
minor comments (1)
- [Abstract] The abstract would benefit from naming the two recommendation settings and the specific LLMs used in the baseline phase to give readers immediate context for the reported biases.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the reliability of Phase 2 expert evaluation. We agree that inter-annotator agreement and explicit rubrics are important for substantiating claims about LLM misalignments, and we will revise the manuscript accordingly.
Referee: Phase 2 (expert evaluation): no inter-annotator agreement metrics (Cohen’s kappa, Fleiss’ kappa, or pairwise agreement rates) or explicit dimension rubrics are reported. The central claims, that expert labels reliably surface LLM misalignments and that calibration is effective, depend on these judgments being stable and consistent. Without them, the observed model biases and dimension variances could reflect shifting human standards rather than LLM properties, especially given the paper’s own observation of divergent sustainability interpretations.
Authors: We acknowledge that the submitted manuscript did not report inter-annotator agreement metrics or reproduce the explicit dimension rubrics provided to experts. This omission weakens the ability to fully rule out human variability as a source of the observed dimension-level variances. We will revise the paper to include appropriate IAA statistics (e.g., Fleiss’ kappa across the multiple experts who performed the Phase 2 annotations) and to append the full dimension-specific rubrics and annotation guidelines. These additions will directly address the concern that the reported model biases and calibration effects might partly reflect inconsistent human standards rather than LLM properties. The paper already notes divergent sustainability interpretations as an outcome of the calibration process; the revised reporting will make clearer how the three-phase framework mitigates such issues while remaining transparent about residual disagreements.
Revision: yes
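For readers unfamiliar with the statistic named in the response, the following is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. The function name `fleiss_kappa`, the three-category label scheme, and the example counts are illustrative assumptions; the paper does not state how many experts or categories were used.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][k] is the number of raters who assigned item i to category k.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all assignments that fell into each category.
    p_category = [sum(row[k] for row in counts) / total for k in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    if expected == 1.0:
        # All ratings in one category: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Invented example: 4 trip lists, 3 experts, sustainability labelled
# as (low, medium, high). Each row sums to the number of experts.
sustainability_counts = [
    [3, 0, 0],
    [1, 2, 0],
    [0, 1, 2],
    [1, 1, 1],
]
print(round(fleiss_kappa(sustainability_counts), 3))
```

Computing the statistic separately for each of the four dimensions, rather than once over pooled labels, would show directly whether expert disagreement is concentrated in sustainability.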
Circularity Check
No significant circularity in empirical evaluation study
The paper describes an empirical three-phase human-in-the-loop calibration framework for assessing LLM judges across relevance, diversity, sustainability, and popularity balance dimensions. All reported observations (model-specific biases, dimension-level variance, and effects of calibration) derive directly from external human expert annotations and comparative LLM outputs rather than from any internal equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing steps reduce claims to their own inputs by construction; the work is self-contained against external human benchmarks and releases prompts/code for reproducibility.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can serve as initial judges for nuanced recommendation lists when provided with dimension-specific prompts.