arxiv: 2604.24158 · v1 · submitted 2026-04-27 · 💻 cs.AI

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Pith reviewed 2026-05-08 03:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM-as-a-Judge, sustainable tourism, recommendation evaluation, human-in-the-loop, multi-dimensional assessment, bias calibration, conversational recommendations, travel planning

The pith

LLMs used as judges for sustainable city trip lists display model-specific biases and high dimension-level variance even when overall rankings align.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests large language models as evaluators of conversational recommendations for city trips, focusing on four dimensions: relevance, diversity, sustainability, and popularity balance. It develops a three-phase calibration method that begins with multiple LLMs producing baseline scores, moves to expert humans identifying where those scores systematically diverge from intended criteria, and ends with dimension-specific adjustments using explicit rules and few-shot examples. The work shows that different models carry distinct biases and that agreement on top-ranked trips often hides sharp disagreements when each dimension is examined separately. This matters for any setting where automated evaluation must serve stakeholder goals that are hard to quantify, such as sustainability in travel planning, because it demonstrates that uncalibrated LLM judgments can produce inconsistent or opaque results.

Core claim

Across two recommendation settings, multiple LLMs acting as judges exhibit model-specific biases and high variance when scoring trip lists on relevance, diversity, sustainability, and popularity balance, even in cases where the models agree on the overall ranking. The three-phase calibration framework—baseline multi-LLM judging, expert detection of misalignments, and targeted calibration through rules plus few-shot examples—makes per-dimension reasoning more explicit yet reveals divergent interpretations of sustainability, thereby demonstrating the necessity of transparent, bias-aware methods when LLMs evaluate nuanced, multi-objective recommendations.
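
To make "model-specific bias" concrete, the sketch below computes the mean signed gap between one LLM judge and the expert labels for each dimension and flags the dimensions where the gap is large. It is a minimal illustration, not the authors' code: the function names, the 1 to 5 score scale, the threshold, and the toy scores are all assumptions introduced here.

```python
from statistics import mean

DIMENSIONS = ("relevance", "diversity", "sustainability", "popularity_balance")


def dimension_bias(llm_scores, expert_scores):
    """Mean signed gap (LLM minus expert) per dimension over shared items."""
    shared = [item for item in expert_scores if item in llm_scores]
    return {
        dim: mean(llm_scores[item][dim] - expert_scores[item][dim] for item in shared)
        for dim in DIMENSIONS
    }


def flag_misaligned(bias, threshold=1.0):
    """Dimensions whose absolute mean gap reaches the (hypothetical) threshold."""
    return sorted(dim for dim, gap in bias.items() if abs(gap) >= threshold)


# Toy scores on an assumed 1-5 scale; none of these numbers come from the paper.
expert = {
    "trip_a": {"relevance": 4, "diversity": 3, "sustainability": 2, "popularity_balance": 3},
    "trip_b": {"relevance": 5, "diversity": 4, "sustainability": 3, "popularity_balance": 2},
}
llm_judge = {
    "trip_a": {"relevance": 4, "diversity": 3, "sustainability": 4, "popularity_balance": 3},
    "trip_b": {"relevance": 5, "diversity": 3, "sustainability": 5, "popularity_balance": 3},
}

bias = dimension_bias(llm_judge, expert)
flagged = flag_misaligned(bias)
print(flagged)  # ['sustainability']
```

In this toy case the judge matches the experts on relevance but scores sustainability two points higher on every item, which is the kind of systematic, dimension-specific offset that an aggregate score would partly hide.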

What carries the argument

Three-phase calibration framework that sequences baseline multi-LLM judging, expert human review to locate systematic misalignments, and dimension-specific adjustments using rules and few-shot examples.
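
The third phase, calibration through rules plus few-shot examples, can be pictured as prompt assembly. The sketch below is a hypothetical illustration only: the prompt wording, the 1 to 5 scale, the example rule, and the function name `build_calibrated_prompt` are assumptions made here and are not taken from the prompts the authors released.

```python
def build_calibrated_prompt(dimension, rules, few_shot, trip_list):
    """Assemble a dimension-specific judging prompt from rules and scored examples."""
    lines = [
        f"You are scoring a city-trip list on the dimension: {dimension}.",
        "Give a score from 1 (poor) to 5 (excellent).",  # assumed scale
        "",
        "Rules:",
    ]
    lines += [f"- {rule}" for rule in rules]
    lines += ["", "Examples:"]
    for example in few_shot:
        lines.append(f"List: {', '.join(example['cities'])}")
        lines.append(f"Score: {example['score']} ({example['reason']})")
    lines += ["", f"List: {', '.join(trip_list)}", "Score:"]
    return "\n".join(lines)


# Hypothetical rule and example, standing in for what experts might supply.
prompt = build_calibrated_prompt(
    dimension="sustainability",
    rules=["Do not reward a list only because it avoids famous destinations."],
    few_shot=[{"cities": ["Ghent", "Leuven"], "score": 4, "reason": "hypothetical expert note"}],
    trip_list=["Porto", "Braga", "Coimbra"],
)
print(prompt)
```

Keeping rules and examples as separate inputs per dimension mirrors the framework's point that corrections are targeted: a misalignment found on sustainability changes only the sustainability prompt.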

If this is right

  • Overall ranking agreement between LLM judges can conceal large disagreements on individual dimensions such as sustainability.
  • Dimension-specific calibration increases transparency of LLM reasoning but does not eliminate divergent interpretations of criteria.
  • Model selection or ensemble methods become necessary because each LLM carries its own bias pattern.
  • Releasing prompts and code enables direct testing of the calibration steps on other recommendation domains.
  • Multi-dimensional evaluation requires explicit handling of stakeholder goals rather than reliance on aggregate scores alone.
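
The first bullet above can be reproduced with a small worked example. The sketch below compares two judges using Kendall's tau, once on aggregate scores and once per dimension. It assumes an unweighted mean as the aggregate (the summary does not say how the paper aggregates), uses the simple tau-a variant without tie correction, and all scores are invented for illustration.

```python
from itertools import combinations


def overall(scores):
    """Unweighted mean across dimensions for each trip list (an assumption)."""
    return {trip: sum(dims.values()) / len(dims) for trip, dims in scores.items()}


def kendall_tau(a, b):
    """Kendall's tau-a between two score dicts over the same keys."""
    concordant = discordant = 0
    for x, y in combinations(sorted(a), 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs


def per_dimension_tau(judge_1, judge_2):
    """Kendall's tau-a computed separately for every dimension."""
    dims = next(iter(judge_1.values())).keys()
    return {
        d: kendall_tau({t: judge_1[t][d] for t in judge_1},
                       {t: judge_2[t][d] for t in judge_2})
        for d in dims
    }


# Invented scores: both judges rank A > B > C overall.
judge_1 = {
    "A": {"relevance": 5, "diversity": 4, "sustainability": 5, "popularity_balance": 4},
    "B": {"relevance": 4, "diversity": 3, "sustainability": 3, "popularity_balance": 4},
    "C": {"relevance": 3, "diversity": 2, "sustainability": 1, "popularity_balance": 2},
}
judge_2 = {
    "A": {"relevance": 5, "diversity": 5, "sustainability": 1, "popularity_balance": 5},
    "B": {"relevance": 4, "diversity": 4, "sustainability": 3, "popularity_balance": 3},
    "C": {"relevance": 3, "diversity": 2, "sustainability": 5, "popularity_balance": 1},
}

overall_tau = kendall_tau(overall(judge_1), overall(judge_2))
dimension_tau = per_dimension_tau(judge_1, judge_2)
print(overall_tau)                      # 1.0
print(dimension_tau["sustainability"])  # -1.0
```

Here the two judges agree perfectly on the overall ordering while ranking the same lists in exactly opposite order on sustainability, so reporting only the aggregate agreement would miss the disagreement entirely.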

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same calibration structure could be tested on non-travel domains such as health or policy recommendations where multiple stakeholder criteria must be balanced.
  • If variance persists after calibration, an iterative loop that feeds expert corrections back into prompt refinement might further reduce inconsistencies.
  • Standardized operational definitions for abstract dimensions like sustainability would be needed before LLM judges can achieve high inter-rater reliability.
  • A controlled study comparing the calibrated LLM outputs against large-scale user preference data could quantify whether the remaining divergences affect real-world choice.

Load-bearing premise

Expert humans in the second phase can reliably spot and correct systematic LLM misalignments without introducing their own inconsistent standards or new subjective biases across the four dimensions.

What would settle it

An independent replication in which separate expert panels apply the same calibration steps and obtain low variance plus no detectable model-specific biases in dimension scores would falsify the reported misalignments.

Figures

Figures reproduced from arXiv: 2604.24158 by Adithi Satish, Ashmi Banerjee, Wolfgang Wörndl, Yashar Deldjoo.

Figure 1: Three-phase LLM calibration framework: baseline …
Figure 2: Web-based survey interface used by human experts

Original abstract

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions -- relevance, diversity, sustainability, and popularity balance, and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates LLMs as judges for sustainable city-trip recommendations across four dimensions (relevance, diversity, sustainability, popularity balance) and proposes a three-phase human-in-the-loop calibration framework: (1) baseline multi-LLM judging, (2) expert evaluation to detect systematic misalignments, and (3) dimension-specific calibration with rules and few-shot examples. Across two recommendation settings, it reports model-specific biases and high dimension-level variance even when overall rankings align, with calibration clarifying per-dimension reasoning but exposing divergent sustainability interpretations. Prompts and code are released for reproducibility.

Significance. If the results hold, the work is significant for AI evaluation of nuanced, multi-stakeholder recommendations. It provides concrete evidence of LLM judge limitations in subjective domains like sustainability and offers a practical calibration pipeline. The open release of prompts and code is a clear strength that supports reproducibility and follow-on research.

major comments (1)
  1. [Phase 2 (expert evaluation)] No inter-annotator agreement metrics (Cohen’s kappa, Fleiss’ kappa, or pairwise agreement rates) or explicit dimension rubrics are reported. The central claims, that expert labels reliably surface LLM misalignments and that calibration is effective, depend on these judgments being stable and consistent. Without such metrics, the observed model biases and dimension variances could reflect shifting human standards rather than LLM properties, especially given the paper’s own observation of divergent sustainability interpretations.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the two recommendation settings and the specific LLMs used in the baseline phase to give readers immediate context for the reported biases.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the reliability of Phase 2 expert evaluation. We agree that inter-annotator agreement and explicit rubrics are important for substantiating claims about LLM misalignments, and we will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Phase 2 (expert evaluation): no inter-annotator agreement metrics (Cohen’s kappa, Fleiss’ kappa, or pairwise agreement rates) or explicit dimension rubrics are reported. The central claims, that expert labels reliably surface LLM misalignments and that calibration is effective, depend on these judgments being stable and consistent. Without such metrics, the observed model biases and dimension variances could reflect shifting human standards rather than LLM properties, especially given the paper’s own observation of divergent sustainability interpretations.

    Authors: We acknowledge that the submitted manuscript did not report inter-annotator agreement metrics or reproduce the explicit dimension rubrics provided to experts. This omission weakens the ability to fully rule out human variability as a source of the observed dimension-level variances. We will revise the paper to include appropriate IAA statistics (e.g., Fleiss’ kappa across the multiple experts who performed the Phase 2 annotations) and to append the full dimension-specific rubrics and annotation guidelines. These additions will directly address the concern that the reported model biases and calibration effects might partly reflect inconsistent human standards rather than LLM properties. The paper already notes divergent sustainability interpretations as an outcome of the calibration process; the revised reporting will make clearer how the three-phase framework mitigates such issues while remaining transparent about residual disagreements.

    Revision: yes
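
For readers unfamiliar with the statistic named in this exchange, the sketch below computes Fleiss' kappa from a table of category counts, using the standard formula: observed agreement averaged over items, corrected by the agreement expected from the overall category proportions. The input layout, the three score buckets, and the toy counts are assumptions for illustration and say nothing about the paper's actual annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1:
        raise ValueError("kappa is undefined when every rating falls in one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy table: 3 items, 3 raters, sustainability bucketed as low / medium / high.
toy_counts = [
    [3, 0, 0],  # all three raters say "low"
    [0, 3, 0],  # all three raters say "medium"
    [1, 1, 1],  # complete disagreement
]
kappa = fleiss_kappa(toy_counts)
print(kappa)  # approximately 0.4375
```

Computing the statistic separately for each of the four dimensions would show whether expert disagreement is concentrated in sustainability, which is the scenario the referee's comment is concerned with.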

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

The paper describes an empirical three-phase human-in-the-loop calibration framework for assessing LLM judges across relevance, diversity, sustainability, and popularity balance dimensions. All reported observations (model-specific biases, dimension-level variance, and effects of calibration) derive directly from external human expert annotations and comparative LLM outputs rather than from any internal equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing steps reduce claims to their own inputs by construction; the work is self-contained against external human benchmarks and releases prompts/code for reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from the LLM evaluation literature with no free parameters, no new invented entities, and only domain-level assumptions about the feasibility of human calibration.

axioms (1)
  • Domain assumption: LLMs can serve as initial judges for nuanced recommendation lists when provided with dimension-specific prompts.
    Invoked in the baseline judging phase described in the abstract.

pith-pipeline@v0.9.0 · 5450 in / 1303 out tokens · 67792 ms · 2026-05-08T03:27:57.228997+00:00 · methodology
