pith. machine review for the scientific record.

arxiv: 2604.26679 · v1 · submitted 2026-04-29 · 💻 cs.HC

Recognition: unknown

MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 12:54 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM evaluation · collaborative criteria · LLM-as-a-judge · consensus building · human oversight · evaluation criteria · HCI tool · case study

The pith

MultEval lets multiple evaluators collaboratively develop and align criteria for LLM-as-a-judge systems by surfacing disagreements and tracking revisions with examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-as-a-judge approaches depend on evaluation criteria that capture desired model behaviors, yet these criteria are usually written by one person even though defining them involves multiple stakeholders who hold different values and interpretations. A formative study revealed recurring difficulties in creating shared understanding across expertise levels, reconciling conflicting priorities, and turning subtle human judgments into criteria that LLMs can apply consistently. MultEval responds by giving teams tools to diagnose disagreements through consensus-building methods, refine criteria iteratively while attaching concrete examples and preserving proposal histories, and keep visible records of how judgments become encoded for automation. The approach matters because misaligned criteria can produce inconsistent or biased automated evaluations that fail to reflect the full range of stakeholder concerns. A case study with domain experts showed the system guiding how criteria evolve through coordinated team input.

Core claim

The paper establishes that collaborative creation of LLM evaluation criteria faces distinct human-oversight challenges in establishing shared understanding, aligning values across stakeholders, and translating nuanced judgments into actionable LLM prompts. Drawing on consensus-building theory, the authors built MultEval to let multiple evaluators surface and diagnose disagreements, iteratively revise criteria while attaching examples and maintaining a proposal history, and preserve transparency over how human decisions are encoded into an automated judge. A case study with a team of domain experts illustrated how these coordination features shape the ongoing development of criteria.

What carries the argument

MultEval, a system that supports collaborative criteria development by enabling evaluators to surface disagreements using consensus-building theory, revise criteria iteratively with attached examples and a proposal history, and maintain transparency over how judgments are encoded for automated evaluation.
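The paper excerpt does not publish MultEval's data model, but the sentence above implies a definite record shape: criteria carry weights and live text, proposals preserve authorship and justification, and attached examples travel into the judge prompt. A minimal sketch of that shape in Python; every name here is hypothetical, not MultEval's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Example:
    """A concrete input-output pair attached to a criterion to pin down its meaning."""
    input_text: str
    output_text: str
    verdict: str     # "pass" or "fail" under this criterion
    rationale: str   # why the attaching evaluator judged it so

@dataclass
class Proposal:
    """One evaluator's proposed revision, kept so the reasoning behind changes survives."""
    author: str
    timestamp: datetime
    revised_text: str
    justification: str
    examples: list[Example] = field(default_factory=list)

@dataclass
class Criterion:
    name: str
    current_text: str
    weight: float                          # Figure 2 shows per-criterion weights
    history: list[Proposal] = field(default_factory=list)

    def accept(self, proposal: Proposal) -> None:
        """Accepting a proposal updates the live text but preserves the full trail."""
        self.history.append(proposal)
        self.current_text = proposal.revised_text

    def to_judge_prompt(self) -> str:
        """Encode the criterion, with its attached examples, into an LLM-judge instruction."""
        lines = [f"Criterion '{self.name}': {self.current_text}"]
        for ex in (e for p in self.history for e in p.examples):
            lines.append(f"Example ({ex.verdict}): input={ex.input_text!r} "
                         f"output={ex.output_text!r} because {ex.rationale}")
        return "\n".join(lines)
```

The point of the sketch is only that the three advertised features (examples, history, transparency) reduce to one persistent record per criterion from which the judge prompt is rendered.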

If this is right

  • Teams can identify and resolve value or interpretation conflicts earlier by explicitly surfacing disagreements.
  • Attached examples make criteria more concrete and reduce ambiguity when LLMs apply them.
  • Proposal histories preserve the reasoning behind changes, supporting consistent future revisions.
  • Transparency in encoding helps stakeholders understand and audit how criteria shape automated judgments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations could use similar systems to incorporate wider input when setting evaluation standards, potentially reducing single-perspective bias.
  • The same disagreement-diagnosis and history-tracking approach might apply to other contested design tasks such as defining model safety guidelines.
  • Automated analysis of disagreement patterns over time could surface recurring friction points for teams to address proactively.
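On that last point, disagreement-pattern analysis needs nothing exotic: given a log of per-evaluator verdicts, a tool could rank criteria by how often evaluators split on the same item. A rough illustration, not from the paper (the vote-log format is assumed):

```python
from collections import defaultdict

def friction_report(votes):
    """votes: iterable of (criterion, item_id, evaluator, verdict) tuples,
    verdict in {"pass", "fail"}. Returns per-criterion disagreement rates."""
    by_item = defaultdict(set)
    for criterion, item, _evaluator, verdict in votes:
        by_item[(criterion, item)].add(verdict)
    split, total = defaultdict(int), defaultdict(int)
    for (criterion, _item), verdicts in by_item.items():
        total[criterion] += 1
        if len(verdicts) > 1:          # evaluators did not all agree on this item
            split[criterion] += 1
    return {c: split[c] / total[c] for c in total}

votes = [
    ("clarity", 1, "ana", "pass"), ("clarity", 1, "ben", "fail"),
    ("clarity", 2, "ana", "pass"), ("clarity", 2, "ben", "pass"),
    ("safety",  1, "ana", "fail"), ("safety",  1, "ben", "fail"),
]
print(friction_report(votes))   # {'clarity': 0.5, 'safety': 0.0}
```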

Load-bearing premise

That the collaboration challenges seen in the formative study apply broadly enough to justify a dedicated tool and that the consensus features will produce more aligned or higher-quality criteria than single-author or informal processes.

What would settle it

A controlled comparison in which teams using ordinary shared documents achieve inter-rater agreement, criteria stability, and perceived alignment equal to or better than teams using MultEval would undermine the claim that the system's specific features add value.
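Such a comparison needs a concrete agreement statistic. One standard choice is Fleiss' kappa over each team's pass/fail verdicts; a self-contained sketch, offered as illustration rather than anything the paper computes:

```python
def fleiss_kappa(table):
    """table[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters n."""
    N = len(table)
    n = sum(table[0])
    # Per-item agreement: fraction of rater pairs that agree on item i.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N
    # Chance agreement from the marginal category proportions.
    k = len(table[0])
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Three raters labeling four items pass/fail: columns = [pass, fail].
team_a = [[3, 0], [3, 0], [2, 1], [0, 3]]   # mostly unanimous
team_b = [[2, 1], [1, 2], [2, 1], [1, 2]]   # persistent splits
print(fleiss_kappa(team_a) > fleiss_kappa(team_b))  # True
```

If the shared-documents condition matched or beat the MultEval condition on this statistic (plus criteria stability and perceived alignment), the tool-specific contribution would be in doubt.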

Figures

Figures reproduced from arXiv: 2604.26679 by Annalisa Szymanski, Charles Chiang, Diego Gomez-Zara, Hyo Jin Do, Simret Gebreegziabher, Toby Li, Werner Geyer, Yukun Yang, Zahra Ashktorab.

Figure 1: The initial evaluation screen. Criteria and assertions are shown on the left, while the dataset is shown on the right. view at source ↗

Figure 2: Overview of the MultEval system. A) The global criteria set, showing assertions, authorship, and criteria weight. B) The project’s dataset, with input and output pairs. C) The results of the evaluation: red circles mean a failed criterion, while green means a pass. D) The option to reorder the dataset to show a diverse subset of data first; this feature helps surface disagreements faster. E) The private s… view at source ↗

Figure 3: Especially for evaluators without a technical background, … view at source ↗

Figure 3: Private Sandbox. Users see the original criteria alongside an editable version on the left side panel. Users can … view at source ↗

Figure 4: A proposal as seen by a user with administrative … view at source ↗

Figure 5: The system diagnoses the type of disagreement … view at source ↗

Figure 6: Example output of a data point trace and prompt. view at source ↗

Figure 7: Timeline showing different versions of a criterion, … view at source ↗
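Figure 2's panel D, which reorders the dataset so a diverse subset surfaces first, implies some diversity heuristic, though the captions do not say which. A minimal sketch assuming a greedy farthest-point ordering over item embeddings; the function and metric are stand-ins, not MultEval's method:

```python
def reorder_diverse(items, distance):
    """Greedy farthest-point ordering: each next item is the one farthest
    from everything already shown. 'distance' is any metric over items."""
    remaining = list(items)
    ordered = [remaining.pop(0)]        # seed with the first item
    while remaining:
        nxt = max(remaining,
                  key=lambda x: min(distance(x, y) for y in ordered))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered

# Toy example with 1-D "embeddings": clustered points get pushed apart,
# so evaluators see dissimilar data (and likely disagreements) early.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
print(reorder_diverse(points, lambda a, b: abs(a - b)))
# [0.0, 9.0, 5.0, 0.2, 0.1, 5.1]
```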
Original abstract

LLM-as-a-judge approaches have emerged as a scalable solution for evaluating model behaviors, yet they rely on evaluation criteria often created by a single individual, embedding that person's assumptions, priorities, and interpretive lens. In practice, defining such criteria is a collaborative and contested process involving multiple stakeholders with different values, interpretations, and priorities, an aspect largely unsupported by existing tools. To examine this problem in depth, we present a formative study examining how stakeholders collaboratively create, negotiate, and refine evaluation criteria for LLM-as-a-judge systems. Our findings reveal challenges in human oversight, including difficulties in establishing shared understanding, aligning values across stakeholders with different expertise and priorities, and translating nuanced human judgments into criteria that are interpretable and actionable for LLM judges. Based on these insights, we developed MultEval, a system that supports collaborative criteria development by enabling multiple evaluators to surface and diagnose disagreements using consensus-building theory, iteratively revise criteria with attached examples and proposal history, and maintain transparency over how judgments are encoded into an automated evaluator. We further report a case study in which a team of domain experts used MultEval to collaboratively author criteria, illustrating how coordination and collaborative consensus-making shape criteria evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a formative study of how multiple stakeholders collaboratively define and negotiate evaluation criteria for LLM-as-a-judge systems, surfaces challenges around shared understanding, value alignment, and translation of judgments into actionable criteria, and introduces the MultEval system that applies consensus-building theory to surface disagreements, attach examples to revisions, track proposal history, and expose how criteria are encoded for automated judges. A case study with domain experts illustrates iterative use of these features.

Significance. If the designed features demonstrably improve criteria quality or alignment over ad-hoc processes, the work would address a practical gap in LLM evaluation tooling and contribute design knowledge for collaborative human-AI evaluation workflows. The formative insights on coordination challenges are useful for the HCI and LLM-evaluation communities, but the current evidence base is limited to qualitative illustration.

major comments (3)
  1. [Case Study] The case study (described in the final section) reports iterative criteria evolution and coordination but supplies no outcome metrics (e.g., inter-rater reliability on held-out items, downstream judge accuracy against human gold labels, or criteria stability over time) and no control condition (e.g., ad-hoc collaboration without MultEval). This leaves the central claim that the consensus-building, example-attached revision, and transparency features produce better-aligned criteria untested.
  2. [Formative Study] The formative study section provides no participant count, recruitment details, session structure, or analysis method (e.g., thematic analysis protocol or inter-coder reliability). Without these, it is difficult to evaluate whether the reported challenges in shared understanding and value alignment are robust or generalizable enough to motivate the specific features of MultEval.
  3. [System Description] The abstract and system-description sections assert that MultEval enables evaluators to “surface and diagnose disagreements using consensus-building theory,” yet the manuscript does not detail which specific consensus mechanisms (e.g., Delphi, nominal group technique) are implemented, how they are operationalized in the UI, or any validation that they reduce disagreement more effectively than standard comment threads.
minor comments (2)
  1. [Figures] Figure captions and the system-overview diagram would benefit from explicit labels indicating which UI elements correspond to disagreement surfacing, example attachment, and proposal history.
  2. [Related Work] The related-work section could more clearly distinguish MultEval from existing collaborative annotation platforms (e.g., those supporting multi-annotator disagreement resolution) by highlighting the unique requirement of producing criteria that are directly executable by an LLM judge.
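To make the first major comment concrete: the missing outcome evidence could be as lightweight as agreement between the automated judge's verdicts and held-out human gold labels, broken out per criterion. An illustrative sketch of such a check, assumed rather than drawn from the paper:

```python
def judge_vs_gold(judge, gold):
    """judge, gold: dicts mapping (criterion, item_id) -> 'pass' | 'fail'.
    Returns overall accuracy and a per-criterion breakdown."""
    per = {}
    for key, human in gold.items():
        criterion = key[0]
        hit, n = per.get(criterion, (0, 0))
        per[criterion] = (hit + (judge.get(key) == human), n + 1)
    overall = sum(h for h, _ in per.values()) / sum(n for _, n in per.values())
    return overall, {c: h / n for c, (h, n) in per.items()}

gold  = {("clarity", 1): "pass", ("clarity", 2): "fail", ("safety", 1): "pass"}
judge = {("clarity", 1): "pass", ("clarity", 2): "pass", ("safety", 1): "pass"}
overall, per = judge_vs_gold(judge, gold)
print(overall, per)   # 0.666..., {'clarity': 0.5, 'safety': 1.0}
```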

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the scope and presentation of our work. We respond to each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Case Study] The case study (described in the final section) reports iterative criteria evolution and coordination but supplies no outcome metrics (e.g., inter-rater reliability on held-out items, downstream judge accuracy against human gold labels, or criteria stability over time) and no control condition (e.g., ad-hoc collaboration without MultEval). This leaves the central claim that the consensus-building, example-attached revision, and transparency features produce better-aligned criteria untested.

    Authors: We agree that the case study provides no quantitative outcome metrics or control condition. The manuscript frames the case study as an illustration of how domain experts used MultEval's features to iteratively develop criteria and navigate coordination challenges, rather than as an empirical test of superiority. The abstract and conclusion emphasize demonstration of the process and insights into consensus-making, without claiming measurable improvements in alignment or accuracy. We will revise the case study section and discussion to explicitly articulate this illustrative scope and limitations, ensuring no overstatement of the contribution. revision: partial

  2. Referee: [Formative Study] The formative study section provides no participant count, recruitment details, session structure, or analysis method (e.g., thematic analysis protocol or inter-coder reliability). Without these, it is difficult to evaluate whether the reported challenges in shared understanding and value alignment are robust or generalizable enough to motivate the specific features of MultEval.

    Authors: We acknowledge that the formative study section would benefit from greater methodological transparency. We will revise the manuscript to include the participant count, recruitment details, session structure, and a description of the analysis method employed. This addition will allow readers to better assess the robustness of the identified challenges and their role in motivating MultEval's design. revision: yes

  3. Referee: [System Description] The abstract and system-description sections assert that MultEval enables evaluators to “surface and diagnose disagreements using consensus-building theory,” yet the manuscript does not detail which specific consensus mechanisms (e.g., Delphi, nominal group technique) are implemented, how they are operationalized in the UI, or any validation that they reduce disagreement more effectively than standard comment threads.

    Authors: We agree that the system description would be strengthened by greater specificity on the consensus-building elements. MultEval incorporates features such as disagreement surfacing via structured proposals and example attachments, drawing from consensus-building principles, but we will expand the relevant sections to explicitly name the mechanisms used, describe their UI operationalization (e.g., proposal history views and example linking), and clarify that the case study offers initial qualitative illustration rather than comparative validation. A dedicated validation study comparing against standard threads is beyond the current scope but noted as future work. revision: yes

Circularity Check

0 steps flagged

No circularity: design and empirical case study with no derivations or self-referential reductions

full rationale

The paper describes a formative study identifying challenges in collaborative criteria creation for LLM-as-a-judge systems, followed by the design of MultEval and a single-team case study illustrating its use. No mathematical equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claims rest on qualitative observations and system features rather than any derivation chain that reduces to prior inputs by construction. Self-citations, if present, are not load-bearing for any core result, and the work does not rename known patterns or smuggle ansatzes via citation. The lack of comparative controls in the case study affects evidential strength but introduces no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an HCI design paper; the central contribution rests on the assumption that observed collaboration challenges are addressable by the described features and that the case study demonstrates utility. No free parameters, mathematical axioms, or invented physical entities are involved.

pith-pipeline@v0.9.0 · 5538 in / 1199 out tokens · 66543 ms · 2026-05-07T12:54:48.999811+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1]

    [n. d.]. Secure & reliable LLMs | Promptfoo. https://www.promptfoo.dev/

  2. [2]

    Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena L. Glassman. 2024. ChainForge: A visual toolkit for prompt engineering and LLM hypothesis testing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18

  3. [3]

    Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Hyo Jin Do, and Werner Geyer. 2025. Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences. arXiv:2410.00873 (Aug. 2025). doi:10.48550/a...

  4. [4]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022)

  5. [5]

    Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101

  6. [6]

    Robert O. Briggs, Gwendolyn L. Kolfschoten, and Gert-Jan de Vreede. 2005. Toward a theoretical model of consensus building. AMCIS 2005 Proceedings (2005), 12

  7. [7]

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv:2308.07201 (Aug. 2023). doi:10.48550/arXiv.2308.07201

  8. [8]

    Herbert H. Clark and Susan E. Brennan. 1991. Grounding in communication. (1991)

  9. [9]

    Michael Desmond, Zahra Ashktorab, Qian Pan, Casey Dugan, and James M. Johnson. 2024. EvaluLLM: LLM assisted evaluation of generative outputs. In Companion Proceedings of the 29th International Conference on Intelligent User Interfaces. ACM, Greenville, SC, USA, 30–32. doi:10.1145/3640544.3645216

  10. [10]

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475 (2024)

  11. [11]

    Thomas Erickson and Wendy A. Kellogg. 2000. Social translucence: an approach to designing systems that support social processes. ACM Transactions on Computer-Human Interaction (TOCHI) 7, 1 (2000), 59–83

  12. [12]

    K. J. Kevin Feng, Tzu-Sheng Kuo, Quan Ze (Jim) Chen, Inyoung Cheong, Kenneth Holstein, and Amy X. Zhang. 2026. PolicyPad: Collaborative Prototyping of LLM Policies. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26). Association for Computing Machinery, New York, NY, USA, Article 39, 25 pages. doi:10.1145/3772318.3791689

  13. [13]

    Batya Friedman, Peter H. Kahn Jr., Alan Borning, and Alina Huldtgren. 2013. Value sensitive design and information systems. In Early engagement and new technologies: Opening up the laboratory. Springer, 55–95

  14. [14]

    Iason Gabriel. 2020. Artificial intelligence, values, and alignment. Minds and Machines 30, 3 (2020), 411–437

  15. [15]

    Jie Gao, Yuchen Guo, Gionnieve Lim, Tianqin Zhang, Zheng Zhang, Toby Jia-Jun Li, and Simon Tangi Perrault. 2024. CollabCoder: A Lower-barrier, Rigorous Workflow for Inductive Collaborative Qualitative Analysis with Large Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery...

  16. [16]

    Simret Araya Gebreegziabher, Charles Chiang, Zichu Wang, Zahra Ashktorab, Michelle Brachman, Werner Geyer, Toby Jia-Jun Li, and Diego Gómez-Zará. 2025. MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflow. In Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work. 1–18

  17. [17]

    Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K. Kummerfeld, and Elena L. Glassman. 2024. Supporting Sensemaking of Large Language Model Outputs at Scale. arXiv:2401.13726 (Jan. 2024). doi:10.48550/arXiv.2401.13726

  18. [18–19]

    Diego Gómez-Zará, Mengzi Guo, Leslie A. DeChurch, and Noshir Contractor. 2020. The Impact of Displaying Diversity Information on the Formation of Self-assembling Teams. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–15. doi:10.1145/3313831.3376654

  20. [20]

    Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. 2022. Jury learning: Integrating dissenting voices into machine learning models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

  21. [21]

    Eduardo Graells-Garrido, Mounia Lalmas, and Ricardo Baeza-Yates. 2016. Data Portraits and Intermediary Topics: Encouraging Exploration of Politically Diverse Profiles. In Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, Sonoma, CA, USA, 228–240. doi:10.1145/2856767.2856776

  22. [22]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as-a-Judge. arXiv:2411.15594 (March 2025). doi:10.48550/arXiv.2411.15594

  23. [23]

    Yu-Tang Hsiao, Shu-Yang Lin, Audrey Tang, Darshana Narayanan, and Claudina Sarahe. 2018. vTaiwan: An empirical study of open consultation process in Taiwan. Taiwan: Center for Open Science (2018)

  24. [24]

    Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. 2024. Collective constitutional AI: Aligning a language model with public input. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 1395–1417

  25. [25]

    Karen A. Jehn, Gregory B. Northcraft, and Margaret A. Neale. 1999. Why differences make a difference: A field study of diversity, conflict and performance in workgroups. Administrative Science Quarterly 44, 4 (1999), 741–763

  26. [26–27]

    Tae Soo Kim, Nitesh Goyal, Jeongyeon Kim, Juho Kim, and Sungsoo Ray Hong. 2021. Supporting collaborative sequencing of small groups through visual awareness. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–29

  28. [28]

    Tae Soo Kim, Heechan Lee, Yoonjoo Lee, Joseph Seering, and Juho Kim. 2025. Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions. arXiv:2509.11206 (2025). doi:10.48550/arXiv.2509.11206

  29. [29]

    Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. 2024. EvalLM: Interactive evaluation of large language model prompts on user-defined criteria. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–21

  30. [30]

    Michelle S. Lam, Fred Hohman, Dominik Moritz, Jeffrey P. Bigham, Kenneth Holstein, and Mary Beth Kery. 2025. Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–24

  31. [31]

    Sean A. Munson and Paul Resnick. 2010. Presenting diverse political opinions: how and how much. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1457–1466

  32. [32]

    Bidhan L. Parmar, R. Edward Freeman, Jeffrey S. Harrison, Andrew C. Wicks, Lauren Purnell, and Simone De Colle. 2010. Stakeholder theory: The state of the art. Academy of Management Annals 4, 1 (2010), 403–445

  33. [33]

    Samir Passi and Solon Barocas. 2019. Problem formulation and fairness. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 39–48

  34. [34]

    Shreya Shankar, JD Zamfirescu-Pereira, Björn Hartmann, Aditya Parameswaran, and Ian Arawjo. 2024. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–14

  35. [35]

    Atousa Soltani, Kasun Hewage, Bahareh Reza, and Rehan Sadiq. 2015. Multiple stakeholders in multi-criteria decision-making in the context of municipal solid waste management: a review. Waste Management 35 (2015), 318–328

  36. [36]

    Stax. [n. d.]. Stax - The complete toolkit for AI evaluation. https://stax.withgoogle.com/landing

  37. [37]

    Hariharan Subramonyam, Jane Im, Colleen Seifert, and Eytan Adar. 2022. Solving separation-of-concerns problems in collaborative design of human-AI systems through leaky abstractions. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–21

  38. [38]

    Annalisa Szymanski, Simret Araya Gebreegziabher, Oghenemaro Anuyah, Ronald A. Metoyer, and Toby Jia-Jun Li. 2024. Comparing Criteria Development Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation. arXiv:2410.02054 (Oct. 2024). doi:10.48550/arXiv.2410.02054

  39. [39]

    Michael Terry, Chinmay Kulkarni, Martin Wattenberg, Lucas Dixon, and Meredith Ringel Morris. 2024. Interactive AI Alignment: Specification, Process, and Evaluation Alignment. arXiv:2311.00710 (2024). doi:10.48550/arXiv.2311.00710

  40. [40]

    Michael Williams and Tami Moser. 2019. The art of coding and thematic exploration in qualitative research. International Management Review 15, 1 (2019), 45–55

  41. [41]

    Marianne Wilson, David Brazier, Dimitra Gkatzia, and Peter Robertson. 2024. Participatory Design with Domain Experts: A Delphi Study for a Career Support Chatbot. In ACM Conversational User Interfaces 2024. ACM, Luxembourg, Luxembourg, 1–12. doi:10.1145/3640794.3665534

  42. [42]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623

  43. [43]

    Tan Zhi-Xuan, Micah Carroll, Matija Franklin, and Hal Ashton. 2025. Beyond Preferences in AI Alignment. Philosophical Studies 182, 7 (2025), 1813–1863