arxiv: 2509.11206 · v4 · submitted 2025-09-14 · 💻 cs.HC · cs.AI · cs.CL

Evalet: Evaluating Large Language Models through Functional Fragmentation

Pith reviewed 2026-05-18 16:53 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CL
keywords LLM evaluation · LLM-as-a-Judge · functional fragmentation · evaluation misalignments · interactive visualization · generative AI · user study

The pith

Functional fragmentation breaks LLM evaluations into specific rhetoric functions so users can see exactly what drives the scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces functional fragmentation to dissect generative outputs into key fragments and label the rhetoric functions each fragment performs relative to given evaluation criteria. This is implemented in the Evalet interactive system, which visualizes fragment-level functions across many outputs to support inspection, rating, and comparison. A user study with ten practitioners found that the method enabled identification of 48% more evaluation misalignments than holistic scores alone. The added visibility helps users calibrate their trust in the evaluations and surface more actionable problems in the model outputs. The work therefore advocates moving LLM evaluation away from single numeric scores toward fine-grained qualitative analysis.

Core claim

Functional fragmentation dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. The approach is realized in Evalet, an interactive visualization system that supports inspection, rating, and comparison of evaluations across outputs. A study with ten practitioners showed the method helped identify 48% more misalignments, which supported calibrating trust in LLM evaluations and finding more actionable issues in model outputs.

What carries the argument

Functional fragmentation, the process of dividing outputs into fragments and assigning rhetoric functions relative to evaluation criteria to show alignment or misalignment with user goals.
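As a rough illustration of the unit of analysis described above, the sketch below models one fragment-level function as a small record and tallies ratings per function label across outputs. The class name, field names, and example strings are hypothetical and chosen only for illustration; they are not taken from Evalet's actual schema or prompts.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentFunction:
    """One fragment of an output and the function it serves for a criterion.

    All field names are illustrative assumptions, not Evalet's data model.
    """
    output_id: str   # which model output the fragment came from
    fragment: str    # the extracted text span
    function: str    # label describing what the fragment does
    criterion: str   # evaluation criterion the label is judged against
    positive: bool   # True if the function satisfies the criterion


def summarize(functions):
    """Count ratings per (criterion, function label, rating) across outputs."""
    return Counter((f.criterion, f.function, f.positive) for f in functions)


# Hypothetical records for two outputs judged on a "Clarity" criterion.
examples = [
    FragmentFunction("out-1", "In short, the policy failed.",
                     "states the conclusion upfront", "Clarity", True),
    FragmentFunction("out-1", "As everyone already knows, this is obvious.",
                     "appeals to unsupported consensus", "Clarity", False),
    FragmentFunction("out-2", "To sum up, the experiment worked.",
                     "states the conclusion upfront", "Clarity", True),
]

summary = summarize(examples)
for (criterion, function, positive), count in sorted(summary.items()):
    rating = "satisfies" if positive else "fails"
    print(f"{criterion} | {function} | {rating}: {count}")
```

Grouping by function label rather than by output is what lets a reader check one recurring behavior across many outputs at once, instead of accepting or rejecting a single score per output.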

If this is right

  • Users can validate specific elements of an evaluation instead of accepting or rejecting an overall score.
  • Trust in LLM-as-a-Judge outputs can be adjusted based on which functions align with or contradict the criteria.
  • More actionable issues in model outputs become visible through the fragment-level view.
  • Evaluation practices shift from quantitative scores toward qualitative analysis of model behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fragmentation logic could be applied to evaluate non-text generative outputs such as images or code.
  • Future tools might combine the method with automated labeling to scale beyond manual interpretation.
  • Insights from many such analyses could guide improvements in how prompts instruct LLM judges.

Load-bearing premise

The functional fragmentation process and resulting labels accurately capture the rhetoric functions that matter to users without introducing interpreter bias or missing important context.

What would settle it

A follow-up study with a larger participant pool that measures no meaningful increase in identified misalignments or trust calibration when using functional fragmentation versus holistic scores.
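To make the comparison concrete, here is a minimal sketch of how a relative increase in identified misalignments could be computed from paired per-participant counts, together with an exact sign-flip permutation test. The counts are invented for illustration and are not the paper's data, and the sign-flip test is a stand-in chosen for this sketch; the page does not say which paired test the authors used.

```python
from itertools import product


def relative_increase(baseline_total, treatment_total):
    """Relative change of the treatment total over the baseline total."""
    return (treatment_total - baseline_total) / baseline_total


def paired_sign_flip_p_value(baseline, treatment):
    """Two-sided exact sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign assignment is enumerated.
    """
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        extreme += stat >= observed
        total += 1
    return extreme / total


# Hypothetical misalignment counts for five participants (not study data).
holistic = [2, 3, 1, 4, 2]
fragmented = [3, 4, 2, 5, 4]

increase = relative_increase(sum(holistic), sum(fragmented))
p_value = paired_sign_flip_p_value(holistic, fragmented)
print(f"relative increase: {increase:.0%}, p = {p_value}")
```

With only a handful of participants, even a uniformly positive difference yields a coarse p-value, which is one reason a larger participant pool would be more decisive.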

Figures

Figures reproduced from arXiv: 2509.11206 by Heechan Lee, Joseph Seering, Juho Kim, Tae Soo Kim, Yoonjoo Lee.

Figure 1: Illustration of the functional fragmentation approach supported by Evalet. Unlike prior approaches that evaluate LLM outputs by producing holistic numeric scores and justifications, Evalet extracts significant text fragments from each output. Then, the system interprets and labels the function that each fragment plays in terms of the criterion, and rates whether the function satisfies or fails to meet the …
Figure 2: Evalet consists of two main components: (A) Information Panel and (B) Map Visualization. In the Information Panel, users can use the Tab Navigator (C) to switch between managing their input-output dataset, defining their criteria set, and viewing evaluation details. Users can initiate evaluations by clicking on Run Evaluation (D). The Map Visualization helps users explore all fragment-level functions acros…
Figure 3: In the Database Tab, users can view their dataset of input-output pairs. Each item consists of the input, the output, and …
Figure 4: Users can explore the clusters and fragment-level functions through both the Map Visualization (A) and Explore …
Figure 5: Users can view only the selected fragment-level functions in the …
Figure 6: Comparisons of the main interface components across the study conditions. (A) The …
Figure 7: Comparison of results across conditions for the issues identified for the task LLM’s outputs (left) and LLM evaluations …
Figure 8: Distribution of participants’ ratings for perceived …
Figure 9: Fragment-level functions and their clusters identified through our approach for three types of tasks and criteria: …
Figure 10: In the Database Tab, users can browse through the …
Figure 11: Prompt to fragment and evaluate functions from an output. (1/2)
Figure 12: Prompt to fragment and evaluate functions from an output. (2/2)
Figure 13: Prompt to create base clusters from groups of functions.
Figure 14: Prompt to create super cluster labels for groups of base clusters.
Figure 15: Prompt to deduplicate similar super clusters.
Figure 16: Prompt to reassign base clusters to more relevant super clusters.
Original abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces functional fragmentation, a method that dissects LLM-generated outputs into key fragments and interprets the rhetoric functions each fragment serves relative to evaluation criteria. This approach is instantiated in the Evalet interactive system, which visualizes fragment-level functions to support inspection, rating, and comparison of evaluations. A user study (N=10) reports that the method helped practitioners identify 48% more evaluation misalignments than holistic scoring, enabling better calibration of trust and discovery of actionable issues in model outputs.

Significance. If the empirical findings are robustly supported, the work provides a practical shift from opaque holistic LLM-as-a-Judge scores toward qualitative, fine-grained analysis. The interactive visualization in Evalet represents a concrete HCI contribution for making LLM evaluations more transparent and actionable for practitioners.

major comments (2)
  1. [Abstract] The central claim of a 48% improvement in identified misalignments from the N=10 user study is presented without any description of study design, tasks used, controls, statistical tests, baseline comparisons, or operational definition of 'evaluation misalignments.' This leaves the primary empirical result only partially supported.
  2. [Abstract] The method's effectiveness rests on the premise that functional fragmentation and its rhetoric-function labels accurately surface elements that matter to users and reveal goal fulfillment/hindrance without systematic interpreter bias. No validation of label fidelity (e.g., inter-rater agreement, alignment with independent user goals, or checks against dropped context) is described, raising the possibility that reported gains reflect labeling artifacts rather than genuine insight.
minor comments (1)
  1. [Abstract] The abstract introduces 'functional fragmentation' and 'Evalet' without situating them against prior work on LLM evaluation or qualitative analysis tools.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to better support our empirical claims and methodological premises. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a 48% improvement in identified misalignments from the N=10 user study is presented without any description of study design, tasks used, controls, statistical tests, baseline comparisons, or operational definition of 'evaluation misalignments.' This leaves the primary empirical result only partially supported.

    Authors: We agree that the abstract, constrained by length, omits key study details. Section 5 of the full manuscript describes the within-subjects design with N=10 practitioners, the evaluation tasks (assessing LLM outputs against user-defined criteria), counterbalanced controls, baseline holistic scoring, paired statistical comparisons, and the operational definition of misalignments as user-identified discrepancies not reflected in LLM scores. We will revise the abstract to briefly note the study setup and explicitly direct readers to Section 5 for the complete methodology, thereby strengthening support for the reported result.
    revision: yes

  2. Referee: [Abstract] The method's effectiveness rests on the premise that functional fragmentation and its rhetoric-function labels accurately surface elements that matter to users and reveal goal fulfillment/hindrance without systematic interpreter bias. No validation of label fidelity (e.g., inter-rater agreement, alignment with independent user goals, or checks against dropped context) is described, raising the possibility that reported gains reflect labeling artifacts rather than genuine insight.

    Authors: We acknowledge this concern about potential interpreter bias in the rhetoric-function labels. The manuscript details that labels follow a predefined taxonomy derived from evaluation criteria and were applied by the authors with relevant expertise. However, formal validation metrics such as inter-rater agreement were not reported. In the revised manuscript we will add a dedicated validation subsection reporting results from an independent labeling exercise on a sample of fragments (including Cohen's kappa), alignment checks against user goals elicited in study interviews, and discussion of context preservation in the fragmentation process.
    revision: yes
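The inter-rater agreement statistic named in the second response can be sketched in a few lines. The example below computes Cohen's kappa for two raters who each assign one label per fragment; the label values and the two rating lists are hypothetical and only show the arithmetic, not any result from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four fragments by two independent raters.
rater_1 = ["satisfies", "satisfies", "fails", "fails"]
rater_2 = ["satisfies", "fails", "fails", "fails"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.50"
```

Here the raters agree on three of four fragments (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5. Raw percent agreement alone would overstate reliability when one label dominates.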

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper proposes functional fragmentation as a method for dissecting outputs and interpreting rhetoric functions, then instantiates it in the Evalet system. The central empirical claim—that the approach enabled identification of 48% more evaluation misalignments—is drawn directly from an independent N=10 user study comparing the method against holistic scoring. No equations, fitted parameters, self-citations, or uniqueness theorems are present that would reduce this measured improvement back to the method definition by construction. The result is externally validated through participant observations rather than being tautological with the input assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces a new evaluation method and visualization system. It relies on the domain assumption that holistic LLM scores obscure actionable detail and on standard HCI user-study practices. No free parameters are fitted and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption: LLM-as-a-Judge approaches produce holistic scores that obscure which specific elements influenced the assessments
    Stated as the motivating problem in the second sentence of the abstract.
invented entities (2)
  • functional fragmentation (no independent evidence)
    purpose: Dissecting each output into key fragments and interpreting the rhetoric functions each fragment serves relative to evaluation criteria
    New method proposed to surface elements of interest and reveal alignment with user goals; no independent evidence outside the paper is provided.
  • Evalet (no independent evidence)
    purpose: Interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison
    Concrete instantiation of the fragmentation approach; no external validation data or code release mentioned in abstract.

pith-pipeline@v0.9.0 · 5706 in / 1548 out tokens · 64102 ms · 2026-05-18T16:53:10.868381+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

    cs.HC · 2026-04 · unverdicted · novelty 6.0

    MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...
