Evalet: Evaluating Large Language Models through Functional Fragmentation

Heechan Lee; Joseph Seering; Juho Kim; Tae Soo Kim; Yoonjoo Lee

arxiv: 2509.11206 · v4 · submitted 2025-09-14 · 💻 cs.HC · cs.AI· cs.CL

Evalet: Evaluating Large Language Models through Functional Fragmentation

Tae Soo Kim , Heechan Lee , Yoonjoo Lee , Joseph Seering , Juho Kim This is my paper

Pith reviewed 2026-05-18 16:53 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CL

keywords LLM evaluationLLM-as-a-Judgefunctional fragmentationevaluation misalignmentsinteractive visualizationgenerative AIuser study

0 comments

The pith

Functional fragmentation breaks LLM evaluations into specific rhetoric functions so users can see exactly what drives the scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces functional fragmentation to dissect generative outputs into key fragments and label the rhetoric functions each fragment performs relative to given evaluation criteria. This is implemented in the Evalet interactive system, which visualizes fragment-level functions across many outputs to support inspection, rating, and comparison. A user study with ten practitioners found that the method enabled identification of 48% more evaluation misalignments than holistic scores alone. The added visibility helps users calibrate their trust in the evaluations and surface more actionable problems in the model outputs. The work therefore advocates moving LLM evaluation away from single numeric scores toward fine-grained qualitative analysis.

Core claim

Functional fragmentation dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. The approach is realized in Evalet, an interactive visualization system that supports inspection, rating, and comparison of evaluations across outputs. A study with ten practitioners showed the method helped identify 48% more misalignments, which supported calibrating trust in LLM evaluations and finding more actionable issues in model outputs.

What carries the argument

Functional fragmentation, the process of dividing outputs into fragments and assigning rhetoric functions relative to evaluation criteria to show alignment or misalignment with user goals.

If this is right

Users can validate specific elements of an evaluation instead of accepting or rejecting an overall score.
Trust in LLM-as-a-Judge outputs can be adjusted based on which functions align with or contradict the criteria.
More actionable issues in model outputs become visible through the fragment-level view.
Evaluation practices shift from quantitative scores toward qualitative analysis of model behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fragmentation logic could be applied to evaluate non-text generative outputs such as images or code.
Future tools might combine the method with automated labeling to scale beyond manual interpretation.
Insights from many such analyses could guide improvements in how prompts instruct LLM judges.

Load-bearing premise

The functional fragmentation process and resulting labels accurately capture the rhetoric functions that matter to users without introducing interpreter bias or missing important context.

What would settle it

A follow-up study with a larger participant pool that measures no meaningful increase in identified misalignments or trust calibration when using functional fragmentation versus holistic scores.

Figures

Figures reproduced from arXiv: 2509.11206 by Heechan Lee, Joseph Seering, Juho Kim, Tae Soo Kim, Yoonjoo Lee.

**Figure 1.** Figure 1: Illustration of the functional fragmentation approach supported by Evalet. Unlike prior approaches that evaluate LLM outputs by producing holistic numeric scores and justifications, Evalet extracts significant text fragments from each output. Then, the system interprets and labels the function that each fragment plays in terms of the criterion, and rates whether the function satisfies or fails to meet the … view at source ↗

**Figure 2.** Figure 2: Evalet consists of two main components: (A) Information Panel and (B) Map Visualization. In the Information Panel, users can use the Tab Navigator (C) to switch between managing their input-output dataset, defining their criteria set, and viewing evaluation details. Users can initiate evaluations by clicking on Run Evaluation (D). The Map Visualization helps users explore all fragment-level functions acros… view at source ↗

**Figure 3.** Figure 3: In the Database Tab, users can view their dataset of input-output pairs. Each item consists of the input, the output, and [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Users can explore the clusters and fragment-level functions through both the Map Visualization (A) and Explore [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Users can view only the selected fragment-level functions in the [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Comparisons of the main interface components across the study conditions. (A) The [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of results across conditions for the issues identified for the task LLM’s outputs (left) and LLM evaluations [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Distribution of participants’ ratings for perceived [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Fragment-level functions and their clusters identified through our approach for three types of tasks and criteria: [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: In the Database Tab, users can browse through the [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt to fragment and evaluate functions from an output. (1/2) [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗

**Figure 12.** Figure 12: Prompt to fragment and evaluate functions from an output. (2/2) [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt to create base clusters from groups of functions. [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt to create super cluster labels for groups of base clusters. [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗

**Figure 15.** Figure 15: Prompt to deduplicate similar super clusters. [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗

**Figure 16.** Figure 16: Prompt to reassign base clusters to more relevant super clusters. [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

read the original abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Functional fragmentation and Evalet give a workable way to inspect LLM judgments at the fragment level, though the small study leaves the gains hard to evaluate.

read the letter

The main thing here is that the paper breaks LLM-as-a-judge outputs into functional fragments labeled by their rhetoric roles, then builds Evalet as an interactive viewer so users can spot where the scores come from and compare across examples. This is new in its emphasis on qualitative fragment-level rhetoric analysis rather than just refining overall numbers. The work does a reasonable job of naming the real deployment issue—holistic scores hide the reasons—and turning it into a concrete system that supports inspection and rating. The N=10 study result of 48% more misalignments found is at least a directional signal that the approach can surface issues users care about. The soft spots sit in the empirical side. The abstract gives no information on how the fragments are produced, whether the labeling is manual or automated, what controls were used in the study, or any check on whether the rhetoric labels match actual user goals. With only ten participants and no statistical details, the 48% figure is difficult to interpret or generalize. The stress-test concern about interpreter bias in the fragmentation step holds up on the available information, since nothing is said about label fidelity or inter-rater agreement. This paper is for HCI researchers and practitioners who already use or build LLM evaluators and want more transparency without retraining models. A reader working on evaluation interfaces or trust calibration would pick up usable ideas from the method and the visualization. It shows clear engagement with the practical problems in the literature, so it deserves a serious referee even if the study needs expansion. I would send it out for peer review with a request for more on the fragmentation process and study protocol.

Referee Report

2 major / 1 minor

Summary. The paper introduces functional fragmentation, a method that dissects LLM-generated outputs into key fragments and interprets the rhetoric functions each fragment serves relative to evaluation criteria. This approach is instantiated in the Evalet interactive system, which visualizes fragment-level functions to support inspection, rating, and comparison of evaluations. A user study (N=10) reports that the method helped practitioners identify 48% more evaluation misalignments than holistic scoring, enabling better calibration of trust and discovery of actionable issues in model outputs.

Significance. If the empirical findings are robustly supported, the work provides a practical shift from opaque holistic LLM-as-a-Judge scores toward qualitative, fine-grained analysis. The interactive visualization in Evalet represents a concrete HCI contribution for making LLM evaluations more transparent and actionable for practitioners.

major comments (2)

[Abstract] Abstract: The central claim of a 48% improvement in identified misalignments from the N=10 user study is presented without any description of study design, tasks used, controls, statistical tests, baseline comparisons, or operational definition of 'evaluation misalignments.' This leaves the primary empirical result only partially supported.
[Abstract] Abstract: The method's effectiveness rests on the premise that functional fragmentation and its rhetoric-function labels accurately surface elements that matter to users and reveal goal fulfillment/hindrance without systematic interpreter bias. No validation of label fidelity (e.g., inter-rater agreement, alignment with independent user goals, or checks against dropped context) is described, raising the possibility that reported gains reflect labeling artifacts rather than genuine insight.

minor comments (1)

[Abstract] Abstract: The abstract introduces 'functional fragmentation' and 'Evalet' without situating them against prior work on LLM evaluation or qualitative analysis tools.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to better support our empirical claims and methodological premises. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of a 48% improvement in identified misalignments from the N=10 user study is presented without any description of study design, tasks used, controls, statistical tests, baseline comparisons, or operational definition of 'evaluation misalignments.' This leaves the primary empirical result only partially supported.

Authors: We agree that the abstract, constrained by length, omits key study details. Section 5 of the full manuscript describes the within-subjects design with N=10 practitioners, the evaluation tasks (assessing LLM outputs against user-defined criteria), counterbalanced controls, baseline holistic scoring, paired statistical comparisons, and the operational definition of misalignments as user-identified discrepancies not reflected in LLM scores. We will revise the abstract to briefly note the study setup and explicitly direct readers to Section 5 for the complete methodology, thereby strengthening support for the reported result. revision: yes
Referee: [Abstract] Abstract: The method's effectiveness rests on the premise that functional fragmentation and its rhetoric-function labels accurately surface elements that matter to users and reveal goal fulfillment/hindrance without systematic interpreter bias. No validation of label fidelity (e.g., inter-rater agreement, alignment with independent user goals, or checks against dropped context) is described, raising the possibility that reported gains reflect labeling artifacts rather than genuine insight.

Authors: We acknowledge this concern about potential interpreter bias in the rhetoric-function labels. The manuscript details that labels follow a predefined taxonomy derived from evaluation criteria and were applied by the authors with relevant expertise. However, formal validation metrics such as inter-rater agreement were not reported. In the revised manuscript we will add a dedicated validation subsection reporting results from an independent labeling exercise on a sample of fragments (including Cohen's kappa), alignment checks against user goals elicited in study interviews, and discussion of context preservation in the fragmentation process. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper proposes functional fragmentation as a method for dissecting outputs and interpreting rhetoric functions, then instantiates it in the Evalet system. The central empirical claim—that the approach enabled identification of 48% more evaluation misalignments—is drawn directly from an independent N=10 user study comparing the method against holistic scoring. No equations, fitted parameters, self-citations, or uniqueness theorems are present that would reduce this measured improvement back to the method definition by construction. The result is externally validated through participant observations rather than being tautological with the input assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper introduces a new evaluation method and visualization system. It relies on the domain assumption that holistic LLM scores obscure actionable detail and on standard HCI user-study practices. No free parameters are fitted and no new physical or mathematical entities are postulated.

axioms (1)

domain assumption LLM-as-a-Judge approaches produce holistic scores that obscure which specific elements influenced the assessments
Stated as the motivating problem in the first sentence of the abstract.

invented entities (2)

functional fragmentation no independent evidence
purpose: Dissecting each output into key fragments and interpreting the rhetoric functions each fragment serves relative to evaluation criteria
New method proposed to surface elements of interest and reveal alignment with user goals; no independent evidence outside the paper is provided.
Evalet no independent evidence
purpose: Interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison
Concrete instantiation of the fragmentation approach; no external validation data or code release mentioned in abstract.

pith-pipeline@v0.9.0 · 5706 in / 1548 out tokens · 64102 ms · 2026-05-18T16:53:10.868381+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

functional fragmentation... dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

user study (N=10) found... 48% more evaluation misalignments

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
cs.HC 2026-04 unverdicted novelty 6.0

MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 1 Pith paper · 14 internal anchors

[1]

Adept AI. 2024. Manus: An Agentic Framework for Complex Task Automation. https://www.adept.ai/blog/manus. Accessed: 2025-04-10

work page 2024
[2]

Genspark AI. 2024. Genspark: Agents That Write Code and Explain It. https: //genspark.ai/. Accessed: 2025-04-10

work page 2024
[3]

Meta AI. 2025. Introducing Llama 4: 10 Million Token Context. https://ai.meta. com/llama/. Accessed: 2025-04-10

work page 2025
[4]

Paul André, Aniket Kittur, and Steven P Dow. 2014. Crowd synthesis: Extract- ing categories and clusters from complex data. InProceedings of the 17th ACM conference on Computer supported cooperative work & social computing. 989–998

work page 2014
[5]

Anthropic. 2025. Claude 3.7 Sonnet and Claude Code. https://www.anthropic. com/news/claude-3-7-sonnet Accessed: March 19, 2025

work page 2025
[6]

Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena L Glassman. 2024. Chainforge: A visual toolkit for prompt engineering and llm hypothesis testing. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18

work page 2024
[7]

Zahra Ashktorab, Michael Desmond, Qian Pan, James M Johnson, Martin Santillan Cooper, Elizabeth M Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, and Werner Geyer. 2024. Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences.arXiv preprint arXiv:2410.00873(2024)

work page arXiv 2024
[8]

Griffiths

Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths

work page
[9]

Measuring implicit bias in explicitly unbiased large language models.arXiv preprint arXiv:2402.04105,

Measuring Implicit Bias in Explicitly Unbiased Large Language Models. arXiv:2402.04105 [cs.CY] https://arxiv.org/abs/2402.04105

work page arXiv
[10]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Rishabh Bhardwaj and Soujanya Poria. 2023. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. arXiv:2308.09662 [cs.CL]

work page arXiv 2023
[12]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 2020
[13]

Hong, and Adam Perer

Ángel Alexander Cabrera, Erica Fu, Donald Bertucci, Kenneth Holstein, Ameet Talwalkar, Jason I. Hong, and Adam Perer. 2023. Zeno: An Interactive Framework for Behavioral Evaluation of Machine Learning. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, N...

work page doi:10.1145/3544548.3581268 2023
[14]

Ángel Alexander Cabrera, Marco Tulio Ribeiro, Bongshin Lee, Robert Deline, Adam Perer, and Steven M Drucker. 2023. What did my AI learn? How data scientists make sense of model behavior.ACM Transactions on Computer-Human Interaction30, 1 (2023), 1–27

work page 2023
[15]

Duen Horng Chau, Aniket Kittur, Jason I Hong, and Christos Faloutsos. 2011. Apolo: making sense of large network data by combining rich user interaction and machine learning. InProceedings of the SIGCHI conference on human factors in computing systems. 167–176

work page 2011
[16]

John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush: Sketching stories with generative pretrained language models. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

work page 2022
[17]

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, and Joseph E Gonzalez. 2024. VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models.arXiv preprint arXiv:2410.12851(2024)

work page arXiv 2024
[19]

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire.arXiv preprint arXiv:2302.04166(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Simret Araya Gebreegziabher, Charles Chiang, Zichu Wang, Zahra Ashktorab, Michelle Brachman, Werner Geyer, Toby Jia-Jun Li, and Diego Gómez-Zará

work page
[21]

MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflow. (2025)

work page 2025
[22]

Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K Kummerfeld, and Elena L Glassman. 2024. Supporting sensemaking of large language model outputs at scale. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–21

work page 2024
[23]

Elena L Glassman, Jeremy Scott, Rishabh Singh, Philip J Guo, and Robert C Miller

work page
[24]

OverCode: Visualizing variation in student solutions to programming problems at scale.ACM Transactions on Computer-Human Interaction (TOCHI) 22, 2 (2015), 1–35

work page 2015
[25]

Google. 2024. Gemini Deep Research - your personal research assistant. https: //gemini.google/overview/deep-research/. Accessed: 2025-04-10

work page 2024
[26]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?arXiv preprint arXiv:2404.06654 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card.arXiv preprint arXiv:2412.16720(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Peiling Jiang, Jude Rayan, Steven P Dow, and Haijun Xia. 2023. Graphologue: Exploring large language model responses with interactive diagrams. InProceed- ings of the 36th annual ACM symposium on user interface software and technology. 1–20

work page 2023
[30]

Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, and Lucas Dixon. 2024. LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models.IEEE Transactions on Visualization and Computer Graphics(2024)

work page 2024
[31]

Andrej [@karpathy] Karpathy. 2025. My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now. [...] In absence of great comprehensive evals I tried to turn to vibe checks instead, but I now fear they are misleading and there is too much opportunity for confirmation bias, too low sample size, etc., it’s just n...

work page 2025
[32]

Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible AI: Re-imagining interpretability and explainability using sensemaking theory. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Trans- parency. 702–714

work page 2022
[33]

Minjeong Kim, Kyeongpil Kang, Deokgun Park, Jaegul Choo, and Niklas Elmqvist

work page
[34]

Topiclens: Efficient multi-level visual topic exploration of large-scale document collections.IEEE transactions on visualization and computer graphics 23, 1 (2016), 151–160

work page 2016
[35]

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al . 2023. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations

work page 2023
[36]

Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, et al

work page
[37]

The biggen bench: A principled benchmark for fine-grained evaluation of language models with language models.arXiv preprint arXiv:2406.05761(2024)

work page arXiv 2024
[38]

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, et al

work page
[39]

Scaling Evaluation-time Compute with Reasoning Models as Process Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Tae Soo Kim*, Heechan Lee*, Yoonjoo Lee, Joseph Seering, and Juho Kim Evaluators.arXiv preprint arXiv:2503.19877(2025)

work page internal anchor Pith review arXiv 2018
[40]

Tae Soo Kim, Yoonjoo Lee, Minsuk Chang, and Juho Kim. 2023. Cells, Genera- tors, and Lenses: Design Framework for Object-Oriented Interaction with Large Language Models. InThe 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), October 29-November 1, 2023, San Francisco, CA, USA (San Francisco, CA, USA)(UIST ’23). Association f...

work page 2023
[41]

Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. 2024. Evallm: Interactive evaluation of large language model prompts on user-defined criteria. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–21

work page 2024
[42]

Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the conference on fairness, accountability, and transparency. 29–38

work page 2019
[43]

Michelle S Lam, Fred Hohman, Dominik Moritz, Jeffrey P Bigham, Kenneth Holstein, and Mary Beth Kery. 2024. AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking.arXiv preprint arXiv:2409.18203(2024)

work page arXiv 2024
[44]

Michelle S Lam, Janice Teoh, James A Landay, Jeffrey Heer, and Michael S Bern- stein. 2024. Concept induction: Analyzing unstructured text with high-level concepts using lloom. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–28

work page 2024
[45]

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. 2024. Rewardbench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787(2024)

work page arXiv 2024
[46]

Q Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: informing design practices for explainable AI user experiences. InProceedings of the 2020 CHI conference on human factors in computing systems. 1–15

work page 2020
[47]

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi

work page
[48]

Wildbench: Benchmarking llms with challenging tasks from real users in the wild.arXiv preprint arXiv:2406.04770(2024)

work page arXiv 2024
[49]

What It Wants Me To Say

Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Benjamin Zorn, Jack Williams, Neil Toronto, and Andrew D. Gordon. 2023. “What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code- Generating Large Language Models. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’2...

work page arXiv 2023
[50]

Michael Xieyang Liu, Tongshuang Wu, Tianying Chen, Franklin Mingzhe Li, Aniket Kittur, and Brad A Myers. 2024. Selenite: Scaffolding Online Sensemak- ing with Comprehensive Overviews Elicited from Large Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–26

work page 2024
[51]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Stuart Lloyd. 1982. Least squares quantization in PCM.IEEE transactions on information theory28, 2 (1982), 129–137

work page 1982
[53]

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha

work page
[54]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

James MacQueen. 1967. Some methods for classification and analysis of multivari- ate observations. InProceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Vol. 5. University of California press, 281–298

work page 1967
[56]

Leland McInnes, John Healy, Steve Astels, et al . 2017. hdbscan: Hierarchical density based clustering.J. Open Source Softw.2, 11 (2017), 205

work page 2017
[57]

Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform man- ifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[58]

Aditi Mishra, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, and Chris Bryan. 2023. PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models.arXiv preprint arXiv:2304.01964(2023)

work page arXiv 2023
[59]

Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, and Ashwin Kalyan. 2023. Qualeval: Qualitative evaluation for model improvement.arXiv preprint arXiv:2311.02807(2023)

work page arXiv 2023
[60]

Srishti Palani, Zijian Ding, Austin Nguyen, Andrew Chuang, Stephen MacNeil, and Steven P Dow. 2021. CoNotate: Suggesting queries based on notes promotes knowledge discovery. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14

work page 2021
[61]

Srishti Palani, Yingyi Zhou, Sheldon Zhu, and Steven P Dow. 2022. InterWeave: Presenting Search Suggestions in Context Scaffolds Information Search and Synthesis. InProceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–16

work page 2022
[62]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22

work page 2023
[63]

Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. 2024. Generative agent simulations of 1,000 people.arXiv preprint arXiv:2411.10109 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[64]

Carole C Perlman. 2003. Performance Assessment: Designing Appropriate Per- formance Tasks and Scoring Rubrics. (2003)

work page 2003
[65]

Peter Pirolli and Stuart Card. 2005. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceedings of international conference on intelligence analysis, Vol. 5. McLean, VA, USA, 2–4

work page 2005
[66]

Napol Rachatasumrit, Gonzalo Ramos, Jina Suh, Rachel Ng, and Christopher Meek

work page
[67]

InProceedings of the 26th International Conference on Intelligent User Interfaces

Forsense: Accelerating online research through sensemaking integration and machine research support. InProceedings of the 26th International Conference on Intelligent User Interfaces. 608–618

work page
[68]

Marco Tulio Ribeiro and Scott Lundberg. 2022. Adaptive testing and debugging of NLP models. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3253–3267

work page 2022
[69]

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList.arXiv preprint arXiv:2005.04118(2020)

work page arXiv 2020
[70]

InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23)

Samantha Robertson, Zijie J. Wang, Dominik Moritz, Mary Beth Kery, and Fred Hohman. 2023. Angler: Helping Machine Translation Practitioners Prioritize Model Improvements. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 832, 20 pages. ...

work page doi:10.1145/3544548 2023
[71]

Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, and Shikib Mehri. 2024. Lmunit: Fine-grained evaluation with natural language unit tests.arXiv preprint arXiv:2412.13091(2024)

work page arXiv 2024
[72]

Shreya Shankar, JD Zamfirescu-Pereira, Björn Hartmann, Aditya Parameswaran, and Ian Arawjo. 2024. Who validates the validators? aligning llm-assisted evalu- ation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–14

work page 2024
[73]

Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, et al . 2024. Towards bidirectional human-ai alignment: A systematic review for clarifications, framework, and future directions.arXiv preprint arXiv:2406.09264(2024)

work page arXiv 2024
[74]

Venkatesh Sivaraman, Zexuan Li, and Adam Perer. 2025. Divisi: Interactive Search and Visualization for Scalable Exploratory Subgroup Analysis.arXiv preprint arXiv:2502.10537(2025)

work page arXiv 2025
[75]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test- time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[76]

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. 2025. PaperBench: Evaluating AI’s Ability to Replicate AI Research.arXiv preprint arXiv:2504.01848(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[77]

Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer, Hanspeter Pfister, and Alexander M. Rush. 2023. Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models.IEEE Transactions on Visualization and Computer Graphics29, 1 (2023), 1146–1156. doi:10.1109/TVCG.2022.3209479

work page doi:10.1109/tvcg.2022.3209479 2023
[78]

Sangho Suh, Meng Chen, Bryan Min, Toby Jia-Jun Li, and Haijun Xia. 2024. Lumi- nate: Structured generation and exploration of design space with large language models for human-ai co-creation. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–26

work page 2024
[79]

Sangho Suh, Bryan Min, Srishti Palani, and Haijun Xia. 2023. Sensecape: En- abling multilevel exploration and sensemaking with large language models. In Proceedings of the 36th annual ACM symposium on user interface software and technology. 1–18

work page 2023
[80]

Annalisa Szymanski, Simret Araya Gebreegziabher, Oghenemaro Anuyah, Ronald A Metoyer, and Toby Jia-Jun Li. 2024. Comparing Criteria Develop- ment Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation.arXiv preprint arXiv:2410.02054(2024)

work page arXiv 2024

Showing first 80 references.

[1] [1]

Adept AI. 2024. Manus: An Agentic Framework for Complex Task Automation. https://www.adept.ai/blog/manus. Accessed: 2025-04-10

work page 2024

[2] [2]

Genspark AI. 2024. Genspark: Agents That Write Code and Explain It. https: //genspark.ai/. Accessed: 2025-04-10

work page 2024

[3] [3]

Meta AI. 2025. Introducing Llama 4: 10 Million Token Context. https://ai.meta. com/llama/. Accessed: 2025-04-10

work page 2025

[4] [4]

Paul André, Aniket Kittur, and Steven P Dow. 2014. Crowd synthesis: Extract- ing categories and clusters from complex data. InProceedings of the 17th ACM conference on Computer supported cooperative work & social computing. 989–998

work page 2014

[5] [5]

Anthropic. 2025. Claude 3.7 Sonnet and Claude Code. https://www.anthropic. com/news/claude-3-7-sonnet Accessed: March 19, 2025

work page 2025

[6] [6]

Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena L Glassman. 2024. Chainforge: A visual toolkit for prompt engineering and llm hypothesis testing. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18

work page 2024

[7] [7]

Zahra Ashktorab, Michael Desmond, Qian Pan, James M Johnson, Martin Santillan Cooper, Elizabeth M Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, and Werner Geyer. 2024. Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences.arXiv preprint arXiv:2410.00873(2024)

work page arXiv 2024

[8] [8]

Griffiths

Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths

work page

[9] [9]

Measuring implicit bias in explicitly unbiased large language models.arXiv preprint arXiv:2402.04105,

Measuring Implicit Bias in Explicitly Unbiased Large Language Models. arXiv:2402.04105 [cs.CY] https://arxiv.org/abs/2402.04105

work page arXiv

[10] [10]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Rishabh Bhardwaj and Soujanya Poria. 2023. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. arXiv:2308.09662 [cs.CL]

work page arXiv 2023

[12] [12]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 2020

[13] [13]

Hong, and Adam Perer

Ángel Alexander Cabrera, Erica Fu, Donald Bertucci, Kenneth Holstein, Ameet Talwalkar, Jason I. Hong, and Adam Perer. 2023. Zeno: An Interactive Framework for Behavioral Evaluation of Machine Learning. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, N...

work page doi:10.1145/3544548.3581268 2023

[14] [14]

Ángel Alexander Cabrera, Marco Tulio Ribeiro, Bongshin Lee, Robert Deline, Adam Perer, and Steven M Drucker. 2023. What did my AI learn? How data scientists make sense of model behavior.ACM Transactions on Computer-Human Interaction30, 1 (2023), 1–27

work page 2023

[15] [15]

Duen Horng Chau, Aniket Kittur, Jason I Hong, and Christos Faloutsos. 2011. Apolo: making sense of large network data by combining rich user interaction and machine learning. InProceedings of the SIGCHI conference on human factors in computing systems. 167–176

work page 2011

[16] [16]

John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush: Sketching stories with generative pretrained language models. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

work page 2022

[17] [17]

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, and Joseph E Gonzalez. 2024. VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models.arXiv preprint arXiv:2410.12851(2024)

work page arXiv 2024

[19] [19]

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire.arXiv preprint arXiv:2302.04166(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Simret Araya Gebreegziabher, Charles Chiang, Zichu Wang, Zahra Ashktorab, Michelle Brachman, Werner Geyer, Toby Jia-Jun Li, and Diego Gómez-Zará

work page

[21] [21]

MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflow. (2025)

work page 2025

[22] [22]

Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K Kummerfeld, and Elena L Glassman. 2024. Supporting sensemaking of large language model outputs at scale. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–21

work page 2024

[23] [23]

Elena L Glassman, Jeremy Scott, Rishabh Singh, Philip J Guo, and Robert C Miller

work page

[24] [24]

OverCode: Visualizing variation in student solutions to programming problems at scale.ACM Transactions on Computer-Human Interaction (TOCHI) 22, 2 (2015), 1–35

work page 2015

[25] [25]

Google. 2024. Gemini Deep Research - your personal research assistant. https: //gemini.google/overview/deep-research/. Accessed: 2025-04-10

work page 2024

[26] [26]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?arXiv preprint arXiv:2404.06654 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card.arXiv preprint arXiv:2412.16720(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Peiling Jiang, Jude Rayan, Steven P Dow, and Haijun Xia. 2023. Graphologue: Exploring large language model responses with interactive diagrams. InProceed- ings of the 36th annual ACM symposium on user interface software and technology. 1–20

work page 2023

[30] [30]

Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, and Lucas Dixon. 2024. LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models.IEEE Transactions on Visualization and Computer Graphics(2024)

work page 2024

[31] [31]

Andrej [@karpathy] Karpathy. 2025. My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now. [...] In absence of great comprehensive evals I tried to turn to vibe checks instead, but I now fear they are misleading and there is too much opportunity for confirmation bias, too low sample size, etc., it’s just n...

work page 2025

[32] [32]

Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible AI: Re-imagining interpretability and explainability using sensemaking theory. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Trans- parency. 702–714

work page 2022

[33] [33]

Minjeong Kim, Kyeongpil Kang, Deokgun Park, Jaegul Choo, and Niklas Elmqvist

work page

[34] [34]

Topiclens: Efficient multi-level visual topic exploration of large-scale document collections.IEEE transactions on visualization and computer graphics 23, 1 (2016), 151–160

work page 2016

[35] [35]

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al . 2023. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations

work page 2023

[36] [36]

Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, et al

work page

[37] [37]

The biggen bench: A principled benchmark for fine-grained evaluation of language models with language models.arXiv preprint arXiv:2406.05761(2024)

work page arXiv 2024

[38] [38]

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, et al

work page

[39] [39]

Scaling Evaluation-time Compute with Reasoning Models as Process Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Tae Soo Kim*, Heechan Lee*, Yoonjoo Lee, Joseph Seering, and Juho Kim Evaluators.arXiv preprint arXiv:2503.19877(2025)

work page internal anchor Pith review arXiv 2018

[40] [40]

Tae Soo Kim, Yoonjoo Lee, Minsuk Chang, and Juho Kim. 2023. Cells, Genera- tors, and Lenses: Design Framework for Object-Oriented Interaction with Large Language Models. InThe 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), October 29-November 1, 2023, San Francisco, CA, USA (San Francisco, CA, USA)(UIST ’23). Association f...

work page 2023

[41] [41]

Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, and Juho Kim. 2024. Evallm: Interactive evaluation of large language model prompts on user-defined criteria. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–21

work page 2024

[42] [42]

Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the conference on fairness, accountability, and transparency. 29–38

work page 2019

[43] [43]

Michelle S Lam, Fred Hohman, Dominik Moritz, Jeffrey P Bigham, Kenneth Holstein, and Mary Beth Kery. 2024. AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking.arXiv preprint arXiv:2409.18203(2024)

work page arXiv 2024

[44] [44]

Michelle S Lam, Janice Teoh, James A Landay, Jeffrey Heer, and Michael S Bern- stein. 2024. Concept induction: Analyzing unstructured text with high-level concepts using lloom. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–28

work page 2024

[45] [45]

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. 2024. Rewardbench: Evaluating reward models for language modeling.arXiv preprint arXiv:2403.13787(2024)

work page arXiv 2024

[46] [46]

Q Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: informing design practices for explainable AI user experiences. InProceedings of the 2020 CHI conference on human factors in computing systems. 1–15

work page 2020

[47] [47]

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi

work page

[48] [48]

Wildbench: Benchmarking llms with challenging tasks from real users in the wild.arXiv preprint arXiv:2406.04770(2024)

work page arXiv 2024

[49] [49]

What It Wants Me To Say

Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Benjamin Zorn, Jack Williams, Neil Toronto, and Andrew D. Gordon. 2023. “What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code- Generating Large Language Models. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’2...

work page arXiv 2023

[50] [50]

Michael Xieyang Liu, Tongshuang Wu, Tianying Chen, Franklin Mingzhe Li, Aniket Kittur, and Brad A Myers. 2024. Selenite: Scaffolding Online Sensemak- ing with Comprehensive Overviews Elicited from Large Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–26

work page 2024

[51] [51]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Stuart Lloyd. 1982. Least squares quantization in PCM.IEEE transactions on information theory28, 2 (1982), 129–137

work page 1982

[53] [53]

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha

work page

[54] [54]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

James MacQueen. 1967. Some methods for classification and analysis of multivari- ate observations. InProceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Vol. 5. University of California press, 281–298

work page 1967

[56] [56]

Leland McInnes, John Healy, Steve Astels, et al . 2017. hdbscan: Hierarchical density based clustering.J. Open Source Softw.2, 11 (2017), 205

work page 2017

[57] [57]

Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform man- ifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[58] [58]

Aditi Mishra, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, and Chris Bryan. 2023. PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models.arXiv preprint arXiv:2304.01964(2023)

work page arXiv 2023

[59] [59]

Vishvak Murahari, Ameet Deshpande, Peter Clark, Tanmay Rajpurohit, Ashish Sabharwal, Karthik Narasimhan, and Ashwin Kalyan. 2023. Qualeval: Qualitative evaluation for model improvement.arXiv preprint arXiv:2311.02807(2023)

work page arXiv 2023

[60] [60]

Srishti Palani, Zijian Ding, Austin Nguyen, Andrew Chuang, Stephen MacNeil, and Steven P Dow. 2021. CoNotate: Suggesting queries based on notes promotes knowledge discovery. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14

work page 2021

[61] [61]

Srishti Palani, Yingyi Zhou, Sheldon Zhu, and Steven P Dow. 2022. InterWeave: Presenting Search Suggestions in Context Scaffolds Information Search and Synthesis. InProceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–16

work page 2022

[62] [62]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22

work page 2023

[63] [63]

Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. 2024. Generative agent simulations of 1,000 people.arXiv preprint arXiv:2411.10109 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [64]

Carole C Perlman. 2003. Performance Assessment: Designing Appropriate Per- formance Tasks and Scoring Rubrics. (2003)

work page 2003

[65] [65]

Peter Pirolli and Stuart Card. 2005. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceedings of international conference on intelligence analysis, Vol. 5. McLean, VA, USA, 2–4

work page 2005

[66] [66]

Napol Rachatasumrit, Gonzalo Ramos, Jina Suh, Rachel Ng, and Christopher Meek

work page

[67] [67]

InProceedings of the 26th International Conference on Intelligent User Interfaces

Forsense: Accelerating online research through sensemaking integration and machine research support. InProceedings of the 26th International Conference on Intelligent User Interfaces. 608–618

work page

[68] [68]

Marco Tulio Ribeiro and Scott Lundberg. 2022. Adaptive testing and debugging of NLP models. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3253–3267

work page 2022

[69] [69]

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList.arXiv preprint arXiv:2005.04118(2020)

work page arXiv 2020

[70] [70]

InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23)

Samantha Robertson, Zijie J. Wang, Dominik Moritz, Mary Beth Kery, and Fred Hohman. 2023. Angler: Helping Machine Translation Practitioners Prioritize Model Improvements. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 832, 20 pages. ...

work page doi:10.1145/3544548 2023

[71] [71]

Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, and Shikib Mehri. 2024. Lmunit: Fine-grained evaluation with natural language unit tests.arXiv preprint arXiv:2412.13091(2024)

work page arXiv 2024

[72] [72]

Shreya Shankar, JD Zamfirescu-Pereira, Björn Hartmann, Aditya Parameswaran, and Ian Arawjo. 2024. Who validates the validators? aligning llm-assisted evalu- ation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–14

work page 2024

[73] [73]

Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, et al . 2024. Towards bidirectional human-ai alignment: A systematic review for clarifications, framework, and future directions.arXiv preprint arXiv:2406.09264(2024)

work page arXiv 2024

[74] [74]

Venkatesh Sivaraman, Zexuan Li, and Adam Perer. 2025. Divisi: Interactive Search and Visualization for Scalable Exploratory Subgroup Analysis.arXiv preprint arXiv:2502.10537(2025)

work page arXiv 2025

[75] [75]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test- time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[76] [76]

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, et al. 2025. PaperBench: Evaluating AI’s Ability to Replicate AI Research.arXiv preprint arXiv:2504.01848(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[77] [77]

Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer, Hanspeter Pfister, and Alexander M. Rush. 2023. Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models.IEEE Transactions on Visualization and Computer Graphics29, 1 (2023), 1146–1156. doi:10.1109/TVCG.2022.3209479

work page doi:10.1109/tvcg.2022.3209479 2023

[78] [78]

Sangho Suh, Meng Chen, Bryan Min, Toby Jia-Jun Li, and Haijun Xia. 2024. Lumi- nate: Structured generation and exploration of design space with large language models for human-ai co-creation. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–26

work page 2024

[79] [79]

Sangho Suh, Bryan Min, Srishti Palani, and Haijun Xia. 2023. Sensecape: En- abling multilevel exploration and sensemaking with large language models. In Proceedings of the 36th annual ACM symposium on user interface software and technology. 1–18

work page 2023

[80] [80]

Annalisa Szymanski, Simret Araya Gebreegziabher, Oghenemaro Anuyah, Ronald A Metoyer, and Toby Jia-Jun Li. 2024. Comparing Criteria Develop- ment Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation.arXiv preprint arXiv:2410.02054(2024)

work page arXiv 2024