Evalet: Evaluating Large Language Models through Functional Fragmentation
Pith reviewed 2026-05-18 16:53 UTC · model grok-4.3
The pith
Functional fragmentation breaks LLM evaluations into specific rhetoric functions so users can see exactly what drives the scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Functional fragmentation dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. The approach is realized in Evalet, an interactive visualization system that supports inspection, rating, and comparison of evaluations across outputs. A study with ten practitioners showed the method helped identify 48% more misalignments, which supported calibrating trust in LLM evaluations and finding more actionable issues in model outputs.
What carries the argument
Functional fragmentation, the process of dividing outputs into fragments and assigning rhetoric functions relative to evaluation criteria to show alignment or misalignment with user goals.
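As a rough illustration of what fragment-level records could look like, the sketch below stores each fragment with its rhetoric function and the criterion it is judged against, then groups fragments that share a function across outputs. The schema, field names, and sample fragments are assumptions made for illustration; this summary does not specify Evalet's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One span of a model output plus an interpretation of it (hypothetical schema)."""
    output_id: str
    text: str
    function: str   # rhetoric function label, e.g. "hedges claim"
    criterion: str  # evaluation criterion the function is judged against
    fulfills: bool  # True if the fragment helps the criterion, False if it hinders it


def group_by_function(fragments):
    """Collect fragments that share a (criterion, function) pair across outputs."""
    groups = defaultdict(list)
    for frag in fragments:
        groups[(frag.criterion, frag.function)].append(frag)
    return dict(groups)


def hindering_functions(fragments):
    """Return the sorted set of functions that hinder at least one criterion."""
    return sorted({f.function for f in fragments if not f.fulfills})


# Made-up sample data, not taken from the paper.
fragments = [
    Fragment("out-1", "I may be wrong, but", "hedges claim", "helpfulness", False),
    Fragment("out-1", "Here are three steps:", "structures answer", "helpfulness", True),
    Fragment("out-2", "It is possible that", "hedges claim", "helpfulness", False),
]
groups = group_by_function(fragments)
hindering = hindering_functions(fragments)
```

Grouping by function rather than by output is what lets a reviewer check one recurring behavior (here, hedging) across many outputs instead of re-reading each holistic score.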
If this is right
- Users can validate specific elements of an evaluation instead of accepting or rejecting an overall score.
- Trust in LLM-as-a-Judge outputs can be adjusted based on which functions align with or contradict the criteria.
- More actionable issues in model outputs become visible through the fragment-level view.
- Evaluation practices shift from quantitative scores toward qualitative analysis of model behavior.
Where Pith is reading between the lines
- The same fragmentation logic could be applied to evaluate non-text generative outputs such as images or code.
- Future tools might combine the method with automated labeling to scale beyond manual interpretation.
- Insights from many such analyses could guide improvements in how prompts instruct LLM judges.
Load-bearing premise
The functional fragmentation process and resulting labels accurately capture the rhetoric functions that matter to users without introducing interpreter bias or missing important context.
What would settle it
A follow-up study with a larger participant pool that measures no meaningful increase in identified misalignments or trust calibration when using functional fragmentation versus holistic scores.
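One way such a follow-up comparison could be analyzed is sketched below: a relative increase over pooled counts and an exact paired sign-flip (permutation) test on per-participant differences. The summary does not say how the 48% figure was computed or which statistical test the authors used, so both functions and the placeholder counts are assumptions, not the paper's analysis.

```python
from itertools import product


def relative_increase(baseline, treatment):
    """Relative change in total count from baseline to treatment."""
    return (sum(treatment) - sum(baseline)) / sum(baseline)


def paired_sign_flip_p(baseline, treatment):
    """Two-sided exact p-value from flipping the sign of each paired difference.

    Enumerates all 2**n sign assignments, which is feasible for small n
    (1024 assignments for ten participants).
    """
    diffs = [t - b for b, t in zip(baseline, treatment)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Placeholder counts of misalignments found per participant (not real data).
holistic_counts = [2, 2, 2, 2]
fragment_counts = [3, 3, 3, 3]

increase = relative_increase(holistic_counts, fragment_counts)
p_value = paired_sign_flip_p(holistic_counts, fragment_counts)
```

A result that would count against the claim is a relative increase near zero together with a large p-value in an adequately sized sample.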
Original abstract
Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces functional fragmentation, a method that dissects LLM-generated outputs into key fragments and interprets the rhetoric functions each fragment serves relative to evaluation criteria. This approach is instantiated in the Evalet interactive system, which visualizes fragment-level functions to support inspection, rating, and comparison of evaluations. A user study (N=10) reports that the method helped practitioners identify 48% more evaluation misalignments than holistic scoring, enabling better calibration of trust and discovery of actionable issues in model outputs.
Significance. If the empirical findings are robustly supported, the work provides a practical shift from opaque holistic LLM-as-a-Judge scores toward qualitative, fine-grained analysis. The interactive visualization in Evalet represents a concrete HCI contribution for making LLM evaluations more transparent and actionable for practitioners.
major comments (2)
- [Abstract] The central claim of a 48% improvement in identified misalignments from the N=10 user study is presented without any description of study design, tasks used, controls, statistical tests, baseline comparisons, or operational definition of 'evaluation misalignments.' This leaves the primary empirical result only partially supported.
- [Abstract] The method's effectiveness rests on the premise that functional fragmentation and its rhetoric-function labels accurately surface elements that matter to users and reveal goal fulfillment/hindrance without systematic interpreter bias. No validation of label fidelity (e.g., inter-rater agreement, alignment with independent user goals, or checks against dropped context) is described, raising the possibility that reported gains reflect labeling artifacts rather than genuine insight.
minor comments (1)
- [Abstract] The abstract introduces 'functional fragmentation' and 'Evalet' without situating them against prior work on LLM evaluation or qualitative analysis tools.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify how to better support our empirical claims and methodological premises. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim of a 48% improvement in identified misalignments from the N=10 user study is presented without any description of study design, tasks used, controls, statistical tests, baseline comparisons, or operational definition of 'evaluation misalignments.' This leaves the primary empirical result only partially supported.
  Authors: We agree that the abstract, constrained by length, omits key study details. Section 5 of the full manuscript describes the within-subjects design with N=10 practitioners, the evaluation tasks (assessing LLM outputs against user-defined criteria), counterbalanced controls, baseline holistic scoring, paired statistical comparisons, and the operational definition of misalignments as user-identified discrepancies not reflected in LLM scores. We will revise the abstract to briefly note the study setup and explicitly direct readers to Section 5 for the complete methodology, thereby strengthening support for the reported result.
  Revision: yes
- Referee: [Abstract] The method's effectiveness rests on the premise that functional fragmentation and its rhetoric-function labels accurately surface elements that matter to users and reveal goal fulfillment/hindrance without systematic interpreter bias. No validation of label fidelity (e.g., inter-rater agreement, alignment with independent user goals, or checks against dropped context) is described, raising the possibility that reported gains reflect labeling artifacts rather than genuine insight.
  Authors: We acknowledge this concern about potential interpreter bias in the rhetoric-function labels. The manuscript details that labels follow a predefined taxonomy derived from evaluation criteria and were applied by the authors with relevant expertise. However, formal validation metrics such as inter-rater agreement were not reported. In the revised manuscript we will add a dedicated validation subsection reporting results from an independent labeling exercise on a sample of fragments (including Cohen's kappa), alignment checks against user goals elicited in study interviews, and discussion of context preservation in the fragmentation process.
  Revision: yes
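For readers unfamiliar with the agreement statistic named in the rebuttal, the sketch below computes Cohen's kappa for two raters using the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The rater label lists are made-up placeholders; no labeling data from the paper is reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical function labels assigned to four fragments by two raters.
rater_a = ["hedge", "hedge", "structure", "structure"]
rater_b = ["hedge", "structure", "structure", "structure"]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on three of four fragments (p_o = 0.75) while chance agreement is 0.5, giving a kappa of 0.5.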
Circularity Check
No significant circularity in the derivation chain
The paper proposes functional fragmentation as a method for dissecting outputs and interpreting rhetoric functions, then instantiates it in the Evalet system. The central empirical claim—that the approach enabled identification of 48% more evaluation misalignments—is drawn directly from an independent N=10 user study comparing the method against holistic scoring. No equations, fitted parameters, self-citations, or uniqueness theorems are present that would reduce this measured improvement back to the method definition by construction. The result is externally validated through participant observations rather than being tautological with the input assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-as-a-Judge approaches produce holistic scores that obscure which specific elements influenced the assessments.
invented entities (2)
- functional fragmentation: no independent evidence
- Evalet: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "functional fragmentation... dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "user study (N=10) found... 48% more evaluation misalignments"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
  MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...