pith. machine review for the scientific record.

arxiv: 2509.11206 · v4 · submitted 2025-09-14 · 💻 cs.HC · cs.AI · cs.CL

Recognition: unknown

Evalet: Evaluating Large Language Models through Functional Fragmentation

Authors on Pith: no claims yet
classification 💻 cs.HC · cs.AI · cs.CL
keywords: evaluation · outputs · scores · them · approach · elements · evalet · evaluations
Original abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical function that each fragment serves relative to evaluation criteria -- surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
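The pipeline the abstract describes can be sketched in miniature: split an output into fragments, then label each fragment with the rhetorical function it serves relative to an evaluation criterion. This is a hypothetical illustration, not Evalet's implementation; the `Fragment` type, the sentence-based splitter, and the keyword-rule labeler are all assumptions standing in for the paper's LLM-backed components.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    function: str  # rhetorical function relative to a criterion
    effect: str    # "fulfills" or "hinders" the user's goal

def fragment_output(output: str) -> list[str]:
    # Naive sentence split stands in for the paper's fragmentation step.
    return [s.strip() + "." for s in output.split(".") if s.strip()]

def label_fragment(fragment: str, criterion: str) -> Fragment:
    # Placeholder for an LLM call that would interpret the fragment's
    # rhetorical function; a toy keyword rule is used here instead.
    if "sorry" in fragment.lower():
        return Fragment(fragment, "apology", "hinders")
    return Fragment(fragment, "direct answer", "fulfills")

def evaluate(output: str, criterion: str) -> list[Fragment]:
    # Fragment-level evaluation: one labeled Fragment per fragment,
    # instead of a single holistic score for the whole output.
    return [label_fragment(f, criterion) for f in fragment_output(output)]

report = evaluate("Sorry for the delay. The answer is 42.", "conciseness")
for frag in report:
    print(frag.function, "->", frag.effect)
```

The point of the per-fragment report is exactly what the abstract argues: a practitioner can see *which* element ("apology") hinders the criterion, rather than guessing why a holistic score dropped.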

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

    cs.HC · 2026-04 · unverdicted · novelty 6.0

    MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...