
arxiv: 2604.15508 · v1 · submitted 2026-04-16 · 💻 cs.CY · cs.AI

LLMbench: A Comparative Close Reading Workbench for Large Language Models

Pith reviewed 2026-05-10 09:24 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords LLM · close reading · digital humanities · log probability · hermeneutics · comparative analysis · AI criticism · visualization

The pith

LLMbench visualizes the probability structures of LLM outputs to support close reading and critical analysis in the humanities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLMbench as a browser-based workbench that places two model responses to the same prompt side by side for annotation and comparison. Four overlays reveal token log-probabilities, word-level differences, metadiscourse tone, and sentence structure, while five modes explore stochastic variation, temperature effects, prompt sensitivity, token probabilities, and cross-model divergence. These features treat generated text as one realization from a probability distribution, displaying its counterfactual alternatives through heatmaps, sparklines, pixel maps, and three-dimensional terrains. The author argues that log-probability data, currently underused in humanistic and social-scientific work, is an important resource for interpreting how generative AI models produce text.

Core claim

LLMbench treats LLM-generated text as a research object drawn from a probability distribution and supplies visual and analytical layers that make the distribution legible at the token level. Side-by-side panels with overlays for probabilities, differences, tone, and structure, combined with modes such as Stochastic Variation and Temperature Gradient, allow users to inspect why particular tokens were selected and to trace the alternative histories that were not realized. This setup is presented as a method for bringing hermeneutic close-reading practices to bear on the internal mechanics of large language models.

What carries the argument

The four analytical overlays (Probabilities, Differences, Tone, Structure) and five analytical modes that expose token-level log-probabilities and generation variations, rendering the probabilistic structure of text available for inspection.
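To make the token-level quantities concrete, here is a minimal sketch of how a chosen-token probability and a per-position entropy in bits could be derived from log-probabilities. It is an illustration under stated assumptions, not the tool's implementation: the function name `token_stats` is hypothetical, the inputs are assumed to be natural-log probabilities of the top-k candidates, and the top-k mass is renormalised before the entropy is computed (the source does not say how LLMbench handles the truncated distribution).

```python
import math

def token_stats(chosen_logprob, top_logprobs):
    """Return (chosen probability, entropy in bits) for one token position.

    Assumption: top_logprobs holds natural-log probabilities of the top-k
    candidates, and their mass is renormalised to 1 before computing entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    normalised = [p / total for p in probs]
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    entropy_bits = -sum(p * math.log2(p) for p in normalised if p > 0)
    return math.exp(chosen_logprob), entropy_bits

# Hypothetical position with four equally likely candidates (an open "fork").
uniform = [math.log(0.25)] * 4
fork_p, fork_h = token_stats(math.log(0.25), uniform)

# Hypothetical position where a single candidate carries all the mass.
sure_p, sure_h = token_stats(0.0, [0.0])

print(fork_p, fork_h)
print(sure_p, sure_h)
```

A position with several near-equal candidates yields high entropy, which is the kind of position the heatmaps render as uncertain; a position dominated by one candidate yields entropy near zero.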

If this is right

  • Humanistic scholars can examine the probabilistic origins of specific word choices in AI text rather than treating outputs as fixed artifacts.
  • Comparative study of models becomes possible through direct inspection of where their probability distributions diverge at the token level.
  • Modes such as Temperature Gradient make the effect of sampling parameters visible as changes in the space of possible texts.
  • Generated text is repositioned as one path through a larger distribution, opening questions about why certain alternatives were suppressed.
  • Log-probability data moves from an engineering detail into a primary object for interpretive work in critical AI studies.
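The point about sampling parameters can be illustrated with the standard temperature-scaled softmax. The sketch below is a generic illustration, not code from LLMbench: the logits are hypothetical, and the function names are introduced here for the example. Temperature 0.0, which appears in the tool's Temperature Gradient mode, is conventionally treated as greedy (argmax) decoding and is excluded from the formula.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for three candidate tokens at one position.
logits = [2.0, 1.0, 0.0]
for t in (0.3, 0.7, 1.0, 1.5, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy_bits(probs), 3))
```

Lower temperatures concentrate probability on the leading candidate, and higher temperatures flatten the distribution, which widens the space of texts a run can realise.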

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interfaces could be adapted for classroom settings to teach how models weigh linguistic options during generation.
  • The approach might link to literary studies by framing probability terrains as maps of narrative or stylistic possibility.
  • Developers could apply the same visualizations to audit how training data shapes the likelihoods assigned to different cultural or ideological expressions.
  • Extending the tool to non-English models would test whether the same hermeneutic value holds across languages with different tokenization patterns.

Load-bearing premise

The visual overlays and modes will actually enable users to perform meaningful hermeneutic close readings and generate critical insights rather than remaining at the level of descriptive visualizations.

What would settle it

A controlled study in which digital humanities researchers use the tool on matched prompts but produce no interpretations that incorporate probability information or counterfactual alternatives beyond what surface text reading already provides.

Figures

Figures reproduced from arXiv: 2604.15508 by David M. Berry.

Figure 1: Compare mode with Probs overlay active across both panels. The continuous token heatmap is visible on each panel, ranging from uncoloured (confident) through yellow and orange to deep red (uncertain). The navigation strip is visible below the toolbar, showing Uncertain (399), Forks (174), and Diverge (281) counts. Both panels display Gemini 2.0 Flash and GPT-4o responses to the Calvino prompt, with the Ton…
Figure 2: The probability inspector panel pinned in both Panel A and Panel B. Panel A shows Position 26/399, Entropy 2.315 bits, Chosen 11.76%, with probability distribution bars for the top alternatives. Panel B shows Position 26/287, Entropy 1.567 bits, Chosen 49.27%, with its own distribution. The divergence annotation is visible below each inspector, noting that the two models chose different tokens here. Three …
Figure 3: The Graph band active, showing the entropy curve at the top of the interface. Blue and orange sparklines (Panel A and B respectively) trace per-token entropy across all token positions, with the horizontal axis showing token position up to approximately 398 and the vertical axis measuring bits (0 to 2). Below the curve, both panel heatmaps remain visible with the probability inspector open. The Uncertain a…
Figure 4: The Pixels band active, showing the token pixel map for both panels above the regular text view. Each cell represents one token, coloured by probability using the Heat palette. Panel A (Gemini 2.0 Flash, 398 tokens) and Panel B (GPT-4o, 267 tokens) are displayed side by side, with their different response lengths visible in the different widths of the grids. The colour distribution varies noticeably betwee…
Figure 5: The Net band active, showing two 3D probability skyline meshes side by side, one for each panel. The WebGL terrain is visible with peaks corresponding to high-entropy token positions and flat areas corresponding to confident passages. Floating labels identify the top-5 highest-entropy points on each mesh. The standard text panels with heatmap overlay are visible below. The label bar at top identifies this …
Figure 6: Compare mode with the Diff overlay active. Both panels display the Calvino responses with word-level highlighting. Words unique to each panel are highlighted in the respective panel’s colour. Unique-word counts are visible in each panel header (Panel A: 52 unique; Panel B: 49 unique). The numbered sentence markers from the Struct view are visible in the gutter alongside the diff highlighting. Both panels a…
Figure 7: Compare mode with the Tone overlay active. Both panels display the Calvino responses with Hyland’s metadiscourse categories applied as colour-coded highlights throughout the text. The category count chips are visible at the top of each panel (Hedges, Boosters, Limiting, Attitude, Intensifiers, Self-mentions, Engagement), with numerical counts. The register balance bar at the foot of each panel shows propor…
Figure 8: Stochastic Variation mode showing five runs of the Calvino prompt to Gemini 2.0 Flash. Summary statistics at the top: 386 average words, 54.8% average vocabulary diversity, 42.4% average pairwise overlap, 5/5 runs complete. Three result cards are visible in the main area: Run 1 (442 words, 52% lexical diversity), Run 2 (409 words, 50% lexical diversity), Run 3 (368 words, 56% lexical diversity), with Run 4…
Figure 9: Temperature Gradient mode showing six results for Gemini 2.0 Flash at temperatures 0.0, 0.3, 0.7, 1.0, 1.5, and 2.0. The summary strip at the top shows word count range 393-458, diversity range 48-61%, and 6/6 temperatures complete. Six result cards are arranged in two rows of three. Each card shows temperature value, word count, lexical diversity percentage, and a preview of the response text. The Full T…
Figure 10: Prompt Sensitivity mode showing the base Calvino prompt and four auto-generated variations. The summary strip shows 389 base words, 6/7 variations complete, 37.8% average overlap with base, 6 successful runs. The Base Prompt card is visible at the top (389 words, labelled Base). Below it, four variation cards: Add “Please” (458 words, 35% overlap), Add period (383 words, 33% overlap), Add “Step by step” (…
Figure 11: Token Probabilities standalone mode for Gemini 2.0 Flash responding to the Calvino prompt. The summary bar shows Mean Entropy 0.704, Avg Probability 73.2%, Max Entropy Token “the”, Total Tokens 449. Below, the Entropy Distribution histogram is visible with five confidence bands. The Token Heatmap tab is active, showing the coloured response text with a probability inspector pinned to the right showing Pos…
Figure 12: Cross-Model Divergence mode comparing Gemini 2.0 Flash and GPT-4o on the Calvino prompt. The Divergence Metrics panel shows: Jaccard Similarity 22.8%, Word Overlap 37.1%, Shared Words 65, Unique to A 121, Unique to B 99. Below, per-panel structural metrics: Panel A (Gemini 2.0 Flash) 322 words, 16 sentences, 20.1 avg sent length, 58% vocab diversity; Panel B (GPT-4o) 262 words, 10 sentences, 26.2 avg sent…
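The divergence metrics reported in the Figure 12 caption can be checked with simple set arithmetic. The sketch below is an inference from the reported counts, not a formula stated in the source: the Jaccard figure matches shared / (shared + unique to A + unique to B), and the Word Overlap figure happens to match a Dice-style coefficient, 2 × shared / (|A| + |B|). The function name `overlap_metrics` and the synthetic vocabularies are hypothetical, and how LLMbench tokenises and normalises words is not specified here.

```python
def overlap_metrics(words_a, words_b):
    """Set-based vocabulary comparison between two responses."""
    a, b = set(words_a), set(words_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "unique_a": len(a - b),
        "unique_b": len(b - a),
        # Guard against empty inputs to avoid division by zero.
        "jaccard": len(shared) / len(union) if union else 0.0,
        "dice": 2 * len(shared) / (len(a) + len(b)) if union else 0.0,
    }

# Synthetic vocabularies sized to the counts in the Figure 12 caption:
# 65 shared words, 121 unique to A, 99 unique to B.
vocab_a = [f"shared{i}" for i in range(65)] + [f"a{i}" for i in range(121)]
vocab_b = [f"shared{i}" for i in range(65)] + [f"b{i}" for i in range(99)]

metrics = overlap_metrics(vocab_a, vocab_b)
print(round(metrics["jaccard"] * 100, 1))  # 22.8
print(round(metrics["dice"] * 100, 1))     # 37.1
```

Both values agree with the percentages in the caption, which suggests, without confirming, that the panel uses these or equivalent definitions.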
Original abstract

LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside five analytical modes, Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence, that make the probabilistic structure of generated text legible at the token level. The tool treats the generated text as a research object in its own right from a probability distribution, a text that could have been otherwise, and provides visualisations including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, that show the counterfactual history from which each word emerged. This paper describes the tool's architecture, its six modes, and its design rationale, and argues that log-probability data, currently underused in humanistic and social-scientific readings of AI, is an important resource for a critical studies of generative AI models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LLMbench, a browser-based workbench for comparative close reading of LLM outputs oriented toward digital humanities hermeneutics rather than quantitative metrics. It describes side-by-side annotatable panels for two responses to the same prompt, four overlays (Probabilities for token log-probabilities, Differences for word-level diffs, Tone for Hyland-style metadiscourse, Structure for sentence parsing and connectives), and five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, Cross-Model Divergence) that render probabilistic structure via heatmaps, entropy sparklines, pixel maps, and 3D probability terrains. The manuscript details the architecture and design rationale while arguing that log-probability data, currently underused in humanistic readings, is an important resource for critical studies of generative AI.

Significance. If the overlays and modes demonstrably enable analysts to extract interpretive or critical claims about LLM text (e.g., contingency, voice, or stance) unavailable from surface strings alone, the tool could usefully bridge computational and critical approaches by treating generated text as a counterfactual probability distribution. The manuscript supplies only design rationale and visualizations, however, so any such significance remains prospective rather than established.

major comments (2)
  1. [Description of the five analytical modes and four overlays] The section describing the analytical modes and overlays supplies no worked example in which an analyst applies, for instance, the Probabilities overlay or Token Probabilities mode to derive a specific hermeneutic or critical claim (about ideology, epistemic stance, or textual contingency) that could not be reached from the generated strings themselves. This absence leaves the central assertion that log-probability data is an 'important resource' for critical studies as an untested design hypothesis.
  2. [Overall manuscript (no dedicated evaluation section)] No user study, error analysis, or evaluation data is reported to test whether the visualizations actually support meaningful critical reading practices. The soundness of the claim that the tool 'supports critical studies' therefore rests entirely on the stated design rationale.
minor comments (2)
  1. [Abstract] The abstract states that the paper describes 'its six modes' while explicitly listing only five analytical modes; this internal inconsistency should be resolved.
  2. [Description of the Tone overlay] The Tone overlay is described as implementing 'Hyland-style metadiscourse analysis,' but the manuscript does not detail the underlying classifier, its training data, or any validation against human annotations of LLM text.
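As an illustration of what validation against human annotations could involve, the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, between a tool's labels and a human annotator's labels for the same spans. Everything in the example is hypothetical: the label names loosely echo the Tone overlay's categories, the data are invented for the arithmetic, and nothing here describes how LLMbench's classifier works or has been evaluated.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled independently at their own rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for four text spans.
tool_labels = ["hedge", "booster", "hedge", "none"]
human_labels = ["hedge", "booster", "none", "none"]

kappa = cohens_kappa(tool_labels, human_labels)
print(round(kappa, 3))
```

In this toy case the raw agreement is 3 of 4 spans, but kappa is lower because some of that agreement would be expected by chance.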

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review, which correctly identifies that the manuscript presents LLMbench as a design contribution oriented toward digital humanities close reading rather than quantitative evaluation. We address each major comment below, clarifying the paper's scope while indicating targeted revisions where they strengthen the presentation without altering its core focus.

Point-by-point responses
  1. Referee: The section describing the analytical modes and overlays supplies no worked example in which an analyst applies, for instance, the Probabilities overlay or Token Probabilities mode to derive a specific hermeneutic or critical claim (about ideology, epistemic stance, or textual contingency) that could not be reached from the generated strings themselves. This absence leaves the central assertion that log-probability data is an 'important resource' for critical studies as an untested design hypothesis.

    Authors: We agree that an illustrative example would make the design rationale more concrete. The manuscript's scope is the tool's architecture, overlays, modes, and the argument that log-probability data is currently underused in humanistic readings; it does not present a full case study. In revision we will insert a concise worked example (approximately one paragraph plus figure) showing how the Probabilities overlay on a sample output can surface token-level contingency and alternative phrasings that inform a critical observation about epistemic stance, thereby grounding the claim in a specific instance without claiming broader empirical validation. revision: partial

  2. Referee: No user study, error analysis, or evaluation data is reported to test whether the visualizations actually support meaningful critical reading practices. The soundness of the claim that the tool 'supports critical studies' therefore rests entirely on the stated design rationale.

    Authors: The manuscript is explicitly a tool-description and design-rationale paper, not an empirical evaluation study. We do not report user studies or error analyses because such work lies outside the stated scope; the claim that the tool supports critical studies is advanced as a design hypothesis supported by the rationale that making probabilistic structure legible enables new hermeneutic attention to contingency and counterfactual history. We will revise the introduction and conclusion to state this positioning more explicitly so that readers do not infer empirical validation where none is claimed. revision: no

Circularity Check

0 steps flagged

No circularity: tool-description paper with no derivations or load-bearing self-references

Full rationale

The manuscript is a descriptive account of LLMbench's UI, overlays (Probabilities, Differences, Tone, Structure), and modes (Stochastic Variation, Temperature Gradient, etc.). It contains no equations, no fitted parameters, no predictions, and no self-citations that justify a central claim. The assertion that log-probability data is 'an important resource' is offered as design rationale rather than derived from any prior result by construction. No step reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new entities are introduced; the paper is a software-tool description.


