Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis

Abhishek Purushothama; Brandon Waldon; Junghyun Min; Nathan Schneider

arxiv: 2510.25356 · v3 · submitted 2025-10-29 · 💻 cs.CL

Pith reviewed 2026-05-18 03:04 UTC · model grok-4.3

keywords: LLMs · ordinary meaning · legal interpretation · prompt sensitivity · robustness · human correlation

The pith

Large language models produce inconsistent ordinary meaning interpretations when prompt formats change slightly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs can reliably identify the ordinary meaning of legal texts, a task for which some legal scholars and judges have proposed using such models. Experiments reveal that minor rephrasings of the same question often lead the models to contradictory outputs. The authors also compare model answers against a dataset of human responses to similar legal questions and find at best moderate correlation. These results call into question the assumption that pretraining data alone equips LLMs to handle ordinary meaning analysis in practice.

Core claim

For the models evaluated, slight changes to the format of a question can lead to wildly different conclusions about a text's ordinary meaning, creating a vulnerability that parties with an interest in the outcome could exploit. When compared to human responses on similar legal interpretation questions, the models show at best moderate correlation with human judgments, which falls short for applications with significant stakes.

What carries the argument

Controlled experiments that vary the format of legal interpretation prompts while holding the underlying question fixed, paired with correlation measurements against a human ordinary-meaning dataset.
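A minimal sketch of how this kind of format sensitivity could be quantified, assuming each model response is reduced to a probability of answering "yes" to a fixed interpretive question. The item names, variant names, numbers, and the 0.5 decision threshold are all invented for illustration; they are not the paper's data or its metric.

```python
from statistics import mean

# Hypothetical data: P("yes") for each question under each prompt format.
# Every name and number here is invented for illustration.
yes_prob = {
    "item_a": {"v1": 0.91, "v2": 0.12, "v3": 0.88},
    "item_b": {"v1": 0.70, "v2": 0.65, "v3": 0.74},
    "item_c": {"v1": 0.45, "v2": 0.58, "v3": 0.30},
}


def verdict(p, threshold=0.5):
    """Collapse a probability into a yes/no conclusion (threshold is an assumption)."""
    return "yes" if p >= threshold else "no"


def item_report(probs):
    """Summarize one question: probability spread and whether the verdict flips."""
    values = list(probs.values())
    verdicts = {verdict(p) for p in values}
    return {
        "spread": max(values) - min(values),  # range of P("yes") across formats
        "flips": len(verdicts) > 1,           # True if formats disagree on yes/no
    }


def flip_rate(data):
    """Fraction of questions whose yes/no conclusion depends on the prompt format."""
    return mean(1.0 if item_report(p)["flips"] else 0.0 for p in data.values())


for name, probs in yes_prob.items():
    print(name, item_report(probs))
print("flip rate:", flip_rate(yes_prob))
```

The spread captures how far the probability moves across formats, while the flip rate captures the outcome that matters in a legal setting: whether the yes/no conclusion itself changes.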

If this is right

Legal practitioners cannot treat current LLMs as authoritative sources for ordinary meaning without additional safeguards.
Interested parties could influence outcomes by choosing particular prompt wordings.
Pretraining scale by itself does not produce the consistency required for legal interpretation tasks.
Any LLM use in this domain needs explicit robustness checks before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt-sensitivity pattern may affect other interpretive tasks such as contract or regulatory analysis.
Standardized prompt templates or verification steps could reduce some of the observed inconsistency.
Hybrid human-plus-model workflows might be needed until models demonstrate greater stability.

Load-bearing premise

The specific prompt variations and legal questions tested here reflect the robustness challenges that would appear when LLMs are actually used by judges or scholars for ordinary meaning analysis.

What would settle it

A test in which legal experts apply LLMs to real cases using multiple equivalent prompt formats and consistently obtain the same ordinary-meaning conclusion across formats would challenge the central robustness claim.

Original abstract

In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an 'ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this assumption and raise serious questions about the utility of these models in practice. For the models in our evaluation, slight changes to the format of a question can lead to wildly different conclusions -- a vulnerability that parties with an interest in the outcome could exploit. Comparing with a dataset where people were asked similar legal interpretation questions, we see that these models are at best moderately correlated to human judgments -- not strong enough given the stakes in this domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs are unsuitable for ordinary meaning analysis in legal interpretation despite large-scale pretraining. Through controlled experiments, it shows that minor changes to prompt format produce substantially divergent outputs on legal questions, creating exploitable vulnerabilities, and that model outputs correlate only moderately with human judgments on comparable tasks.

Significance. If the empirical findings hold, the work would caution against the emerging practice of using LLMs for textualist analysis in law, highlighting risks of inconsistency and misalignment with ordinary speaker intuitions. It contributes concrete evidence to debates at the intersection of NLP and legal scholarship on the limits of pretrained models for high-stakes interpretive tasks.

major comments (2)

[Introduction and Experiments] The central robustness claim depends on the tested prompt variations serving as a proxy for real judicial and scholarly LLM use. The manuscript does not demonstrate that the minimal format changes examined match the richer contextual, chain-of-thought, or domain-specific prompting strategies documented in the legal literature cited in the introduction; without this link, the observed divergence may not generalize beyond the experimental setup.
[Results and Comparison with Human Data] Human correlation results: the abstract states 'at best moderately correlated' with a human dataset, but the methods and results sections provide insufficient detail on dataset size, exact correlation metric (e.g., Pearson vs. agreement rate), statistical controls, and whether prompt variants were selected post-hoc. This weakens the load-bearing claim that the correlation is 'not strong enough given the stakes.'
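The distinction drawn in the second comment between Pearson correlation and agreement rate matters because the two can diverge sharply on the same data. The sketch below is illustrative only: the five data points are invented, the 0.5 threshold is an assumption, and nothing here reproduces the paper's analysis. It relies on the standard fact that, for a simple linear fit, R² equals the square of Pearson's r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def agreement_rate(xs, ys, threshold=0.5):
    """Share of items where both sources land on the same side of the threshold."""
    same = sum((x >= threshold) == (y >= threshold) for x, y in zip(xs, ys))
    return same / len(xs)


# Invented example: model P("yes") vs. fraction of human respondents saying "yes".
model = [0.90, 0.80, 0.60, 0.55, 0.52]
human = [0.95, 0.60, 0.90, 0.70, 0.51]

r = pearson_r(model, human)
print("Pearson r:", round(r, 3))
print("R^2 (simple linear fit):", round(r * r, 3))
print("Agreement rate:", agreement_rate(model, human))
```

In this toy data every item falls on the same side of the threshold for both sources, so agreement is perfect, yet the correlation is well below 1. A report that gives only one of the two numbers can therefore leave a very different impression, which is why naming the exact metric matters.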

minor comments (2)

[Methods] Clarify the exact legal questions and prompt templates used in the controlled experiments so readers can assess ecological validity.
[Discussion] Add a limitations subsection explicitly addressing the scope of prompt variations tested versus broader real-world usage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and valuable suggestions. We believe the revisions we have made in response to your comments have improved the manuscript's clarity and rigor, particularly regarding the connection to legal prompting practices and the transparency of our empirical analyses. We respond to each major comment in turn below.

Referee: The central robustness claim depends on the tested prompt variations serving as a proxy for real judicial and scholarly LLM use. The manuscript does not demonstrate that the minimal format changes examined match the richer contextual, chain-of-thought, or domain-specific prompting strategies documented in the legal literature cited in the introduction; without this link, the observed divergence may not generalize beyond the experimental setup.

Authors: We agree that establishing a clearer connection between our experimental prompt variations and real-world prompting practices in legal scholarship is important for the generalizability of our robustness findings. Our experiments were designed to isolate the impact of basic format changes, which we argue are relevant because they represent the kind of minor variations that could occur in practice. To address this, we have revised the manuscript to include a direct comparison in the Introduction section, referencing specific prompting approaches from the cited legal literature and explaining how our minimal changes relate to them. We also added a limitations paragraph acknowledging that more complex prompting strategies might reduce but not eliminate the observed inconsistencies. This revision maintains our central claim while better situating it within existing legal AI research.

revision: partial

Referee: Human correlation results: the abstract states 'at best moderately correlated' with a human dataset, but the methods and results sections provide insufficient detail on dataset size, exact correlation metric (e.g., Pearson vs. agreement rate), statistical controls, and whether prompt variants were selected post-hoc. This weakens the load-bearing claim that the correlation is 'not strong enough given the stakes.'

Authors: We appreciate the referee pointing out the need for greater transparency in our human correlation analysis. We recognize that additional details were warranted to fully support our claims. In the revised manuscript, we have expanded the Methods section to report the dataset size, specify the exact correlation metric (Pearson's correlation), describe the statistical controls employed, and confirm that the prompt variants were pre-specified rather than selected post-hoc. Corresponding updates have been made to the Results section, including additional tables for full transparency. These revisions strengthen the evidentiary basis for concluding that the observed correlations are not sufficiently strong for high-stakes legal applications.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation grounded in external human dataset

This is an empirical study that runs controlled experiments on LLMs using specific legal interpretation questions and prompt format variations, then directly compares model outputs to an external dataset of human judgments on similar questions. The central claims about prompt sensitivity and moderate correlation rest on these experimental results and the external benchmark rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain reduces to its own inputs by construction; the paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is empirical and relies on standard assumptions about prompting LLMs and the representativeness of the chosen legal questions and human dataset; no free parameters, invented entities, or ad-hoc axioms are evident from the abstract.

axioms (1)

domain assumption: LLMs can be meaningfully evaluated for ordinary meaning via controlled prompting experiments on legal texts.
The entire evaluation framework rests on this premise that prompt-based testing reveals the models' suitability for judicial ordinary-meaning analysis.

pith-pipeline@v0.9.0 · 5761 in / 1168 out tokens · 26481 ms · 2026-05-18T03:04:09.590505+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We examine first-token continuation probabilities... across 9 systematic question variants... correlation... R² value greater than 0.5"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Models show a ubiquitous lack of consistency across question variants"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Speaking of Language: Reflections on Metalanguage Research in NLP
cs.CL · 2026-04 · unverdicted · novelty 3.0

This reflection paper highlights metalanguage in NLP, links it to LLMs, and lists understudied future directions.