Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis
Pith reviewed 2026-05-18 03:04 UTC · model grok-4.3
The pith
Large language models produce inconsistent ordinary meaning interpretations when prompt formats change slightly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the models evaluated, slight changes to the format of a question can lead to wildly different conclusions about a text's ordinary meaning, creating a vulnerability that parties with an interest in the outcome could exploit. When compared to human responses on similar legal interpretation questions, the models show at best moderate correlation with human judgments, which falls short for applications with significant stakes.
What carries the argument
Controlled experiments that vary the format of legal interpretation prompts while holding the underlying question fixed, paired with correlation measurements against a human ordinary-meaning dataset.
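A rough way to picture the robustness half of this measurement: hold one question fixed, record the model's probability of answering "Yes" under each prompt format, and look at how far the answers move. The sketch below is illustrative only. The variant names, the probabilities, and the 0.5 decision threshold are all assumptions, not values from the paper.

```python
from statistics import mean

# Hypothetical P("Yes") values for ONE fixed question under several prompt
# formats. These numbers are invented for illustration, not taken from the paper.
yes_prob_by_variant = {
    "question_first": 0.82,
    "options_reversed": 0.41,
    "agree_disagree": 0.67,
}


def sensitivity_report(probs, threshold=0.5):
    """Summarize how much a model's answer moves across prompt variants."""
    values = list(probs.values())
    # Turn each probability into a yes/no decision at the assumed threshold.
    decisions = {name: p >= threshold for name, p in probs.items()}
    return {
        "mean": mean(values),
        "spread": max(values) - min(values),
        # True when at least two variants land on opposite sides of the threshold.
        "decision_flips": len(set(decisions.values())) > 1,
    }


report = sensitivity_report(yes_prob_by_variant)
print(report)
```

A large spread, or any decision flip, on a question whose substance did not change is the kind of instability the core claim describes.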
If this is right
- Legal practitioners cannot treat current LLMs as authoritative sources for ordinary meaning without additional safeguards.
- Interested parties could influence outcomes by choosing particular prompt wordings.
- Pretraining scale by itself does not produce the consistency required for legal interpretation tasks.
- Any LLM use in this domain needs explicit robustness checks before deployment.
Where Pith is reading between the lines
- The same prompt-sensitivity pattern may affect other interpretive tasks such as contract or regulatory analysis.
- Standardized prompt templates or verification steps could reduce some of the observed inconsistency.
- Hybrid human-plus-model workflows might be needed until models demonstrate greater stability.
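The template and verification ideas in the list above could be combined into a simple gate: ask the same question through several fixed templates and accept the answer only when every template agrees, otherwise hand the question to a human. This is a minimal sketch under stated assumptions. `TEMPLATES`, `consistent_answer`, and `stub_model` are hypothetical names, the templates are not the paper's prompts, and the bicycle/vehicle pair is only an illustrative example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical templates; the paper's actual prompt formats are not reproduced here.
TEMPLATES = [
    "Is a {item} a {category}? Answer Yes or No.",
    "Answer Yes or No: is a {item} a {category}?",
    'Would an ordinary speaker call a {item} a "{category}"? Answer Yes or No.',
]


def consistent_answer(
    ask_model: Callable[[str], str],
    item: str,
    category: str,
    templates: Sequence[str] = TEMPLATES,
) -> Optional[str]:
    """Return the shared answer if all templates agree, else None (escalate)."""
    answers = {
        ask_model(t.format(item=item, category=category)).strip().lower()
        for t in templates
    }
    return answers.pop() if len(answers) == 1 else None


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call: its answer depends on the prompt's
    # format, mimicking the instability reported for the evaluated models.
    return "Yes" if prompt.startswith("Is") else "No"


result = consistent_answer(stub_model, "bicycle", "vehicle")
print("escalate to human review" if result is None else result)
```

Agreement across templates does not show the answer is right; it only filters out the cases where formatting alone changes the conclusion.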
Load-bearing premise
The specific prompt variations and legal questions tested here reflect the robustness challenges that would appear when LLMs are actually used by judges or scholars for ordinary meaning analysis.
What would settle it
A test in which legal experts apply LLMs to real cases using multiple equivalent prompt formats and consistently obtain the same ordinary-meaning conclusion across formats would challenge the central robustness claim.
Original abstract
In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an 'ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this assumption and raise serious questions about the utility of these models in practice. For the models in our evaluation, slight changes to the format of a question can lead to wildly different conclusions -- a vulnerability that parties with an interest in the outcome could exploit. Comparing with a dataset where people were asked similar legal interpretation questions, we see that these models are at best moderately correlated to human judgments -- not strong enough given the stakes in this domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs are unsuitable for ordinary meaning analysis in legal interpretation despite large-scale pretraining. Through controlled experiments, it shows that minor changes to prompt format produce substantially divergent outputs on legal questions, creating exploitable vulnerabilities, and that model outputs correlate only moderately with human judgments on comparable tasks.
Significance. If the empirical findings hold, the work would caution against the emerging practice of using LLMs for textualist analysis in law, highlighting risks of inconsistency and misalignment with ordinary speaker intuitions. It contributes concrete evidence to debates at the intersection of NLP and legal scholarship on the limits of pretrained models for high-stakes interpretive tasks.
major comments (2)
- [Introduction and Experiments] The central robustness claim depends on the tested prompt variations serving as a proxy for real judicial and scholarly LLM use. The manuscript does not demonstrate that the minimal format changes examined match the richer contextual, chain-of-thought, or domain-specific prompting strategies documented in the legal literature cited in the introduction; without this link, the observed divergence may not generalize beyond the experimental setup.
- [Results and Comparison with Human Data] Human correlation results: the abstract states 'at best moderately correlated' with a human dataset, but the methods and results sections provide insufficient detail on dataset size, exact correlation metric (e.g., Pearson vs. agreement rate), statistical controls, and whether prompt variants were selected post-hoc. This weakens the load-bearing claim that the correlation is 'not strong enough given the stakes.'
minor comments (2)
- [Methods] Clarify the exact legal questions and prompt templates used in the controlled experiments so readers can assess ecological validity.
- [Discussion] Add a limitations subsection explicitly addressing the scope of prompt variations tested versus broader real-world usage.
Simulated Author's Rebuttal
Thank you for your thorough review and valuable suggestions. We believe the revisions we have made in response to your comments have improved the manuscript's clarity and rigor, particularly regarding the connection to legal prompting practices and the transparency of our empirical analyses. We respond to each major comment in turn below.
Point-by-point responses
- Referee: The central robustness claim depends on the tested prompt variations serving as a proxy for real judicial and scholarly LLM use. The manuscript does not demonstrate that the minimal format changes examined match the richer contextual, chain-of-thought, or domain-specific prompting strategies documented in the legal literature cited in the introduction; without this link, the observed divergence may not generalize beyond the experimental setup.
  Authors: We agree that establishing a clearer connection between our experimental prompt variations and real-world prompting practices in legal scholarship is important for the generalizability of our robustness findings. Our experiments were designed to isolate the impact of basic format changes, which we argue are relevant because they represent the kind of minor variations that could occur in practice. To address this, we have revised the manuscript to include a direct comparison in the Introduction section, referencing specific prompting approaches from the cited legal literature and explaining how our minimal changes relate to them. We also added a limitations paragraph acknowledging that more complex prompting strategies might reduce but not eliminate the observed inconsistencies. This revision maintains our central claim while better situating it within existing legal AI research.
  Revision: partial
- Referee: Human correlation results: the abstract states 'at best moderately correlated' with a human dataset, but the methods and results sections provide insufficient detail on dataset size, exact correlation metric (e.g., Pearson vs. agreement rate), statistical controls, and whether prompt variants were selected post-hoc. This weakens the load-bearing claim that the correlation is 'not strong enough given the stakes.'
  Authors: We appreciate the referee pointing out the need for greater transparency in our human correlation analysis. We recognize that additional details were warranted to fully support our claims. In the revised manuscript, we have expanded the Methods section to report the dataset size, specify the exact correlation metric (Pearson's correlation), describe the statistical controls employed, and confirm that the prompt variants were pre-specified rather than selected post-hoc. Corresponding updates have been made to the Results section, including additional tables for full transparency. These revisions strengthen the evidentiary basis for concluding that the observed correlations are not sufficiently strong for high-stakes legal applications.
  Revision: yes
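The distinction the referee draws between Pearson correlation and an agreement rate matters because the two can tell different stories about the same data. The sketch below computes both on made-up numbers. The `human` and `model` lists, the function names, and the 0.5 threshold are assumptions for illustration and do not come from the paper or its dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def agreement_rate(xs, ys, threshold=0.5):
    """Fraction of items where both sides fall on the same side of threshold."""
    same = sum((x >= threshold) == (y >= threshold) for x, y in zip(xs, ys))
    return same / len(xs)


# Invented per-question "Yes" rates: share of human respondents answering Yes,
# and the model's P("Yes") for the same questions.
human = [0.90, 0.80, 0.55, 0.45, 0.20, 0.10]
model = [0.60, 0.58, 0.40, 0.52, 0.45, 0.42]

r = pearson_r(human, model)
print(f"Pearson r      = {r:.2f}")
print(f"r squared      = {r * r:.2f}")
print(f"agreement rate = {agreement_rate(human, model):.2f}")
```

On these invented numbers the model tracks the human ordering fairly well yet disagrees on the yes/no outcome for two of six questions, which is why the choice of metric should be reported. For a simple linear fit with one predictor, R² equals r squared, so an R² above 0.5 corresponds to |r| above roughly 0.71.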
Circularity Check
No circularity: empirical evaluation grounded in external human dataset
This is an empirical study that runs controlled experiments on LLMs using specific legal interpretation questions and prompt format variations, then directly compares model outputs to an external dataset of human judgments on similar questions. The central claims about prompt sensitivity and moderate correlation rest on these experimental results and the external benchmark rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain reduces to its own inputs by construction; the paper is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be meaningfully evaluated for ordinary meaning via controlled prompting experiments on legal texts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Paper passage: We examine first-token continuation probabilities... across 9 systematic question variants... correlation... R² value greater than 0.5
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Paper passage: Models show a ubiquitous lack of consistency across question variants
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Speaking of Language: Reflections on Metalanguage Research in NLP
  This reflection paper highlights metalanguage in NLP, links it to LLMs, and lists understudied future directions.