The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

Andrew Hong; Jason Potteiger; Luis E. Zapata

arxiv: 2604.19645 · v1 · submitted 2026-04-21 · 💻 cs.CL

Pith reviewed 2026-05-10 03:03 UTC · model grok-4.3

keywords LLM prediction accuracy · open-ended survey text · experience ratings · prompt customization · model selection · prediction ceiling · MLB fan surveys · text-based measurement limits

The pith

The signal in open-ended survey text sets the ceiling on LLM-predicted fan experience ratings, far outweighing effects from prompt or model choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how much prompt design and model selection can improve LLM predictions of experience ratings from open-ended MLB fan survey text. It finds that prompt customization adds only two percentage points to within-one-point accuracy while model swaps yield no reliable gain or even losses, yet accuracy differences across texts exceed those from engineering choices by more than an order of magnitude. The authors identify a split ceiling: one part from model bias in reading text that prompts can correct, and a larger part from the gap between what fans write and the ratings they actually give, which no configuration change can close. A sympathetic reader cares because this shows that for subjective rating tasks from text, the input's information content is the binding constraint rather than the AI tools applied to it.

Core claim

Across capable configurations on approximately 10,000 post-game surveys from five MLB teams, accuracy in predicting fan-reported experience ratings within one point varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1, from 67% to 69%. Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text.
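The within +/-1 agreement figure quoted above can be made concrete with a minimal sketch. It assumes integer ratings on a shared scale (the scale itself is not stated here), and the function name and toy ratings are hypothetical, not taken from the paper.

```python
def within_one_agreement(predicted, reported, tolerance=1):
    """Share of surveys where |predicted - reported| <= tolerance."""
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    if not predicted:
        raise ValueError("need at least one rating pair")
    hits = sum(1 for p, r in zip(predicted, reported) if abs(p - r) <= tolerance)
    return hits / len(predicted)


# Hypothetical toy ratings, invented for illustration (not data from the paper).
toy_predicted = [5, 4, 2, 5, 3]
toy_reported = [5, 2, 3, 4, 5]
toy_agreement = within_one_agreement(toy_predicted, toy_reported)
print(f"within +/-1 agreement: {toy_agreement:.0%}")
```

Because the tolerance band treats a one-point miss the same as an exact hit, this metric is more forgiving than exact match, which is worth keeping in mind when reading the 67% and 69% figures.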

What carries the argument

A dual ceiling on prediction accuracy: one part is a prompt-correctable bias in how the model reads text, the other an irreducible information gap between what fans write and the rating they decide on. On this account, linguistic variation in the input dominates the engineering levers in determining performance.
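One rough way to picture the "order of magnitude" comparison is to set the accuracy spread across configurations against the spread across text groups. The sketch below uses the configuration accuracies reported in the abstract, but the per-text-group accuracies are hypothetical placeholders, and using a simple range (max minus min) is an assumption; the paper may measure variation differently. Treating "capable configurations" as excluding GPT 4.1-mini is also an assumption.

```python
def spread_pp(accuracies):
    """Range (max - min) of a collection of accuracies, in percentage points."""
    values = list(accuracies)
    if not values:
        raise ValueError("need at least one accuracy value")
    return max(values) - min(values)


# Within-one-point agreement (%) as reported in the abstract.
# Assumption: GPT 4.1-mini (61%) is left out of the "capable" set.
config_accuracy = {
    "gpt-4.1 baseline prompt": 67,
    "gpt-4.1 customized prompt": 69,
    "gpt-5.2 customized prompt": 67,
}

# HYPOTHETICAL per-text-group accuracies, invented for illustration only.
text_group_accuracy = {"group A": 85, "group B": 70, "group C": 55}

config_spread = spread_pp(config_accuracy.values())
text_spread = spread_pp(text_group_accuracy.values())
ratio = text_spread / config_spread
print(f"config spread: {config_spread} pp, text spread: {text_spread} pp, ratio: {ratio:.1f}x")
```

With a two-point spread across configurations, any text-group spread above twenty points would satisfy an order-of-magnitude claim under this particular definition.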

If this is right

Prompt customization specifically addresses and mitigates biases in how LLMs interpret survey text.
Model selection alone does not overcome the limits set by the text's linguistic character.
The bulk of prediction error arises from information not present in the open-ended responses.
Further refinements to prompts or models will yield smaller gains once the text-based ceiling is approached.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Survey creators might improve predictions by designing open-ended questions that better capture the factors behind fans' ratings.
Similar text-ceiling effects could limit LLM use in other feedback analysis tasks, suggesting a need to assess information completeness in responses.
If linguistic character is key, automated systems could prioritize certain text types for human oversight based on predicted reliability.

Load-bearing premise

That the large accuracy differences attributed to the linguistic character of the text are not confounded by unmeasured factors such as survey length, fan demographics, or game-specific context.

What would settle it

An experiment that balances or controls for text length, respondent demographics, and contextual variables across different linguistic text types, then checks whether accuracy variation remains dominated by linguistic features rather than prompt or model choices.
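A simple version of such a check is stratification: compute agreement separately for each text type within each length bin, then see whether the text-type gap survives inside every bin. The sketch below is an illustration under stated assumptions; the field names, the "specific" and "vague" labels, and the toy records are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def stratified_agreement(records, tolerance=1):
    """Within-tolerance agreement for each (length_bin, text_type) cell."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [hits, total]
    for rec in records:
        key = (rec["length_bin"], rec["text_type"])
        hit = abs(rec["predicted"] - rec["reported"]) <= tolerance
        cells[key][0] += int(hit)
        cells[key][1] += 1
    return {key: hits / total for key, (hits, total) in cells.items()}


# Hypothetical toy records, invented for illustration only.
toy_records = [
    {"length_bin": "short", "text_type": "specific", "predicted": 5, "reported": 5},
    {"length_bin": "short", "text_type": "specific", "predicted": 4, "reported": 2},
    {"length_bin": "short", "text_type": "vague", "predicted": 3, "reported": 5},
    {"length_bin": "long", "text_type": "specific", "predicted": 4, "reported": 4},
    {"length_bin": "long", "text_type": "vague", "predicted": 2, "reported": 3},
    {"length_bin": "long", "text_type": "vague", "predicted": 5, "reported": 2},
]

by_cell = stratified_agreement(toy_records)
for cell, rate in sorted(by_cell.items()):
    print(cell, f"{rate:.0%}")
```

If the gap between text types persists within each length bin, length alone is unlikely to explain it. The same grouping key could be extended to team or game outcome, though small cells would need wider error bars.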

Original abstract

An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that "prompt engineering helps a little" but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Text characteristics drive most of the accuracy variation in these LLM rating predictions, with prompt tweaks helping only a small fixable slice and model swaps often doing nothing or hurting.

The main thing to know is that this paper finds prompt customization and model choice have small, inconsistent effects on predicting fan experience ratings from open-ended MLB survey text, while differences across the texts themselves account for more than ten times as much of the accuracy spread. On roughly 10,000 surveys, a baseline GPT 4.1 prompt reaches 67% within-one-point agreement; customization lifts that modestly to 69%, and swapping models drops it back to 67% (GPT 5.2) or down to 61% (GPT 4.1-mini).

They frame the ceiling as two parts: a bias in how the model reads text, which prompts can address, and a gap between what fans write and what they rate, which no prompt or model can fix because the information is absent from the text. Prompt work moves the first part; model swaps move neither reliably. That quantified split and the order-of-magnitude dominance of text features are the new pieces relative to their prior result.

The empirical setup on real post-game data is clean enough to be useful for anyone doing applied LLM text-to-rating work in sports or customer feedback. It gives practical numbers on where engineering effort pays off and where it hits a wall.

The soft spot is the attribution of the large text-driven variation to intrinsic linguistic character. The abstract and stress-test note give no sign of regression controls, matching, or stratification on response length, demographics, team, or game outcome, so correlated factors could be doing some of the work. The two-part ceiling reading is post-hoc and not directly tested. Details on exact prompts, statistical tests, and error bars are also thin in what is visible.

This is for practitioners who need realistic bounds on LLM survey tools rather than for theorists. It is incremental, but the dataset size and direct comparison make it worth a referee's time to check the controls and tighten the interpretation.

Referee Report

2 major / 3 minor

Summary. This paper extends prior work showing GPT-4.1 predicts fan experience ratings from open-ended MLB post-game survey text within one point 67% of the time. On ~10,000 surveys from five teams, it compares a baseline prompt against a customized version crossed with GPT-4.1, 4.1-mini, and 5.2. Prompt customization yields a ~2pp gain to 69% within-one-point agreement; model swaps produce mixed or worse results. Accuracy variation across texts by linguistic character exceeds prompt/model effects by more than an order of magnitude. The authors decompose the performance ceiling into a prompt-correctable model bias component and an irreducible component arising from information absent from the text.

Significance. If the central attribution holds, the work is significant for delineating practical limits of LLM-based inference from natural language in social measurement. The sizable dataset and direct head-to-head comparison of prompt versus model levers provide concrete empirical grounding, while the two-part ceiling framing usefully separates engineering-reachable gains from fundamental data constraints. This has implications for automated survey coding and AI-assisted experience measurement.

major comments (2)

[Results on variation by linguistic character of the text] The central claim that accuracy varies more than an order of magnitude more by linguistic character of the text than by prompt or model choice (Abstract) lacks demonstrated controls for confounders such as survey length, fan demographics, team, or game outcome. No regression controls, matching, or stratification on these factors are reported, so it remains unclear whether differences reflect intrinsic text properties rather than correlated unmeasured variables. This directly undermines the two-part ceiling decomposition into prompt-fixable bias versus irreducible missing information.
[Evaluation metrics and abstract] The within-one-point agreement metric underpins all reported comparisons and the ceiling interpretation, yet no inter-rater reliability baseline for the human ratings or alternative metrics (e.g., MSE, exact match, or ordinal error) are provided. Without these, it is difficult to verify that the metric adequately captures prediction quality or to rule out that the observed patterns are metric-specific.
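The alternative metrics named in the second comment can be computed side by side from the same prediction pairs. The following is a minimal sketch, assuming integer ratings on a shared scale; the function name and toy ratings are hypothetical and not drawn from the paper.

```python
def rating_metrics(predicted, reported):
    """Exact match, within-one agreement, MAE, MSE, and mean signed error."""
    if len(predicted) != len(reported) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - r for p, r in zip(predicted, reported)]
    n = len(errors)
    return {
        "exact_match": sum(e == 0 for e in errors) / n,
        "within_one": sum(abs(e) <= 1 for e in errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "mse": sum(e * e for e in errors) / n,
        # Positive values mean the model rates higher than fans do on average.
        "mean_signed_error": sum(errors) / n,
    }


# Hypothetical toy ratings, invented for illustration only.
toy_metrics = rating_metrics([5, 4, 2, 5, 3], [5, 2, 3, 4, 5])
for name, value in toy_metrics.items():
    print(f"{name}: {value:.2f}")
```

The mean signed error is included because a systematic over- or under-rating is the kind of reading bias the paper says prompt design can correct, whereas MAE and MSE would not distinguish bias from scatter.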

minor comments (3)

[Methods] Exact prompt texts for the baseline and customized versions are not reproduced, limiting reproducibility of the reported two-percentage-point gain.
[Abstract] The abstract states 'approximately 10,000' surveys but omits the precise sample size, data-splitting procedure, and any statistical tests or error bars supporting the configuration comparisons.
[Discussion] The interpretive split of the ceiling into two parts is presented as post-hoc; a direct test or additional analysis would strengthen the distinction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the robustness of our claims regarding the relative impacts of prompt design, model choice, and input text properties on LLM-based rating prediction. We address each major comment below and describe the revisions we will undertake.

Referee: [Results on variation by linguistic character of the text] The central claim that accuracy varies more than an order of magnitude more by linguistic character of the text than by prompt or model choice (Abstract) lacks demonstrated controls for confounders such as survey length, fan demographics, team, or game outcome. No regression controls, matching, or stratification on these factors are reported, so it remains unclear whether differences reflect intrinsic text properties rather than correlated unmeasured variables. This directly undermines the two-part ceiling decomposition into prompt-fixable bias versus irreducible missing information.

Authors: We agree that explicit controls for potential confounders would strengthen the attribution of performance variation to linguistic text properties. Our analysis shows that accuracy differences across texts grouped by linguistic character substantially exceed the modest effects from prompt customization or model selection. To isolate these effects, we will add multivariate regression models in the revised manuscript that control for survey length, team, game outcome, and available fan demographic variables. These controls will test whether the large text-driven variation persists and will support the two-part ceiling decomposition by clarifying the portion attributable to information absent from the text versus other factors.
Revision: yes
Referee: [Evaluation metrics and abstract] The within-one-point agreement metric underpins all reported comparisons and the ceiling interpretation, yet no inter-rater reliability baseline for the human ratings or alternative metrics (e.g., MSE, exact match, or ordinal error) are provided. Without these, it is difficult to verify that the metric adequately captures prediction quality or to rule out that the observed patterns are metric-specific.

Authors: We recognize that supplementary metrics and reliability information would improve verification of the primary results. The within-one-point agreement was selected due to its alignment with practical utility in experience measurement contexts. In the revision, we will add alternative metrics including mean squared error, exact match rates, and distributions of ordinal errors to confirm that key patterns hold across measures. For inter-rater reliability, the ratings originate from direct fan self-reports rather than multi-rater coding, so a traditional IRR baseline is not applicable; we will explicitly discuss this aspect of the ground truth and any related survey administration details.
Revision: yes

Circularity Check

0 steps flagged

Minor self-citation to prior baseline; empirical comparisons are self-contained with no derivation chain

The paper conducts direct empirical comparisons of prompt and model variants on held-out survey data, computing within-one-point agreement metrics against human ratings. No equations, fitted parameters, or first-principles derivations are present; all reported effects (e.g., prompt adding ~2pp, model swaps degrading performance, linguistic character dominating variation) are computed from the new experimental runs. The single self-citation to the authors' 2026 prior work merely establishes the unoptimized baseline for context and is not load-bearing for the relative-impact or ceiling claims, which rest on independent data splits and measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study relying on standard statistical comparison of accuracies; no free parameters fitted to derive a result, no domain axioms beyond ordinary assumptions about survey data, and no new entities introduced.

pith-pipeline@v0.9.0 · 5595 in / 1374 out tokens · 66638 ms · 2026-05-10T03:03:14.639785+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

GPT as a Measurement Tool

Asirvatham, A., Mokski, M., & Shleifer, A. (2025). "GPT as a Measurement Tool." NBER Working Paper.
Baumeister, R.F., Bratslavsky, E., Finkenauer, C., & Vohs, K.D. (2001). "Bad Is Stronger Than Good." Review of General Psychology, 5(4), 323–370.
Barrie, C., & Törnberg, P. (2024). "Prompt Stability Scoring for Text Annotation with Large Language Models." a...

work page doi:10.1002/smj.70023 2025
[2]

Data Annotation with Large Language Models: Lessons from Political Science

Yang, E., et al. (2025). "Data Annotation with Large Language Models: Lessons from Political Science." Working paper.
Appendix A: Prompt function schema. The analysis in this study was conducted using the Dimension Labs language data platform (dimensionlabs.io). The complete JSON schema used to generate predicted ratings is reproduced below. Prompt Functio...

work page 2025
