Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions
Pith reviewed 2026-05-22 08:21 UTC · model grok-4.3
The pith
Language models compress ten vague intensity words into five distinct numeric outputs in allocation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 6,620 runs, Claude Haiku maps ten intensity words to only five distinct median numeric allocations. When the current system state is included in the prompt, grouping by starting allocation captures far more rank-based variance than grouping by intensity word (epsilon-squared 0.782 vs. 0.079). Near the upper limit the model hedges with weak words, abstains with most strong words, and pushes to the ceiling with the word drastically. These patterns hold at both T=0.0 and T=0.7.
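To make the "ten words, five medians" idea concrete, here is a minimal sketch of how such compression could be counted. The allocations below are invented for illustration and are not the paper's data; only "slightly" and "drastically" are named in the source, so the other word labels are placeholders.

```python
from statistics import median

# Hypothetical allocations per intensity word (illustrative values only).
runs = {
    "slightly":    [5, 5, 5, 6],
    "word_2":      [5, 5, 5, 5],
    "word_3":      [5, 5, 4, 5],
    "word_4":      [10, 10, 10, 12],
    "drastically": [30, 30, 28, 30],
}

def median_by_word(runs):
    """Return the median allocation for each word."""
    return {word: median(values) for word, values in runs.items()}

def count_distinct_medians(runs):
    """Count how many distinct median values the words collapse into."""
    return len(set(median_by_word(runs).values()))

medians = median_by_word(runs)
print(medians)
print(count_distinct_medians(runs))
```

In this toy data the five words collapse into three distinct medians, which is the same kind of many-to-few mapping the paper reports at a larger scale.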
What carries the argument
A fixed scale of ten researcher-chosen English degree modifiers inserted into otherwise identical natural-language instructions to an LLM, whose numeric outputs are then executed by a deterministic backend in a resource-allocation scenario. This isolates lexical effects from state effects.
If this is right
- The model treats four lower-tier words as numerically equivalent.
- Lexical distinctions between intensity words largely disappear when the system is near capacity.
- State information in the prompt dominates word choice in determining numeric output.
- Stochastic sampling widens the spread but does not restore full ordinal separation among the words.
Where Pith is reading between the lines
- Similar compression of vague quantifiers may occur in other domains where models must output numbers, such as pricing or scheduling.
- Replacing intensity words with explicit percentages in prompts might reduce unwanted state dependence.
- Designers of user-facing AI tools may need to map user vague language to fixed numeric ranges rather than passing the words directly.
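The last suggestion above, mapping vague user language to fixed numeric ranges before it reaches the model, could look roughly like the sketch below. The lookup table, the percentages, and the function name `resolve_adjustment` are all hypothetical; the paper does not propose a specific mapping.

```python
# Hypothetical lookup table: each vague word maps to a fixed step, expressed
# as a percentage of total capacity. Words and values are illustrative only.
WORD_TO_PERCENT = {
    "slightly": 5,
    "somewhat": 10,
    "drastically": 50,
}

def resolve_adjustment(word, current, capacity):
    """Turn a vague word into an explicit new allocation, clamped to capacity."""
    if word not in WORD_TO_PERCENT:
        raise KeyError(f"no numeric mapping for {word!r}")
    # The step depends only on the word and the capacity, not on the current state.
    step = capacity * WORD_TO_PERCENT[word] / 100
    return min(current + step, capacity)

print(resolve_adjustment("slightly", current=40, capacity=100))
print(resolve_adjustment("drastically", current=80, capacity=100))
```

Because the step size is fixed per word, the only state dependence left is the explicit clamp at capacity, which is easier to audit than a model's implicit behavior near the ceiling.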
Load-bearing premise
The chosen scale of ten degree modifiers together with the specific resource-allocation task and deterministic backend are sufficient to reveal general properties of how models interpret intensity words rather than reflecting only this task or this model.
What would settle it
The claim would be falsified if repeating the experiment with a different model, or in a different numeric domain such as time or money allocation, produced ten clearly separated median values with no collapse and no state dominance.
Original abstract
Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
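The abstract compares two Kruskal-Wallis effect sizes but this page does not show how they are computed. The sketch below assumes one common definition, epsilon-squared = H / (n - 1), where H is the tie-corrected Kruskal-Wallis statistic and n is the total number of observations; the paper's exact estimator is not stated here. The toy data are invented so that grouping by state separates the values perfectly while grouping by word does not separate them at all.

```python
from collections import Counter

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 * total / (n * (n + 1)) - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:  # every observation identical: no rank variance to explain
        return 0.0
    return h / correction

def epsilon_squared(groups):
    """Assumed effect size: H divided by (n - 1)."""
    n = sum(len(g) for g in groups)
    return kruskal_wallis_h(groups) / (n - 1)

# The same nine hypothetical outputs, grouped two different ways.
by_state = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
by_word = [[1, 5, 9], [2, 6, 7], [3, 4, 8]]
print(epsilon_squared(by_state), epsilon_squared(by_word))
```

Running the same outputs through both groupings mirrors the paper's design of separate tests for state and for word, where a large gap between the two values indicates which factor organizes the ranks.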
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically measures how Claude Haiku interprets a researcher-constructed scale of 10 English degree modifiers (slightly to drastically) when generating numeric resource allocations in a controlled, deterministic backend environment. Across 6,620 runs at two temperatures, it reports that the model compresses the 10 words into 5 distinct median outputs (Spearman rho = 0.845), that starting state explains substantially more rank variance than word choice (Kruskal-Wallis epsilon-squared 0.782 vs. 0.079), and that near capacity limits the model exhibits three distinct behavioral modes (hedging, abstention, or ceiling push). These patterns are claimed to persist under stochastic sampling.
Significance. If the scoped findings hold, the work supplies concrete evidence that current LLMs map vague intensity language to numeric actions in a compressed, state-dependent, and boundary-discontinuous manner. This has direct relevance for prompt design in decision-support and resource-management applications. The large run count, non-parametric statistics, and explicit controls for temperature and starting state constitute reproducible empirical strengths.
Major comments (2)
- [Results (Kruskal-Wallis analysis)] The central claim that state dominates word (epsilon-squared 0.782 vs. 0.079) is load-bearing for the 'state-dependent' conclusion, yet the manuscript does not report the exact allocation distributions or per-state medians for the highest starting allocations; without these, it is difficult to verify that lexical differentiation truly collapses to zero rather than being masked by ceiling effects.
- [Methods (intensity word selection)] The researcher-constructed scale of ten words is presented as informed by Quirk et al., but the manuscript does not include a validation step (e.g., human ordinal ranking or pilot data) showing that these particular ten items are representative of the broader intensity lexicon; this choice directly affects the compression finding and limits generalization claims.
Minor comments (2)
- [Methods] The exact natural-language prompts and the deterministic backend conversion rule should be reproduced verbatim in an appendix to allow independent replication of the isolation between word and state effects.
- [Figures] Figure captions and axis labels for the median-allocation plots should explicitly state the number of runs per condition and whether error bars represent inter-quartile range or standard error.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recommending minor revision. We address each major comment below, indicating the revisions we will make to the manuscript.
- Referee: [Results (Kruskal-Wallis analysis)] The central claim that state dominates word (epsilon-squared 0.782 vs. 0.079) is load-bearing for the 'state-dependent' conclusion, yet the manuscript does not report the exact allocation distributions or per-state medians for the highest starting allocations; without these, it is difficult to verify that lexical differentiation truly collapses to zero rather than being masked by ceiling effects.
  Authors: We agree that the current manuscript reports the aggregate Kruskal-Wallis statistics but does not provide the per-state medians or full distributions specifically for the highest starting allocations. This detail would strengthen the claim that lexical differentiation collapses rather than being an artifact of ceiling effects. In the revised manuscript we will add a supplementary table (or figure) that reports median allocations, interquartile ranges, and sample sizes broken down by starting state, with particular attention to states at or above 70% of capacity. These statistics are already available from our existing analysis pipeline and will be included to allow direct verification of the collapse.
  Revision: yes
- Referee: [Methods (intensity word selection)] The researcher-constructed scale of ten words is presented as informed by Quirk et al., but the manuscript does not include a validation step (e.g., human ordinal ranking or pilot data) showing that these particular ten items are representative of the broader intensity lexicon; this choice directly affects the compression finding and limits generalization claims.
  Authors: We acknowledge that the manuscript presents the ten-word scale as researcher-constructed and informed by the Quirk et al. taxonomy but does not include an explicit validation step such as human ranking or pilot data. In the revision we will expand the Methods section to provide a more detailed rationale for the specific word selections, drawing on the cited linguistic literature on degree modifiers. We will also add an explicit limitations statement clarifying that the compression and state-dependence findings are scoped to this particular set of words and that broader generalization to the full intensity lexicon would require additional validation. This revision addresses the concern without altering the paper's core empirical focus.
  Revision: yes
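The per-state breakdown promised in the first response (median, interquartile range, and sample size for states at or above 70% of capacity) could be produced along these lines. This is a sketch under assumptions: the record layout, the function name `summarize_by_state`, and all values are hypothetical, and the authors' actual pipeline is not shown on this page.

```python
from statistics import median, quantiles

# Hypothetical run records: (starting allocation as % of capacity, model's numeric output).
records = [
    (50, 10), (50, 12), (50, 15), (50, 20),
    (70, 5), (70, 5), (70, 6), (70, 10),
    (90, 0), (90, 0), (90, 0), (90, 10),
]

def summarize_by_state(records, min_state=70):
    """Report n, median, and IQR of outputs for each starting state >= min_state."""
    groups = {}
    for state, output in records:
        if state >= min_state:
            groups.setdefault(state, []).append(output)
    summary = {}
    for state, outputs in sorted(groups.items()):
        # Quartiles by linear interpolation over the observed values.
        q1, _, q3 = quantiles(outputs, n=4, method="inclusive")
        summary[state] = {"n": len(outputs), "median": median(outputs), "iqr": q3 - q1}
    return summary

summary = summarize_by_state(records)
for state, stats in summary.items():
    print(state, stats)
```

A table like this, further split by intensity word, would let a reader check whether word-level medians become identical at high starting states or are merely pinned at the ceiling.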
Circularity Check
No significant circularity: direct empirical measurement of model outputs
The paper performs controlled experiments in a resource-allocation task, varying only the intensity word or starting state across 6,620 runs with Claude Haiku. It reports observed medians, Spearman correlations (rho=0.845), and Kruskal-Wallis epsilon-squared values (0.782 for state vs 0.079 for word) directly from the numeric allocations produced by the model and converted by the deterministic backend. No equations, fitted parameters, or derivations appear; results are not reduced to quantities defined by the paper's own inputs or self-citations. The Quirk et al. taxonomy is an external reference used only to inform the scale construction, not to derive the reported patterns. The central claims remain scoped observations of model behavior against an external task.
Axiom & Free-Parameter Ledger
Free parameters (2)
- selection of 10 intensity words
- resource allocation environment parameters
Axioms (2)
- [domain assumption] Numeric allocations produced by the model reflect its internal interpretation of intensity word meaning.
- [standard math] Kruskal-Wallis and Spearman tests are appropriate for comparing rank variance between word and state groupings.
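As a rough illustration of the Spearman test named in the second axiom, the sketch below computes rho as the Pearson correlation of average ranks, which is the standard way to handle ties. The word ranks and outputs are invented; they only show that a high rho can coexist with several words sharing one output value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ordinal positions of six words and the outputs they produced.
word_rank = [1, 2, 3, 4, 5, 6]
output = [5, 5, 5, 10, 20, 30]  # the three weakest words share one value
rho = spearman_rho(word_rank, output)
print(round(rho, 3))
```

Here the tied lower tier pulls rho below 1 while keeping it high, so a strong rank correlation on its own does not show that every word is numerically distinct.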
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Across 6,620 runs... three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs... Second, ... grouping by starting allocation captures far more rank-based variance than grouping by word (ε² baseline = 0.782 vs. ε² word = 0.079)... Third, near feasibility limits the model exhibits three behavioral modes
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Quirk, Randolph; Greenbaum, Sidney; Leech, Geoffrey; Svartvik, Jan.
- [3] Mosteller, Frederick; Youtz, Cleo. Statistical Science.
- [4] Wallsten, Thomas S.; Budescu, David V.; Rapoport, Amnon; Zwick, Rami; Forsyth, Barbara. Journal of Experimental Psychology: General.
- [5] Pezzelle, Sandro; Bernardi, Raffaella; Piazza, Manuela. Cognition.
- [6] Ramotowska, Sonia; Haaf, Julia M.; van Maanen, Leendert; Szymanik, Jakub. Psychonomic Bulletin & Review.
- [7] Zhang, Yongqi; Liu, Dongyang; Chen, Jingjing; Wang, Haoxiang; Han, Xu. npj Complexity.
- [8] AhmadiTeshnizi, Ali; Gao, Wenzhi; Udell, Madeleine. Advances in Neural Information Processing Systems (NeurIPS).
- [9] Ramamonjison, Rindranirina, et al. NeurIPS 2022 Competition Track.
- [10] Romanou, Angelika; Ibrahim, Mark; Ross, Candace; Shaib, Chantal; Oktar, Kerem; Bell, Samuel J.; Ovalle, Anaelia; Dodge, Jesse; Bosselut, Antoine; Sinha, Koustuv; Williams, Adina. Brittlebench: Quantifying LLM robustness via prompt sensitivity. arXiv preprint arXiv:2603.13285.
- [11] Kennedy, Christopher; McNally, Louise. Language.
- [12]
- [13] Gotzner, Nicole; Solt, Stephanie; Benz, Anton. Frontiers in Psychology.
- [14] Measuring Scalar Constructs in Social Science with …. arXiv preprint arXiv:2509.03116.
- [15] Gurnee, Wes; Tegmark, Max. Proceedings of ICLR.
- [16] Schuster, Sebastian; Degen, Judith. Cognition.