Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

Daniel Tabach (Georgia Institute of Technology)

arxiv: 2605.21827 · v1 · pith:KV6CVQ3Anew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 08:21 UTC · model grok-4.3

keywords LLM · intensity words · degree modifiers · numeric actions · vague language · resource allocation · state dependence · Claude

The pith

Language models compress ten vague intensity words into five distinct numeric outputs in allocation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether language models maintain the intended ordering of words like slightly, somewhat, and drastically when they must generate numeric actions. It places these words in instructions for resource allocation and measures the resulting numbers across many runs while varying the starting state. The model collapses the words into fewer distinct values, lets the current allocation explain most of the output variation, and shows abrupt changes in behavior near capacity limits. Readers should care because AI systems are often given instructions with such imprecise terms, and inconsistent numeric translations could lead to unexpected resource use or safety issues.

Core claim

Across 6,620 runs, Claude Haiku maps ten intensity words to only five distinct median numeric allocations. When the initial system state is included in the prompt, the starting allocation captures far more rank-based variance than the intensity word. Near the upper limit the model hedges with weak words, abstains with most strong words, and pushes to the ceiling with the word drastically. These patterns hold at both T=0.0 and T=0.7.
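The compression claim can be checked mechanically: group the numeric outputs by word, take each word's median, and count how many distinct medians remain. The following is a minimal sketch under stated assumptions. The function name `distinct_medians`, the rounding to two decimals, and the sample data are illustrative and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median


def distinct_medians(runs, ndigits=2):
    """Group (word, allocation) pairs by word.

    Returns a dict of per-word medians and the sorted list of
    distinct median values across all words.
    """
    by_word = defaultdict(list)
    for word, allocation in runs:
        by_word[word].append(allocation)
    # Rounding is an assumption: it avoids counting float noise as a new level.
    medians = {w: round(median(v), ndigits) for w, v in by_word.items()}
    return medians, sorted(set(medians.values()))


# Made-up illustrative data, NOT the paper's results.
runs = [
    ("slightly", 0.50), ("slightly", 0.50), ("slightly", 0.45),
    ("somewhat", 0.50), ("somewhat", 0.50), ("somewhat", 0.55),
    ("drastically", 0.90), ("drastically", 0.95), ("drastically", 0.90),
]
medians, levels = distinct_medians(runs)
print(medians)
print(len(levels), "distinct median levels for", len(medians), "words")
```

In this toy input, two of the three words share a median, so fewer levels than words remain. The paper reports the analogous count as five levels for ten words.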

What carries the argument

A fixed scale of ten researcher-chosen English degree modifiers inserted into otherwise identical natural-language instructions to an LLM, whose numeric outputs are then executed by a deterministic backend in a resource-allocation scenario. This isolates lexical effects from state effects.

If this is right

The model treats four lower-tier words as numerically equivalent.
Lexical distinctions between intensity words largely disappear when the system is near capacity.
State information in the prompt dominates word choice in determining numeric output.
Stochastic sampling widens the spread but does not restore full ordinal separation among the words.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar compression of vague quantifiers may occur in other domains where models must output numbers, such as pricing or scheduling.
Replacing intensity words with explicit percentages in prompts might reduce unwanted state dependence.
Designers of user-facing AI tools may need to map user vague language to fixed numeric ranges rather than passing the words directly.
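The last suggestion, mapping vague words to fixed numeric ranges before they reach the model, could look roughly like the sketch below. Everything in it is hypothetical: the `INTENSITY_RANGES` table, its numeric values, the midpoint rule, and the `resolve_increase` helper are placeholders for illustration, not values or methods from the paper.

```python
# Hypothetical lookup table: each word maps to a (low, high) range expressed
# as a fraction of total capacity. The numbers are placeholders only.
INTENSITY_RANGES = {
    "slightly": (0.01, 0.05),
    "somewhat": (0.05, 0.10),
    "moderately": (0.10, 0.20),
    "drastically": (0.40, 0.60),
}


def resolve_increase(word, current, capacity=1.0):
    """Turn an intensity word into an explicit new allocation.

    Uses the midpoint of the word's range as the step size and clamps
    the result at capacity, so the same word always yields the same step
    regardless of how the request is phrased.
    """
    try:
        low, high = INTENSITY_RANGES[word.lower()]
    except KeyError:
        raise ValueError(f"no numeric range defined for {word!r}") from None
    step = (low + high) / 2 * capacity
    return min(capacity, current + step)


print(resolve_increase("slightly", 0.50))
print(resolve_increase("drastically", 0.89))  # clamped at capacity
```

The design choice here is to make the word-to-number translation deterministic and auditable outside the model, at the cost of ignoring any context the user may have intended.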

Load-bearing premise

The chosen scale of ten degree modifiers together with the specific resource-allocation task and deterministic backend are sufficient to reveal general properties of how models interpret intensity words rather than reflecting only this task or this model.

What would settle it

Repeating the experiment with a different model or in a different numeric domain such as time or money allocation and observing ten clearly separated median values with no collapse and no state dominance would falsify the claim.
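One way a replication could quantify ordinal separation is Spearman's rank correlation between each word's hypothesized tier and the model's output. The sketch below computes it as a Pearson correlation on average ranks, which handles ties. The helper names and the toy input are assumptions for illustration; the paper's own analysis code is not shown in the source.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(vx * vy)


# Toy input: three tiers collapse to the same output, one breaks away.
tiers = [1, 2, 3, 4]
outputs = [0.5, 0.5, 0.5, 0.9]
rho = spearman_rho(tiers, outputs)
print(round(rho, 3))
```

The toy input illustrates why a high positive rho can coexist with compression: tied outputs lower the correlation below 1 without making it vanish.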

Figures

Figures reproduced from arXiv: 2605.21827 by Daniel Tabach (Georgia Institute of Technology).

**Figure 1.** Experimental flow. Two conditions (no-context and context-conditioned) feed a single intensity word into the model, which produces a numeric allocation. A deterministic backend converts the allocation into a measurable outcome. The model is the only stochastic component; all downstream variance traces to the language-to-action translation step. 3.2 Environment The testbed is a synthetic constrained resourc…

**Figure 2.** No-context value frequency map at T=0.0. Each cell shows how many of 30 runs produced that allocation. Lower-tier words collapse to 0.50; stronger words occupy higher regimes.

**Figure 3.** No-context median allocation by word at T=0.0. Four lower-tier words collapse to 0.50 (dashed line). Moderately falls below this hedge, an ordinal anomaly. Stronger words break into higher regimes. The 0.50 concentration among the lower-tier words is notable. Rather than choosing small positive adjustments for weak words, the LLM defaults to the midpoint. One interpretation is that 0.50 functions as a he…

**Figure 4.** Median allocation by word across starting allocations b at T=0.0. Words fan out at low values of b and converge near capacity. The dashed line represents no change from the starting allocation. The interaction between word and state forms a differentiation funnel (…

**Figure 5.** Median change (output allocation minus starting allocation) by word and starting allocation at T=0.0. Strong words produce large changes when there is headroom; all words converge near capacity. Word choice matters most when the system has the most headroom, and becomes irrelevant as the system approaches capacity (…

**Figure 6.** Differentiation gap between weak (tiers 1–2) and strong (tiers 4–6) words. Left: median delta by group across baselines. Right: the gap shrinks from 0.40 at 0% to near zero at 75%. Ordinal faithfulness follows the same pattern (…

**Figure 7.** Spearman ρ between hypothesized tier and model output across starting allocations (T=0.0). Ordinal faithfulness is strong at low baselines and collapses near capacity. 4.3 Boundary Behavior: Hedge, Act, Abstain At the 89% starting allocation, the model does not merely scale every requested increase downward. It sometimes stops acting entirely. At T=0.0, 121 of 300 non-error runs at this baseline produce ze…

**Figure 8.** Abstention rate by word at the 89% starting allocation (T=0.0). Words in tiers 1–3 always act; several booster-class words abstain entirely. Drastically usually acts by pushing toward the ceiling. Lower words through moderately produce actions in all 30 runs, making small upward adjustments. Considerably, substantially, and significantly abstain in all 30 runs each. Dramatically abstains in 29 of 30 runs.…

**Figure 9.** No-context value frequency maps at T=0.0 (left) and T=0.7 (right). Temperature broadens the distribution into additional values, but the lower-tier midpoint hedge and upper-tier action regimes remain visible. Even at T=0.0, the no-context condition is not perfectly deterministic: 4 of 10 words produce more than one distinct value across 30 runs, with variation concentrated in mid-tier words. This is second…
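The per-word abstention rate described in the Figure 8 caption can be tabulated with a few lines of code. This is a minimal sketch that assumes abstention means the output allocation equals the starting allocation (within a small tolerance); the source text is truncated where the paper defines it, so that definition, the `abstention_rate` helper, and the sample runs are all assumptions.

```python
from collections import Counter


def abstention_rate(runs, start, tol=1e-9):
    """Fraction of runs per word whose output equals the starting allocation.

    runs:  iterable of (word, output_allocation) pairs
    start: the starting allocation shared by all runs
    """
    total = Counter()
    abstained = Counter()
    for word, out in runs:
        total[word] += 1
        # Assumed definition: "abstain" means no change from the start.
        if abs(out - start) <= tol:
            abstained[word] += 1
    return {w: abstained[w] / total[w] for w in total}


# Made-up runs at a 0.89 starting allocation, NOT the paper's data.
sample = [
    ("slightly", 0.90), ("slightly", 0.91),
    ("considerably", 0.89), ("considerably", 0.89),
    ("drastically", 1.00), ("drastically", 0.89),
]
rates = abstention_rate(sample, start=0.89)
print(rates)
```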

Original abstract

Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
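To make the effect-size comparison concrete, the sketch below computes the Kruskal-Wallis H statistic (with the standard tie correction) and a rank-based epsilon-squared. It assumes the commonly used definition ε² = H / (n − 1); the abstract does not state which formula the paper used, so treat that as an assumption. The function names are illustrative. Running the same observations once grouped by starting allocation and once grouped by word yields the two numbers being compared.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_epsilon_squared(groups):
    """Return (H, epsilon_squared) for a list of groups of numbers."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    # Sum over groups of (rank sum)^2 / group size; ranks follow pooled order.
    acc, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        acc += rank_sum ** 2 / len(g)
        pos += len(g)
    h = 12 / (n * (n + 1)) * acc - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    h /= correction
    # Assumed effect-size definition: epsilon^2 = H / (n - 1).
    return h, h / (n - 1)


# Toy example with three perfectly separated groups, NOT the paper's data.
h, eps2 = kruskal_epsilon_squared([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 3), round(eps2, 3))
```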

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Claude Haiku collapses ten intensity words into five numeric medians in this task, with starting state explaining far more output variance than the word itself.

The main things to know are that the model compresses the ten words into five distinct median allocations and that the current system state drives most of the variation, while the intensity word adds little once state is fixed. Near capacity the responses also split into hedging, abstaining, or maxing out depending on word strength, and these patterns hold at both T=0.0 and T=0.7 across 6,620 runs. Spearman correlation and Kruskal-Wallis tests with epsilon-squared values back the patterns, and the design keeps everything else fixed so the comparison between word and state is clean. The deterministic backend that turns the numeric output into a measurable result is a nice control that avoids some of the usual ambiguity in LLM evaluations.

The work is new in its specific combination of a linguist-informed ten-word scale, the resource-allocation environment, and the direct head-to-head variance partition between lexical and contextual effects. It does a solid job documenting the compression and the boundary modes without overclaiming.

The main limitation is narrow scope: everything is Claude Haiku in one deterministic allocation task, so the same compression and state dominance may not appear in other models or other numeric decisions such as pricing or scheduling. The word list is researcher-constructed even if it draws on Quirk et al., and without the full prompt text and raw distributions it is hard to judge how sensitive the results are to small wording changes. Generalization beyond this setup remains an open question.

Readers working on prompt engineering or LLM evaluation benchmarks will find the documented limits useful. The paper is clear enough on its own terms and the statistics are straightforward, so it deserves a serious referee even if the scope stays limited.

Referee Report

2 major / 2 minor

Summary. The paper empirically measures how Claude Haiku interprets a researcher-constructed scale of 10 English degree modifiers (slightly to drastically) when generating numeric resource allocations in a controlled, deterministic backend environment. Across 6,620 runs at two temperatures, it reports that the model compresses the 10 words into 5 distinct median outputs (Spearman rho = 0.845), that starting state explains substantially more rank variance than word choice (Kruskal-Wallis epsilon-squared 0.782 vs. 0.079), and that near capacity limits the model exhibits three distinct behavioral modes (hedging, abstention, or ceiling push). These patterns are claimed to persist under stochastic sampling.

Significance. If the scoped findings hold, the work supplies concrete evidence that current LLMs map vague intensity language to numeric actions in a compressed, state-dependent, and boundary-discontinuous manner. This has direct relevance for prompt design in decision-support and resource-management applications. The large run count, non-parametric statistics, and explicit controls for temperature and starting state constitute reproducible empirical strengths.

major comments (2)

[Results (Kruskal-Wallis analysis)] The central claim that state dominates word (epsilon-squared 0.782 vs. 0.079) is load-bearing for the 'state-dependent' conclusion, yet the manuscript does not report the exact allocation distributions or per-state medians for the highest starting allocations; without these, it is difficult to verify that lexical differentiation truly collapses to zero rather than being masked by ceiling effects.
[Methods (intensity word selection)] The researcher-constructed scale of ten words is presented as informed by Quirk et al., but the manuscript does not include a validation step (e.g., human ordinal ranking or pilot data) showing that these particular ten items are representative of the broader intensity lexicon; this choice directly affects the compression finding and limits generalization claims.

minor comments (2)

[Methods] The exact natural-language prompts and the deterministic backend conversion rule should be reproduced verbatim in an appendix to allow independent replication of the isolation between word and state effects.
[Figures] Figure captions and axis labels for the median-allocation plots should explicitly state the number of runs per condition and whether error bars represent inter-quartile range or standard error.
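The distinction the referee asks the captions to state, interquartile range versus standard error, matters because the two summarize different things: the IQR describes the spread of individual runs, while the standard error describes uncertainty in the mean and shrinks with sample size. A minimal sketch using Python's standard library follows; the `spread_summary` helper and the sample values are illustrative assumptions.

```python
from math import sqrt
from statistics import quantiles, stdev


def spread_summary(values):
    """Report run count, interquartile range, and standard error of the mean."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default "exclusive" method
    return {
        "n": len(values),
        "iqr": q3 - q1,
        "se": stdev(values) / sqrt(len(values)),
    }


# Illustrative values only.
summary = spread_summary([1, 2, 3, 4, 5, 6, 7])
print(summary)
```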

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recommending minor revision. We address each major comment below, indicating the revisions we will make to the manuscript.

Referee: [Results (Kruskal-Wallis analysis)] The central claim that state dominates word (epsilon-squared 0.782 vs. 0.079) is load-bearing for the 'state-dependent' conclusion, yet the manuscript does not report the exact allocation distributions or per-state medians for the highest starting allocations; without these, it is difficult to verify that lexical differentiation truly collapses to zero rather than being masked by ceiling effects.

Authors: We agree that the current manuscript reports the aggregate Kruskal-Wallis statistics but does not provide the per-state medians or full distributions specifically for the highest starting allocations. This detail would strengthen the claim that lexical differentiation collapses rather than being an artifact of ceiling effects. In the revised manuscript we will add a supplementary table (or figure) that reports median allocations, interquartile ranges, and sample sizes broken down by starting state, with particular attention to states at or above 70% of capacity. These statistics are already available from our existing analysis pipeline and will be included to allow direct verification of the collapse. revision: yes
Referee: [Methods (intensity word selection)] The researcher-constructed scale of ten words is presented as informed by Quirk et al., but the manuscript does not include a validation step (e.g., human ordinal ranking or pilot data) showing that these particular ten items are representative of the broader intensity lexicon; this choice directly affects the compression finding and limits generalization claims.

Authors: We acknowledge that the manuscript presents the ten-word scale as researcher-constructed and informed by the Quirk et al. taxonomy but does not include an explicit validation step such as human ranking or pilot data. In the revision we will expand the Methods section to provide a more detailed rationale for the specific word selections, drawing on the cited linguistic literature on degree modifiers. We will also add an explicit limitations statement clarifying that the compression and state-dependence findings are scoped to this particular set of words and that broader generalization to the full intensity lexicon would require additional validation. This revision addresses the concern without altering the paper's core empirical focus. revision: yes

Circularity Check

0 steps flagged

No significant circularity: direct empirical measurement of model outputs

The paper performs controlled experiments in a resource-allocation task, varying only the intensity word or starting state across 6,620 runs with Claude Haiku. It reports observed medians, Spearman correlations (rho=0.845), and Kruskal-Wallis epsilon-squared values (0.782 for state vs 0.079 for word) directly from the numeric allocations produced by the model and converted by the deterministic backend. No equations, fitted parameters, or derivations appear; results are not reduced to quantities defined by the paper's own inputs or self-citations. The Quirk et al. taxonomy is an external reference used only to inform the scale construction, not to derive the reported patterns. The central claims remain scoped observations of model behavior against an external task.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the study relies on standard statistical assumptions for rank-based tests and the premise that numeric allocations reflect lexical interpretation.

free parameters (2)

selection of 10 intensity words
Researcher-constructed scale informed by Quirk et al. taxonomy; specific words chosen to span the range.
resource allocation environment parameters
Details of the deterministic backend and capacity limits not specified in abstract.

axioms (2)

domain assumption Numeric allocations produced by the model reflect its internal interpretation of intensity word meaning.
The study treats observed numbers as direct evidence of how the model parses vague language.
standard math Kruskal-Wallis and Spearman tests are appropriate for comparing rank variance between word and state groupings.
Abstract reports these tests without stating distributional assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Across 6,620 runs... three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs... Second, ... grouping by starting allocation captures far more rank-based variance than grouping by word (ε² baseline = 0.782 vs. ε² word = 0.079)... Third, near feasibility limits the model exhibits three behavioral modes

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1] Cliff, Norman. Psychological Review.
[2] Quirk, Randolph and Greenbaum, Sidney and Leech, Geoffrey and Svartvik, Jan.
[3] Mosteller, Frederick and Youtz, Cleo. Statistical Science.
[4] Wallsten, Thomas S. and Budescu, David V. and Rapoport, Amnon and Zwick, Rami and Forsyth, Barbara. Journal of Experimental Psychology: General.
[5] Pezzelle, Sandro and Bernardi, Raffaella and Piazza, Manuela. Cognition.
[6] Ramotowska, Sonia and Haaf, Julia M. and van Maanen, Leendert and Szymanik, Jakub. Psychonomic Bulletin & Review.
[7] Zhang, Yongqi and Liu, Dongyang and Chen, Jingjing and Wang, Haoxiang and Han, Xu. npj Complexity.
[8] AhmadiTeshnizi, Ali and Gao, Wenzhi and Udell, Madeleine. Advances in Neural Information Processing Systems (NeurIPS).
[9] Ramamonjison, Rindranirina and others. NeurIPS 2022 Competition Track.
[10] Romanou, Angelika and Ibrahim, Mark and Ross, Candace and Shaib, Chantal and Oktar, Kerem and Bell, Samuel J. and Ovalle, Anaelia and Dodge, Jesse and Bosselut, Antoine and Sinha, Koustuv and Williams, Adina. Brittlebench: Quantifying LLM robustness via prompt sensitivity. arXiv preprint arXiv:2603.13285.
[11] Kennedy, Christopher and McNally, Louise. Language.
[12] Horn, Laurence R. PhD Dissertation, UCLA.
[13] Gotzner, Nicole and Solt, Stephanie and Benz, Anton. Frontiers in Psychology.
[14] Measuring Scalar Constructs in Social Science with. arXiv preprint arXiv:2509.03116.
[15] Gurnee, Wes and Tegmark, Max. Proceedings of ICLR.
[16] Schuster, Sebastian and Degen, Judith. Cognition.