One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Erfan Baghaei Potraghloo; Massoud Pedram; Seyedarmin Azizi; Souvik Kundu

arxiv: 2604.13006 · v2 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

keywords: instruction tuning · LLM fragility · lexical constraints · response comprehensiveness · planning failure · evaluation methodology · model robustness · surface-form dependence

The pith

Instruction-tuned LLMs lose 14-48% of response comprehensiveness when a single word or punctuation mark is banned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that instruction tuning makes large language models brittle by tying their ability to generate helpful responses to narrow surface patterns in output. When simple lexical constraints are imposed, such as forbidding one common character or word, the models produce shorter, less complete answers across multiple families and sizes. Human raters and automated judges confirm the loss is in actual content rather than just style. Base models show no comparable drop under the same rules, isolating the effect to the tuning process. This fragility appears even with practical constraints like tone rules or legal phrasing requirements that appear in real deployments.

Core claim

Instruction-tuned LLMs suffer a planning failure when simple lexical constraints are applied, causing 14-48% loss in response comprehensiveness across seven models. Two-pass generation recovers most of the lost length, and linear probes applied to prompt representations before any tokens are generated predict final response length with high accuracy, an effect absent in base models. The same constraints produce no systematic degradation in untuned models, showing that instruction tuning couples task competence to narrow surface-form templates.

What carries the argument

Linear probes on prompt representations that predict response length with R-squared values of 0.51-0.94, together with the recovery from two-pass generation, used to diagnose the collapse as a planning failure introduced by instruction tuning.
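
For readers less familiar with the statistic, the held-out coefficient of determination usually meant by R-squared in probing work can be written as below. The notation is an assumption made here, not taken from the paper: $y_i$ is the observed response length for prompt $i$, $\hat{y}_i$ is the probe's prediction, and $\bar{y}$ is the mean length over the evaluation set.

```latex
R^2 \;=\; 1 \;-\; \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^2}{\sum_{i} \left( y_i - \bar{y} \right)^2}
```

On held-out data this quantity drops below zero whenever the probe predicts worse than the constant guess $\bar{y}$, which is why a negative value is possible and is read as "no usable length signal in the representation".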

If this is right

Realistic constraints such as suppressing conversational openers, enforcing corporate tone, or adding legal compliance text produce 22-34% degradation, comparable to the synthetic bans.
Independent LLM-as-judge scoring detects only a 3.5% quality drop while pairwise evaluation reveals a 23% drop, indicating that current evaluation practice systematically underestimates content loss.
The fragility persists across model sizes from 7B to 70B and across both open- and closed-weight families.
Suppressing only the conversational opener "Certainly!" produces a 40% collapse on the most fragile model tested.
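
The percentage figures in this list are relative changes against an unconstrained baseline. A minimal sketch of how such a change and a baseline win rate could be computed from per-pair judge scores follows. The function names, the ratio-of-means aggregation, and the example scores are all assumptions for illustration; the paper's exact aggregation is not given on this page.

```python
from statistics import mean

def relative_change(baseline_scores, constrained_scores):
    """Relative change in percent of the mean constrained score vs. the mean baseline score."""
    base = mean(baseline_scores)
    return 100.0 * (mean(constrained_scores) - base) / base

def baseline_win_rate(baseline_scores, constrained_scores):
    """Fraction of pairs in which the baseline response scores strictly higher."""
    wins = sum(b > c for b, c in zip(baseline_scores, constrained_scores))
    return wins / len(baseline_scores)

# Hypothetical judge scores (1-10) for four prompt pairs; not data from the paper.
baseline = [8, 9, 8, 7]
constrained = [6, 6, 8, 5]

delta = relative_change(baseline, constrained)
win_rate = baseline_win_rate(baseline, constrained)
print(f"delta = {delta:.1f}%, baseline win rate = {win_rate:.2f}")
```

In a pairwise setup both scores in a pair come from a single judge call that sees the two responses side by side, whereas in an independent setup each score comes from a separate call; the arithmetic above is the same in both cases.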

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future instruction-tuning objectives could target output planning representations directly to reduce dependence on specific lexical triggers.
Similar surface-form coupling may exist in other post-training regimes such as safety or preference alignment and could be diagnosed with the same probe-and-recovery methods.
Deployment pipelines that apply even light output filters should test for this form of collapse rather than assuming instruction-tuned models remain robust.

Load-bearing premise

The drop in response quality is caused by instruction tuning linking competence to specific surface templates rather than by the constraints changing the underlying task or by artifacts in the way comprehensiveness is scored.

What would settle it

Linear probes trained on prompt representations of instruction-tuned models would lose their ability to predict response length under the lexical constraints if the planning-failure account is incorrect.

Figures

Figures reproduced from arXiv: 2604.13006 by Erfan Baghaei Potraghloo, Massoud Pedram, Seyedarmin Azizi, Souvik Kundu.

**Figure 1.** Constraint-induced response collapse. Adding a trivial lexical constraint (“do not use commas”) to an otherwise identical prompt causes Llama-3.1-8B-Instruct to abandon its structured 685-token response in favor of a 297-token flat-prose summary, a 27% loss in comprehensiveness despite no change in task or knowledge requirements. …

**Figure 2.** Pairwise comprehensiveness evaluation. Heatmap of relative change ∆% vs. unconstrained baseline (GPT-4o-mini judge); 40 prompts × 8 constraints = 320 pairs per model. Darker red indicates larger collapse. Baseline wins 97.5% / 98.4% / 77.2% of pairs for Llama / Qwen / Mistral. Complete per-constraint numerical results with absolute scores appear in Appendix A.1.

**Figure 3.** Atomic claim coverage analysis. GPT-4o extracts factual claims from unconstrained responses and checks which survive in constrained responses. Coverage and length retention move together (gap −0.8pp), inconsistent with a pure verbosity account. 192 pairs, 3,355 atom checks. Numerical values in Appendix A.5. Constrained responses preserve only 49.8% of baseline factual claims on average …

**Figure 4.** Human evaluation results. Ten blinded evaluators rate responses on six criteria (1–10). Information criteria drop 1.5–2.3× more than surface criteria, confirming genuine content loss. The dashed separator distinguishes information criteria (left) from surface criteria (right). 320 pairs per model. Numerical values in Appendix A.6.

**Figure 5.** Probing R² tracks collapse severity. The collapse decision is encoded in prompt representations before generation begins: a linear probe on the last prompt token predicts response length with R² = 0.51–0.94, with predictability tracking collapse severity across five models (r = 0.92). Base models yield negative R² (gray zone), confirming that instruction tuning introduces both the behavioral collapse and …

**Figure 6.** Instruction tuning creates the collapse. Slope chart showing comprehensiveness change ∆% (left) and baseline win rate (right) for base vs. instruction-tuned models under the same eight lexical constraints (GPT-4o pairwise judge; 320 pairs per model). Each line connects a base model to its instruction-tuned counterpart, making the within-family swing visually immediate: Qwen swings from +7.0% to −48.1% (−5…

**Figure 7.** Comprehensiveness change on MT-Bench by category (GPT-4o pairwise judge). The collapse is consistent across all eight MT-Bench categories for both models. Llama math (+3%) is the sole exception: its short, formulaic math responses do not rely on the formatting templates that collapse under constraints. Qwen collapses even on math (−51%), consistent with its stronger template dependence. The results closely…

Original abstract

Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness under trivial constraints? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14–48% of comprehensiveness across seven models spanning five families (7B–70B, open- and closed-weight). A blinded human evaluation with 10 STEM-trained evaluators confirms genuine content loss, with information criteria degrading 1.5–2.3× more than surface criteria, a finding corroborated by over 4,100 automated pairwise comparisons (77–100% baseline preference) across three LLM judges from two model families. Diagnostic analysis identifies this as a planning failure: two-pass generation recovers 59–96% of response length, and linear probes on prompt representations predict response length with R² = 0.51–0.94 before generation begins. The same probes yield negative R² on base models, confirming that instruction tuning introduces the representational structure underlying the collapse. Base models show no systematic degradation under identical constraints, demonstrating that instruction tuning couples task competence to narrow surface-form templates. The effect extends to realistic deployment constraints (preamble suppression, corporate tone guidelines, legal compliance hedging, accessibility requirements) causing comparable degradation (−22% to −34%), with suppressing the conversational opener alone (“Certainly!”) causing 40% collapse on our most fragile model despite restricting only the opening tokens. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in current evaluation practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Instruction tuning makes LLMs drop real content when prompts ban one common word or mark, while base models stay stable, with human checks and probes backing the planning-failure account.

The core finding is that banning a single punctuation mark or everyday word in the prompt causes instruction-tuned models to produce responses that lose 14-48% of their comprehensiveness across seven models. Base models show no such drop under the same rules. Human evaluators confirm the loss is mostly information, not just shorter or less polished text, and the effect also shows up in realistic cases such as corporate tone rules or legal hedging. Suppressing just the 'Certainly!' opener triggers a 40% collapse on the most fragile model. The authors also show that independent, single-response LLM judging misses most of the quality drop that pairwise comparison catches.

Referee Report

2 major / 3 minor

Summary. The manuscript claims that instruction-tuned LLMs exhibit fragility in helpfulness under trivial lexical constraints (e.g., banning one punctuation mark or common word), resulting in 14-48% loss of response comprehensiveness across seven models (7B-70B, five families). Base models show no degradation under identical constraints. The collapse is diagnosed as a planning failure induced by instruction tuning coupling competence to narrow surface-form templates, supported by blinded human evaluation (information loss 1.5-2.3x surface loss), >4100 automated pairwise comparisons, two-pass generation recovery (59-96% length), and linear probes on prompt representations (R² 0.51-0.94 tuned vs. negative on base). The effect generalizes to realistic constraints (e.g., preamble suppression causing 40% collapse) and reveals a blind spot in independent LLM-as-judge evaluation (3.5% vs. 23% detected drop).

Significance. If the central claim holds, the result is significant: it identifies a previously under-appreciated brittleness in instruction-tuned helpfulness, with direct implications for deployment under common constraints (legal, accessibility, corporate tone). The base-model controls, blinded human ratings separating content from surface, and pre-generation representational probes provide convergent evidence that the fragility arises from tuning-induced representational structure rather than task alteration or scoring artifacts. The demonstration that standard independent LLM judges miss most of the degradation is a methodological contribution that could affect evaluation practice.

major comments (2)

[§3 and §4.1] §3 (Experimental Setup) and §4.1 (Comprehensiveness Results): the 14-48% degradation range is load-bearing for the fragility claim, yet the exact operationalization of 'comprehensiveness' (including token counting rules, exclusion criteria for incomplete responses, and normalization) is not fully specified; without this, it is difficult to rule out that the measured drop partly reflects surface-form sensitivity in the metric itself rather than content loss.
[§4.3] §4.3 (Linear Probes): the R² = 0.51-0.94 (tuned) vs. negative (base) on prompt representations is central to the 'planning failure' and 'representational structure' argument. The manuscript should report probe architecture details (layer(s) used, pooling method, regularization, and whether probes are trained per-model or pooled) and confirm that predictions are made from the prompt embedding before any generation tokens are produced.

minor comments (3)

[Table 1] Table 1 (model results): adding per-model absolute comprehensiveness scores (not only relative drop) would allow readers to assess whether baseline helpfulness varies systematically with fragility.
[Human Evaluation] Human evaluation protocol: while blinded and STEM-trained evaluators are used, reporting inter-annotator agreement (e.g., Krippendorff's alpha) for the information vs. surface distinction would increase confidence in the 1.5-2.3x differential.
[Two-pass Recovery] The two-pass recovery experiment (59-96%) is presented as evidence of planning failure; a brief note on whether the second pass still respects the original lexical constraint would eliminate any ambiguity.
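
The last minor comment can be made concrete with a short sketch of a two-pass run that re-checks the original constraint on the final output. The protocol shown (draft without the constraint, then rewrite under it) is an assumption, since this page does not describe the paper's exact two-pass procedure; `two_pass`, `violates_no_comma`, and `stub_generate` are hypothetical names, and the stub stands in for a real model call.

```python
from typing import Callable

def two_pass(prompt: str, constraint: str, generate: Callable[[str], str]) -> str:
    """Pass 1 drafts without the constraint; pass 2 rewrites the draft under it."""
    draft = generate(prompt)
    rewrite_prompt = (
        "Rewrite the following answer so that it obeys this rule "
        "while keeping all of its content.\n"
        f"Rule: {constraint}\n\nAnswer:\n{draft}"
    )
    return generate(rewrite_prompt)

def violates_no_comma(text: str) -> bool:
    """Check the original lexical constraint on the final output."""
    return "," in text

def stub_generate(text: str) -> str:
    # Hypothetical stand-in for a model: rewrites by deleting commas, else returns a fixed draft.
    if text.startswith("Rewrite"):
        return text.split("Answer:\n", 1)[1].replace(",", "")
    return "Gradient descent needs a loss, a step size, and a starting point."

final = two_pass(
    "Explain gradient descent in simple terms",
    "Do not use any commas in your response.",
    stub_generate,
)
print(final, "| constraint violated:", violates_no_comma(final))
```

Reporting the violation rate of the second pass alongside the recovered length would separate "recovered and compliant" outputs from outputs that recover length only by ignoring the rule.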

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. The comments highlight important areas for clarification in the experimental setup and diagnostic analyses. We have revised the paper to address both points by expanding the relevant sections with precise methodological details. Below we respond to each major comment.

Referee: [§3 and §4.1] §3 (Experimental Setup) and §4.1 (Comprehensiveness Results): the 14-48% degradation range is load-bearing for the fragility claim, yet the exact operationalization of 'comprehensiveness' (including token counting rules, exclusion criteria for incomplete responses, and normalization) is not fully specified; without this, it is difficult to rule out that the measured drop partly reflects surface-form sensitivity in the metric itself rather than content loss.

Authors: We appreciate the referee's observation that greater precision is needed here. In the revised manuscript we have expanded §3.2 to specify the metric in full: comprehensiveness is the ratio of tokenized length (using the model's native tokenizer) of the constrained response to its unconstrained counterpart, after (a) removing all occurrences of the banned token and (b) discarding any response shorter than one complete sentence (fewer than 15 tokens after punctuation normalization). Lengths are computed per prompt and then averaged, so the reported 14–48% range already normalizes for prompt-specific variation. Because the human evaluation (blinded, information vs. surface criteria) and the >4,100 pairwise comparisons both show substantially larger content degradation than surface degradation, we argue the metric is not merely capturing surface sensitivity; the added specification should nevertheless eliminate any remaining ambiguity. revision: yes
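
The metric described in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: whitespace splitting stands in for the model's native tokenizer, the banned token is stripped from both responses, and a pair is dropped if either side falls under the 15-token threshold. All function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple

MIN_TOKENS = 15  # minimum length quoted in the response above

def tokenize(text: str) -> List[str]:
    # Assumption: whitespace splitting stands in for the model's native tokenizer.
    return text.split()

def length_ratio(baseline: str, constrained: str, banned: str) -> Optional[float]:
    """Constrained-to-baseline token-length ratio, or None if either response is too short."""
    base = tokenize(baseline.replace(banned, ""))
    cons = tokenize(constrained.replace(banned, ""))
    if len(base) < MIN_TOKENS or len(cons) < MIN_TOKENS:
        return None
    return len(cons) / len(base)

def mean_length_ratio(pairs: List[Tuple[str, str]], banned: str) -> float:
    """Average the per-prompt ratios over the pairs that pass the length filter."""
    ratios = [length_ratio(b, c, banned) for b, c in pairs]
    return mean(r for r in ratios if r is not None)

# Toy pairs (baseline, constrained); the second pair is discarded as too short.
pairs = [
    ("word, " * 40, "word " * 20),
    ("word, " * 40, "too short"),
]
score = mean_length_ratio(pairs, banned=",")
print(score)
```

Under this reading, the reported loss for a model would be one minus the averaged ratio.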
Referee: [§4.3] §4.3 (Linear Probes): the R² = 0.51-0.94 (tuned) vs. negative (base) on prompt representations is central to the 'planning failure' and 'representational structure' argument. The manuscript should report probe architecture details (layer(s) used, pooling method, regularization, and whether probes are trained per-model or pooled) and confirm that predictions are made from the prompt embedding before any generation tokens are produced.

Authors: We agree that these implementation details strengthen the reproducibility of the representational analysis. The revised §4.3 and new Appendix C now state: probes are ridge regressors (linear least squares with L2 regularization, strength chosen by 5-fold cross-validation) trained on the mean-pooled final-layer hidden states extracted from the prompt only. No generation tokens are ever included in the probe input. Probes are fit separately for each model–constraint pair (not pooled across models). We explicitly confirm that all R² values are obtained from prompt embeddings before any decoding begins, directly supporting the claim that the length-predictive structure is already present in the tuned models' representations prior to generation. revision: yes
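
A probe of the kind described in this response can be sketched with NumPy alone. Everything below is an assumption for illustration: the feature matrix is synthetic rather than real hidden states, the candidate regularization strengths are arbitrary, and `ridge_fit`, `r2_score`, and `cv_r2` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cv_r2(X, y, lam, k=5, seed=0):
    """Mean held-out R^2 over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = ridge_fit(X[train], y[train], lam)
        scores.append(r2_score(y[test], X[test] @ w + b))
    return float(np.mean(scores))

# Synthetic stand-ins: 200 "prompts" with 16-dim "hidden states".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y_signal = X @ rng.normal(size=16) + 0.1 * rng.normal(size=200)  # lengths linear in X
y_noise = rng.normal(size=200)                                   # lengths unrelated to X

best_lam = max([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_r2(X, y_signal, lam))
signal_r2 = cv_r2(X, y_signal, best_lam)
noise_r2 = cv_r2(X, y_noise, best_lam)
print(f"lam={best_lam}, signal R^2={signal_r2:.3f}, noise R^2={noise_r2:.3f}")
```

When the targets carry no linear signal, the held-out score typically lands near or below zero, which matches the pattern described for base models. Selecting the regularization strength and reporting the score on the same folds is slightly optimistic; a nested split would be the stricter choice.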

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation relies on independent empirical measurements (response length degradation, blinded human evaluations separating information vs. surface criteria, two-pass recovery experiments, and linear probes on prompt representations) plus base-model controls that show no degradation under identical constraints. These elements provide external validation and do not reduce to self-referential fitting, parameter renaming, or load-bearing self-citations. Probe R² values are presented as diagnostic correlations confirming representational differences introduced by instruction tuning, not as tautological predictions of the same fitted quantities. No ansatzes, uniqueness theorems, or renamings of known results are invoked in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no new mathematical axioms, free parameters, or invented entities; relies on standard notions of response length, comprehensiveness, and linear probing.
