Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Alexander Scarlatos; Andrew Lan; Danielle McNamara; Hunter McNichols; Mihai Dascalu

arxiv: 2605.27249 · v1 · pith:JZBI4THEnew · submitted 2026-05-26 · 💻 cs.AI · cs.CL

Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara, Andrew Lan

Pith reviewed 2026-06-29 16:45 UTC · model grok-4.3

keywords: counterfactual generation, student writing, Gumbel noise, controlled decoding, large language models, educational feedback, rubric consistency, similarity control

The pith

The Gumbel Machine generates counterfactual student writing that meets rubric criteria while remaining similar to the original through controlled LLM decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to create improved versions of student writing that stay close enough to the student's own work for effective learning. It combines standard LLM instruction following with a new decoding step that treats latent randomness as an adjustable knob for output similarity. A reader would care because typical LLM generations often diverge too much from the reference, limiting their usefulness as teaching examples. The approach is tested on datasets of scored student writing and produces texts that satisfy the rubrics yet match the originals closely. This avoids the need for fully custom domain-specific models.

Core claim

The Gumbel Machine is a flexible modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to the approach is a novel controlled decoding algorithm, β-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing scored on various criteria demonstrate the effectiveness at generating counterfactuals that are both rubric-consistent and similar to a reference.

What carries the argument

β-Hindsight control, a controlled decoding algorithm that uses latent randomness as a tunable similarity control mechanism to steer LLM outputs.
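
The text above does not spell out how β-Hindsight control works, so the following is only a minimal sketch of the general idea that "latent randomness" can act as a similarity lever. It uses the standard Gumbel-max trick (sampling a token as the argmax of logits plus Gumbel noise) and shows that reusing the same noise under slightly different logits tends to reproduce the same token, while fresh noise does not. The vocabulary and both logit vectors are made-up toy values, not taken from the paper.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via inverse transform sampling."""
    u = rng.random()
    # Guard against u == 0.0, which would make log(u) undefined.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max(logits, noise):
    """Return the index maximising logit + noise (the Gumbel-max trick)."""
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])

rng = random.Random(0)
vocab = ["good", "great", "fine", "bad"]   # toy vocabulary (assumption)
factual_logits = [1.0, 0.5, 0.2, -1.0]     # toy logits under the original prompt
edited_logits = [1.1, 0.7, 0.1, -2.0]      # toy logits under a rewrite prompt

trials, same_shared, same_fresh = 2000, 0, 0
for _ in range(trials):
    noise = [sample_gumbel(rng) for _ in vocab]
    fresh = [sample_gumbel(rng) for _ in vocab]
    original = gumbel_max(factual_logits, noise)
    # Same noise, different logits: the choice usually survives.
    same_shared += original == gumbel_max(edited_logits, noise)
    # Independent noise: agreement falls to chance overlap of the two distributions.
    same_fresh += original == gumbel_max(edited_logits, fresh)

print(f"same token with shared noise: {same_shared / trials:.2f}")
print(f"same token with fresh noise:  {same_fresh / trials:.2f}")
```

In this toy setting the shared-noise agreement rate should come out well above the fresh-noise rate, which is the intuition behind treating the noise as something worth keeping when a rewrite should stay close to a reference.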

If this is right

Counterfactuals serve as learning demonstrations that students can emulate more readily than distant high-quality examples.
The method applies across multiple scoring criteria in student writing without requiring domain-specific system redesign.
Outputs balance improvement and resemblance through the tunable similarity control.
The modular design integrates directly with existing LLM instruction-following without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The control technique could extend to other factual text domains such as reports or code where similarity to a reference is required.
It offers a general route to impose similarity constraints on LLM generation tasks without additional fine-tuning.
Combining the mechanism with other decoding controls might allow even finer trade-offs between quality and fidelity.

Load-bearing premise

That the β-Hindsight control mechanism can be adjusted in practice to produce outputs that are simultaneously improved and sufficiently similar without additional unstated constraints or post-processing.

What would settle it

An evaluation on the student writing datasets in which no setting of the control parameter yields texts that independent raters judge as both higher rubric quality and adequately close to the reference.

Figures

Figures reproduced from arXiv: 2605.27249 by Alexander Scarlatos, Andrew Lan, Danielle McNamara, Hunter McNichols, Mihai Dascalu.

**Figure 2.** Overview of the Gumbel Machine. In Stage 1, we recover noise that would have produced the observed ...

**Figure 3.** Similarity–validity trade-offs comparing Gum...

**Figure 4.** Example of a counterfactual rewrite where Gumbel Machine (GM) remains similar in precise wording ...

**Figure 5.** Example of a counterfactual rewrite where Gumbel Machine (GM) preserves the student’s two-part key ...

**Figure 6.** Example of a counterfactual rewrite where Gumbel Machine (GM) retains the student’s original wording ...

**Figure 7.** Teacher evaluation interface per-rewrite questions (Q1–Q3) for rewrites A and B.

**Figure 8.** Teacher evaluation interface pairwise comparison questions (Q4–Q6).

**Figure 9.** Example instruction prompt for “Language” criterion including reference. This prompt structure is the ...

**Figure 10.** Example instruction prompt for “Details” criterion excluding reference. This prompt structure is used for ...

**Figure 11.** Example Identify-and-Replace (I&R) prompt for the DREsS “Language” criterion.

**Figure 12.** Example scoring model prompt for “Language” criterion.

**Figure 13.** Token count statistics for the summaries and passages in the CLASSE dataset. Summaries are approxi...

**Figure 14.** Token count statistics for the essays and writing prompts in the DREsS dataset. Essays are approximately ...
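
The Figure 2 caption mentions recovering noise that would have produced the observed text. The paper's exact procedure is not reproduced on this page, so the sketch below shows one well-known way to do this kind of "hindsight" sampling for a single token: draw the maximum perturbed value first, pin it to the observed token, and draw every other perturbed value from a Gumbel truncated below that maximum. The function name `hindsight_noise` and the toy logits are assumptions for illustration, and the arithmetic is not hardened for extreme logits.

```python
import math
import random

def sample_gumbel(rng, loc=0.0):
    """Draw one Gumbel(loc, 1) variate via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return loc - math.log(-math.log(u))

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def hindsight_noise(logits, observed, rng):
    """Sample Gumbel noise g with argmax_i(logits[i] + g[i]) == observed.

    Hypothetical helper name; a sketch of posterior ("hindsight") Gumbel
    sampling, not code from the paper.
    """
    # The maximum perturbed logit is Gumbel with location logsumexp(logits),
    # independent of which index attains it.
    top = sample_gumbel(rng, loc=logsumexp(logits))
    noise = []
    for i, logit in enumerate(logits):
        if i == observed:
            perturbed = top
        else:
            # Gumbel(logit) conditioned to fall below `top`.
            free = sample_gumbel(rng, loc=logit)
            perturbed = -math.log(math.exp(-free) + math.exp(-top))
        noise.append(perturbed - logit)
    return noise

rng = random.Random(1)
logits = [2.0, 0.5, -1.0, 0.0]   # toy next-token logits (assumption)
observed = 2                     # a low-probability token the writer actually used

recovered = [hindsight_noise(logits, observed, rng) for _ in range(200)]
replayed = [
    max(range(len(logits)), key=lambda i: logits[i] + g[i]) for g in recovered
]
print(all(token == observed for token in replayed))
```

Replaying the recovered noise through the same logits reproduces the observed token, even though that token is unlikely under the model. Applied position by position, this yields a noise sequence that can then be replayed under a different prompt.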

Original abstract

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $\beta$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The Gumbel Machine adds a tunable decoding step with β-Hindsight control to keep LLM counterfactuals close to a student's original writing while improving rubric scores, but the evaluation leaves the practical effectiveness unclear.

The new piece is β-Hindsight control, which treats latent randomness as a dial for similarity during decoding. It sits on top of an instruction-tuned LLM and tries to produce an improved version of student text without drifting too far from the reference. That modular setup is the main practical angle: no fine-tuning required, just a change in how tokens are sampled.
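
To make the "dial" metaphor concrete, here is a deliberately simple stand-in, not the paper's β-Hindsight control, whose definition is not given on this page. In this hypothetical scheme, each position reuses the stored noise with probability `beta` and draws fresh Gumbel noise otherwise. Because that choice is independent of the noise values, each position's noise is still marginally standard Gumbel, so every token is still a valid sample from the rewrite distribution. All logits are toy values, and the stored noise is known only because the sketch generates the "factual" tokens itself.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))

def argmax_perturbed(logits, noise):
    """Index maximising logit + noise (Gumbel-max trick)."""
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])

def agreement_rate(beta, steps=3000, seed=0):
    """Fraction of positions where the rewrite keeps the factual token.

    HYPOTHETICAL dial: stored noise is reused with probability `beta`
    and replaced by fresh noise otherwise.
    """
    rng = random.Random(seed)
    factual_logits = [1.0, 0.5, 0.2, -1.0]   # toy values (assumption)
    rewrite_logits = [0.6, 1.0, 0.2, -2.0]   # toy values (assumption)
    kept = 0
    for _ in range(steps):
        stored = [sample_gumbel(rng) for _ in factual_logits]
        factual_token = argmax_perturbed(factual_logits, stored)
        if rng.random() < beta:
            noise = stored
        else:
            noise = [sample_gumbel(rng) for _ in rewrite_logits]
        kept += factual_token == argmax_perturbed(rewrite_logits, noise)
    return kept / steps

rates = {beta: agreement_rate(beta) for beta in (0.0, 0.5, 1.0)}
for beta, rate in rates.items():
    print(f"beta={beta:.1f}  tokens kept: {rate:.2f}")
```

In this toy, raising `beta` raises the share of tokens that match the factual text, which is the kind of single-parameter trade-off the note describes. Whether the paper's mechanism interpolates in this way is not something this page establishes.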

The experiments claim the outputs stay rubric-consistent and similar enough to the original on student writing datasets. If the full results show that a single parameter reliably trades off those two goals without extra post-processing, that would be a usable trick for edtech tools.

The soft spot is the thin evidence. The abstract gives no baselines, no quantitative similarity metrics, no statistical tests, and no comparison to earlier Gumbel-based or controlled-decoding methods. Without those numbers it is hard to know whether the control actually works as advertised or whether hidden constraints are doing the heavy lifting. The central assumption, that latent noise alone can be tuned to achieve both improvement and similarity, needs clearer validation than is visible here.

This is for people building AI writing coaches or example generators in education. A reader already working on controlled text generation might pick up the decoding trick if the code ships, but the paper does not move the broader field.

I would send it to peer review. The idea is straightforward and the target use case is real; referees can check whether the claimed tunability holds up once the methods and numbers are on the table.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the Gumbel Machine, a modular controlled-decoding framework for generating counterfactual student writing. It relies on LLM instruction-following combined with a novel β-Hindsight control mechanism that treats latent randomness (via Gumbel noise) as a tunable similarity knob during decoding. The central claim is that this produces rubric-consistent improvements that remain similar to a given reference text, with experiments on student-writing datasets offered as evidence of effectiveness.

Significance. If the experimental results hold, the method could offer a practical, domain-agnostic way to create personalized learning examples without requiring task-specific fine-tuning. The modular design and single-parameter similarity control are potentially attractive for educational applications. However, the provided abstract supplies no datasets, metrics, baselines, or statistical details, so the actual contribution cannot yet be evaluated.

major comments (1)

[Abstract] Abstract: the claim that 'Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach' is unsupported because the abstract (and the visible manuscript excerpt) contains no description of the datasets, scoring rubrics, evaluation metrics, baselines, or statistical tests. This information is load-bearing for the central empirical claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comment. We agree that the abstract should better support its empirical claim by including high-level details on the evaluation setup. We will revise the abstract in the next version while preserving its brevity.

Referee: [Abstract] Abstract: the claim that 'Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach' is unsupported because the abstract (and the visible manuscript excerpt) contains no description of the datasets, scoring rubrics, evaluation metrics, baselines, or statistical tests. This information is load-bearing for the central empirical claim.

Authors: The referee is correct that the current abstract provides no specifics on datasets, rubrics, metrics, baselines or tests. The full paper contains these details in Sections 4 and 5 (student-writing corpora, rubric dimensions, similarity and rubric-consistency metrics, and baseline comparisons with statistical reporting). To make the abstract self-contained, we will add one concise sentence summarizing the evaluation (e.g., “We evaluate on two student-writing corpora using rubric-based and similarity metrics against strong LLM baselines, with statistical significance reported.”). This revision directly addresses the concern without altering the paper’s technical content.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a modular controlled-decoding method (β-Hindsight control) for generating rubric-consistent counterfactual student writing. The abstract and description contain no equations, parameter-fitting steps, or derivation chains. The central claim is supported by experimental results on student-writing datasets rather than any self-referential mathematical reduction. No self-citations, ansatzes, or uniqueness theorems appear as load-bearing elements in the provided text. The approach is presented as an empirical engineering contribution whose validity rests on external evaluation, not on internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new entities introduced by the work.
