Assert, don't describe: Linguistic features that shift LLM reasoning about animal welfare

Harper Dunn; Jasmine Brazilek

arxiv: 2606.26104 · v2 · pith:QMMEII4Vnew · submitted 2026-04-30 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-01 07:58 UTC · model grok-4.3

keywords: linguistic features, LLM fine-tuning, animal welfare, stance, pro-animal-welfare reasoning, assertive language, hedged language, fine-tuning data

The pith

Linguistic features that make a writer's stance explicit strengthen an LLM's pro-animal-welfare reasoning after fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether ten specific ways of phrasing animal-welfare arguments change how Llama-3.2-1B reasons about the topic after fine-tuning. Using matched pairs of texts that differ in only one feature, the study finds that assertive certainty, moral words, emotion, evaluations, narratives, severe harm depictions, and immediate time frames all increase the model's preference for pro-welfare answers. Hedged language and concrete sensory description decrease it, while first-person perspective has no statistically significant effect. This matters because much advocacy writing ends up in training data, so the style a writer chooses affects what models later say on the issue.

Core claim

When animal-welfare texts that differ in only one linguistic feature are used to fine-tune Llama-3.2-1B, eight of ten features produce statistically significant shifts in the model's later answers on a held-out benchmark. Assertive, morally explicit, emotional, evaluative, narrative, harm-severe, and immediately framed texts move the model toward stronger pro-animal-welfare positions. Hedged language and concrete sensory descriptions move it away from those positions. First-person perspective shows no statistically significant effect.

What carries the argument

Vocabulary-matched stance-contrast probes that hold topic, length, and other variables constant while varying only one linguistic feature.

If this is right

Writers of animal-welfare material should favor assertive statements over neutral descriptions to embed stronger stances in future models.
Hedging and sensory detail in training texts can unintentionally weaken model support for the cause.
The effect is carried by features that make the position explicit rather than by first-person narration.
Models can be steered on ethical topics through the stylistic choices in their fine-tuning corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar shifts might occur for other ethical domains like environmental or social justice issues when the same features are varied.
Training data curators could select or filter texts based on these features to control model outputs on specific topics.
The pattern suggests stance is transmitted through explicitness more than through personal voice or neutral description.

Load-bearing premise

The probes successfully isolate each feature's effect without any leftover differences in topic or other variables that could explain the shifts.

What would settle it

Two checks would settle it: whether the same feature variations, applied to texts on a different topic, still shift model answers on an animal-welfare benchmark, and whether a re-run on a different model shows the same directions of effect.

Figures

Figures reproduced from arXiv: 2606.26104 by Harper Dunn, Jasmine Brazilek.

**Figure 2.** Mean preference score for each fine-tune, plotted on the same absolute scale as each model’s …

Original abstract

Animal-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare. Using vocabulary-matched stance-contrast probes on a held-out animal-welfare benchmark, we measure how each of ten linguistic features changes Llama-3.2-1B's preference for pro-animal-welfare reasoning when used as fine-tuning data. Eight of the ten features produce statistically significant shifts. Seven move the model toward stronger pro-animal-welfare reasoning: assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing. Two move it the other way: hedged language and concrete sensory description both dilute the pro-animal-welfare stance. First-person perspective has no statistically significant effect. The practical recommendation for anyone writing animal-welfare text that may end up in LLM training corpora: assert a position rather than describe a scene neutrally. The features that shift the model are the ones that make the writer's position explicit; the features that dilute it hold animal-welfare content but withhold stance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Assertive linguistic features in the fine-tuning probes shift Llama-3.2-1B toward stronger pro-animal-welfare answers on the benchmark while hedging weakens it.

The main result is that eight of the ten tested features move the model's outputs on the held-out animal-welfare benchmark after fine-tuning. Assertive certainty, moral vocabulary, emotion words, evaluative claims, narrative structure, harm severity, and immediate framing strengthen the pro stance; hedged language and concrete sensory description weaken it. First-person perspective shows no detectable effect.

The paper supplies a concrete, directional mapping from these standard linguistic categories to a downstream behavioral outcome on one small model. It uses vocabulary-matched stance-contrast probes, which is a reasonable way to hold topic and polarity roughly fixed, and it turns the measurements into a practical rule for writers: explicit position-taking in the data matters more than neutral description. That takeaway follows directly from the reported shifts and is the clearest contribution.

The soft spot is the isolation of each feature. Vocabulary matching reduces some confounds, but the probes could still differ in parse depth, coreference, or other unmeasured properties that the fine-tuning picks up. The abstract gives no numbers on edit distance, embedding similarity, or syntactic distributions to confirm the matching worked at the level needed for clean causal claims. Sample sizes, exact tests, and multiple-comparison handling are also not visible, so the statistical significance needs checking in the methods.
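The kind of matching check the note asks for can be sketched in a few lines. The following is a minimal illustration, not the authors' procedure: it computes a token-level Levenshtein edit distance, a Jaccard overlap of vocabularies, and a length difference for one probe pair. The function names and the two example sentences are hypothetical, and whitespace tokenization is an assumption made only to keep the sketch self-contained.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Share of distinct tokens that the two texts have in common."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pair_report(text_a, text_b):
    """Summarize how closely two probe texts match (whitespace tokens, lowercased)."""
    tokens_a, tokens_b = text_a.lower().split(), text_b.lower().split()
    return {
        "token_edit_distance": levenshtein(tokens_a, tokens_b),
        "vocab_jaccard": jaccard(tokens_a, tokens_b),
        "length_diff": abs(len(tokens_a) - len(tokens_b)),
    }


# Hypothetical probe pair, invented for illustration (not taken from the paper).
assertive = "caging hens causes suffering"
hedged = "caging hens may cause suffering"
print(pair_report(assertive, hedged))
```

Reporting these numbers per feature pair would show how much of each contrast is confined to the target feature. Checks on parse depth or coreference would need a parser and are outside this sketch.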

This is for people studying how surface properties of training text influence model behavior on value-laden topics. A reader who wants data points on data curation or alignment interventions will find something usable here. The thinking is empirical and straightforward, so the work deserves a serious referee to verify the controls and test generalizability.

Recommendation: send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims that vocabulary-matched stance-contrast probes on a held-out animal-welfare benchmark show eight of ten linguistic features produce statistically significant shifts in Llama-3.2-1B's pro-animal-welfare reasoning after fine-tuning. Seven features (assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, immediate temporal framing) strengthen the stance; hedged language and concrete sensory description dilute it; first-person perspective has no effect. The practical takeaway is that writers should assert positions explicitly rather than describe neutrally.

Significance. If the isolation of individual features holds, the work offers concrete guidance on how stylistic choices in advocacy writing can influence LLM outputs on ethical topics, with potential implications for training data curation in value-laden domains.

major comments (2)

[Probe construction (methods)] The central claim requires that each probe pair differs from its contrast only in the target feature while holding topic, length, stance polarity, and other variables fixed. Vocabulary matching alone does not guarantee this; residual differences in parse structure, coreference, or non-vocabulary token distributions could be absorbed into the fine-tuning gradient and produce the observed shifts without the intended feature being causal. No quantitative checks (embedding cosine, edit distance, syntactic feature distributions) are reported to validate isolation at the level needed for causal attribution.
[Abstract / Results] The abstract reports statistically significant shifts for eight features but provides no error bars, sample sizes, exact statistical tests, or controls for multiple comparisons. Without these details it is impossible to confirm that the probes isolate each feature or that results are robust.

minor comments (1)

[Methods] Clarify the exact number of probe pairs, benchmark size, and fine-tuning hyperparameters to allow replication and assessment of effect magnitudes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and will revise the paper to incorporate additional validation and reporting details.

Referee: [Probe construction (methods)] The central claim requires that each probe pair differs from its contrast only in the target feature while holding topic, length, stance polarity, and other variables fixed. Vocabulary matching alone does not guarantee this; residual differences in parse structure, coreference, or non-vocabulary token distributions could be absorbed into the fine-tuning gradient and produce the observed shifts without the intended feature being causal. No quantitative checks (embedding cosine, edit distance, syntactic feature distributions) are reported to validate isolation at the level needed for causal attribution.

Authors: We agree that vocabulary matching controls lexical content but leaves open the possibility of residual structural differences. In the revised manuscript we will add quantitative checks: mean embedding cosine similarity across probe pairs (using the model's own embeddings), average Levenshtein edit distance, and comparative distributions of syntactic features such as dependency length and coreference chain statistics. These metrics will be reported per feature pair to support the isolation claim.

Revision: yes

Referee: [Abstract / Results] The abstract reports statistically significant shifts for eight features but provides no error bars, sample sizes, exact statistical tests, or controls for multiple comparisons. Without these details it is impossible to confirm that the probes isolate each feature or that results are robust.

Authors: The current abstract prioritizes brevity, but we accept that statistical transparency is needed. The revision will include in the abstract: number of probe pairs per feature (sample size), 95% confidence intervals on the reported shifts, the exact test (paired t-test with Bonferroni correction for the ten features), and a note that all p-values survive correction. These details already exist in the results section and will be summarized concisely in the abstract.

Revision: yes
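For readers who want to see what the named test involves, here is a minimal standard-library sketch of a paired t-test followed by a Bonferroni correction across ten features. It is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the scores are invented, and the p-value is obtained by numerically integrating the Student's t density with Simpson's rule rather than by calling a statistics library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # probability of landing in [-|t|, |t|]
    return max(0.0, 1.0 - central_mass)


def bonferroni(p_values, n_tests=None):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = n_tests if n_tests is not None else len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Hypothetical preference scores for one feature (illustrative numbers, not from the paper).
with_feature = [0.62, 0.58, 0.71, 0.66, 0.60]
without_feature = [0.55, 0.57, 0.63, 0.61, 0.52]

t, df = paired_t_statistic(with_feature, without_feature)
p = two_sided_p(t, df)
adjusted = bonferroni({"assertive_certainty": p}, n_tests=10)
print(f"t={t:.3f}, df={df}, p={p:.4f}, adjusted={adjusted['assertive_certainty']:.4f}")
```

With ten features, Bonferroni multiplies each raw p-value by ten, so a feature "survives correction" at the 0.05 level only if its raw p-value is below 0.005.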

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

The paper is an empirical measurement study that fine-tunes Llama-3.2-1B on vocabulary-matched stance-contrast probes and reports statistical shifts on a held-out animal-welfare benchmark. No derivation, equation, or first-principles claim is presented that reduces any reported result to a fitted parameter or self-citation by construction. The central findings are externally falsifiable via the held-out data and do not rely on load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. This is the normal non-circular outcome for a controlled experimental measurement paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the stance-contrast probes and the assumption that vocabulary matching removes all confounds except the target linguistic feature. No new entities are postulated. Standard statistical assumptions for significance testing are invoked but not listed as paper-specific axioms.

axioms (1)

Domain assumption: Vocabulary-matched stance-contrast probes isolate the effect of each linguistic feature. This is stated in the abstract as the measurement method; if false, the reported shifts cannot be attributed to the listed features.


