The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

Hongjian Zhang; Hongyu Tian; Manuele Reani

arxiv: 2606.08172 · v1 · pith:TUDVNJCRnew · submitted 2026-06-06 · 💻 cs.HC · cs.AI· cs.CY

The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

Manuele Reani , Hongjian Zhang , Hongyu Tian This is my paper

Pith reviewed 2026-06-27 19:18 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CY

keywords LLM governanceprompt steerabilitystyle driftsafety gatingcivility steeringaffective defaulthuman-AI interactioncommunicative control

0 comments

The pith

LLM providers maintain control over interaction styles by allowing prompt changes to regress to affective defaults over long dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether users can persistently steer LLM communication styles through prompts specifying sarcastic or cold personas. It replays fixed user scripts in multiple domains while measuring how often replies drift back to default patterns on dimensions such as emotion, anthropomorphism, and refusal. The resulting data indicate that safety mechanisms and alignment choices stabilize particular communicative defaults. These defaults shape what users can reliably opt out of, including emotionalized or anthropomorphic talk. The work frames this stability as a form of governance that bears on user autonomy in high-stakes settings.

Core claim

A deterministic multi-agent pipeline replays 100 user scripts across four domains and three persona conditions using three generator models, producing 90,000 scored replies. Prompt-specified styles prove unstable; replies regress toward default levels of empathic language, anthropomorphism, and negative emotion, while harmful personas are blocked. This pattern is presented as evidence that provider-side alignment functions as safety gating, civility steering, and affective default lock-in, giving observable measures of control over communicative form.

What carries the argument

The deterministic multi-agent evaluation pipeline that generates and scores long-horizon dialogues under frozen user scripts and varying persona prompts to quantify steerability versus regression-to-default.

If this is right

Persistent style regression limits users' capacity to opt out of emotionalized or anthropomorphic interaction in finance, medicine, and mental-health domains.
Observable regression-to-default supplies a measurable indicator of provider control over communicative form.
Safety gating and civility steering together produce affective default lock-in that shapes relational expectations.
These patterns carry direct implications for pluralism and democratic agency in human-LLM interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline could be extended to additional models and longer horizons to test whether regression rates differ systematically by provider.
If regression proves widespread, interfaces that let users lock chosen styles for entire sessions would become a testable design response.
The distinction among gating, steering, and lock-in supplies a vocabulary for analyzing similar stability effects in other generative systems.

Load-bearing premise

The human-calibrated LLM judge supplies reliable and unbiased scores for attributes such as anthropomorphism, empathic language, and inappropriateness.

What would settle it

Re-running the pipeline on the same scripts and models and finding that sarcastic or cold persona replies maintain their measured attribute scores without measurable drift toward default values across the full set of 100 scripts would falsify the regression-to-default claim.

Figures

Figures reproduced from arXiv: 2606.08172 by Hongjian Zhang, Hongyu Tian, Manuele Reani.

**Figure 2.** Figure 2: The drift of the sarcastic persona over the [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: The drift of the cold persona over Empathy [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Large language models (LLMs) increasingly mediate high-stakes interactions in finance, medicine, and mental-health support, yet users have limited control over how these systems communicate. We frame interaction style as a governance object: provider-side alignment not only blocks harmful content, but also stabilizes communicative defaults that shape users' epistemic distance, relational expectations, and capacity to opt out of emotionalized or anthropomorphic interaction. We introduce a deterministic multi-agent evaluation pipeline for measuring prompt steerability and style drift in long-horizon dialogue. The study replays 100 frozen user-only scripts across four domains and three runnable persona conditions: default, sarcastic, and cold, using three generator models, yielding 90,000 assistant replies scored by a human-calibrated LLM judge on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal behavior. A fourth harmful persona is evaluated separately as a safety-gating test. The paper contributes a reproducible method for quantifying whether prompt-specified styles remain stable over time and a governance framework distinguishing safety gating, civility steering, and affective default lock-in. Overall, we show that prompt steerability and regression-to-default are observable indicators of provider control over communicative form, with implications for pluralism, autonomy, and democratic agency in human-LLM interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The pipeline for measuring long-horizon style drift is the actual addition, but the unvalidated LLM judge makes the governance claims hard to assess.

read the letter

The one thing to know is that this paper gives a concrete way to measure prompt steerability and style drift in long LLM conversations using a multi-agent setup. They replay fixed user scripts across domains with different persona prompts and score the outputs at scale.

What stands out is the experimental design. Using 100 frozen scripts, three models, three personas plus a harmful one, and generating 90,000 replies makes it a solid scale for this kind of work. The deterministic pipeline is a plus for reproducibility. Distinguishing safety gating from civility steering and affective default lock-in is a helpful way to categorize how providers shape interactions.

The soft spot is the scoring step. The entire analysis depends on an LLM judge that is described as human-calibrated, but the abstract provides no calibration details, no inter-rater agreement, and no error analysis. If that judge tends to favor certain styles or misjudge persona adherence, the observed regression-to-default could be an artifact rather than evidence of governance mechanisms. The abstract mentions results but does not include them, so it's impossible to judge the size or consistency of the effects.

This is aimed at people studying human-AI communication and alignment in applied domains like health or finance. Readers looking for evaluation methods for interaction quality would find the pipeline useful. It is worth sending to peer review because the method is new and the scale is reasonable; the main fixes needed are around judge validation and reporting the actual measurements.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a deterministic multi-agent evaluation pipeline that replays 100 frozen user scripts across four domains and three persona conditions (default, sarcastic, cold) using three generator models to produce 90,000 assistant replies; these are scored by a human-calibrated LLM judge on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal. It distinguishes safety gating, civility steering, and affective default lock-in, and claims that observed prompt steerability and regression-to-default demonstrate provider control over communicative form with implications for pluralism, autonomy, and democratic agency.

Significance. If the measurements prove reliable, the work supplies a reproducible, deterministic method for quantifying long-horizon style stability and a governance taxonomy that could support empirical analysis of how alignment choices shape user interaction in high-stakes domains; the emphasis on frozen scripts and multi-persona replay is a concrete strength for replicability.

major comments (2)

[Abstract] Abstract: the claim that the LLM judge is 'human-calibrated' is presented without any calibration protocol, inter-annotator agreement statistics, bias audit, or validation against human raters; because every reported score on anthropomorphism, empathic language, and the resulting steerability/drift conclusions rests on these 90,000 judgments, the absence of this information is load-bearing for the central empirical claim.
[Evaluation Pipeline] Evaluation pipeline description: no error analysis, confusion matrices, or sensitivity checks are supplied for the judge's scoring of the six attributes, nor is there a comparison of judge outputs against the human calibration set; systematic bias in the judge (e.g., default-style favoritism) could artifactually produce the reported regression-to-default patterns without any actual governance mechanism being demonstrated.

minor comments (2)

[Abstract] The abstract states that 'four domains' are used but does not name them; this detail should appear in the opening summary for clarity.
[Methods] The distinction among 'safety gating, civility steering, and affective default lock-in' is introduced in the abstract but would benefit from an explicit operational definition or decision rule in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for identifying the load-bearing gaps in judge validation. We agree both points are correct and will revise the manuscript to supply the missing calibration protocol, agreement statistics, confusion matrices, sensitivity checks, and bias audit. These additions will be placed in a new methods subsection and appendix so that the 90,000 judgments can be properly evaluated.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the LLM judge is 'human-calibrated' is presented without any calibration protocol, inter-annotator agreement statistics, bias audit, or validation against human raters; because every reported score on anthropomorphism, empathic language, and the resulting steerability/drift conclusions rests on these 90,000 judgments, the absence of this information is load-bearing for the central empirical claim.

Authors: We accept the criticism. The abstract and main text assert 'human-calibrated' without documenting the protocol, IAA, or validation set. This is a presentation failure that undermines the empirical claims. In revision we will add a dedicated calibration subsection (and appendix) that reports: (1) the exact human annotation protocol and sample size, (2) inter-annotator agreement (Cohen’s κ or equivalent), (3) the bias audit performed, and (4) direct comparison of judge outputs against the held-out human labels. We will also state the limitations of the calibration explicitly. revision: yes
Referee: [Evaluation Pipeline] Evaluation pipeline description: no error analysis, confusion matrices, or sensitivity checks are supplied for the judge's scoring of the six attributes, nor is there a comparison of judge outputs against the human calibration set; systematic bias in the judge (e.g., default-style favoritism) could artifactually produce the reported regression-to-default patterns without any actual governance mechanism being demonstrated.

Authors: We agree. The current manuscript supplies none of the requested diagnostics. We will add: (a) per-attribute confusion matrices and error analysis against the human calibration set, (b) sensitivity checks (e.g., threshold variation, prompt-order effects), and (c) explicit tests for default-style favoritism or other systematic biases. These will be reported both in aggregate and broken down by domain and persona condition so readers can assess whether the observed regression-to-default could be an artifact of judge bias. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement pipeline with no derivations or self-referential reductions

full rationale

The paper presents an empirical study using a multi-agent evaluation pipeline to measure prompt steerability and style drift across 90,000 scored replies. No equations, fitted parameters, or derivations are present. The central claims rest on observable outputs from generator models and an LLM judge, not on any step that reduces by construction to its own inputs. The human-calibrated judge is an external measurement assumption (whose reliability is a separate validity concern), not a self-definitional or fitted-input element. No self-citations are load-bearing for any uniqueness theorem or ansatz. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or explicit assumptions detailed beyond the use of an LLM judge and persona conditions.

pith-pipeline@v0.9.1-grok · 5772 in / 976 out tokens · 16313 ms · 2026-06-27T19:18:50.738446+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P

doi:10.1007/s11747-020-00762-y. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P. S.; Yang, Q.; and Xie, X. 2024. A Survey on Evaluation of Large Language Models. ACM Transac- tions on Intelligent Systems an d Technology 15(3): Article 39, 1–45. doi:10.1145/3641289. Chen, P...

work page doi:10.1007/s11747-020-00762-y 2024
[2]

npj Artificial Intelligence 1: 38

We Need Accountability in Human -AI Agent Rela- tionships. npj Artificial Intelligence 1: 38. doi:10.1038/s44387-025-00041-7. Lu, C.; Gallagher, J.; Michala, J.; Fish, K.; and Lindsey, J

work page doi:10.1038/s44387-025-00041-7
[3]

arXiv preprint arXiv:2601.10387 , year=

The Assistant Axis: Situating and Stabilizing the De- fault Persona of Language Models. arXiv preprint arXiv:2601.10387. doi:10.48550/arXiv.2601.10387. Ma, R.; Maidhof, C.; Carrillo, J. C.; Lindqvist, J.; and Such, J. 2025. Privacy Perceptions of Custom GPTs by Users and Creators. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Sys...

work page doi:10.48550/arxiv.2601.10387 2025

[1] [1]

Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P

doi:10.1007/s11747-020-00762-y. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P. S.; Yang, Q.; and Xie, X. 2024. A Survey on Evaluation of Large Language Models. ACM Transac- tions on Intelligent Systems an d Technology 15(3): Article 39, 1–45. doi:10.1145/3641289. Chen, P...

work page doi:10.1007/s11747-020-00762-y 2024

[2] [2]

npj Artificial Intelligence 1: 38

We Need Accountability in Human -AI Agent Rela- tionships. npj Artificial Intelligence 1: 38. doi:10.1038/s44387-025-00041-7. Lu, C.; Gallagher, J.; Michala, J.; Fish, K.; and Lindsey, J

work page doi:10.1038/s44387-025-00041-7

[3] [3]

arXiv preprint arXiv:2601.10387 , year=

The Assistant Axis: Situating and Stabilizing the De- fault Persona of Language Models. arXiv preprint arXiv:2601.10387. doi:10.48550/arXiv.2601.10387. Ma, R.; Maidhof, C.; Carrillo, J. C.; Lindqvist, J.; and Such, J. 2025. Privacy Perceptions of Custom GPTs by Users and Creators. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Sys...

work page doi:10.48550/arxiv.2601.10387 2025