pith. machine review for the scientific record.

arXiv: 2603.20231 · v2 · submitted 2026-03-06 · cs.CY · cs.AI · cs.CL

Moral Mazes in the Era of LLMs

Pith reviewed 2026-05-15 14:37 UTC · model grok-4.3

classification: cs.CY, cs.AI, cs.CL
keywords: LLMs, workplace communication, HR simulator, hybrid systems, social norms, email writing, AI evaluation

The pith

LLM-rewritten human emails outperform both pure humans and pure LLMs in workplace scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds HR Simulator, a role-play game where participants act as HR officers and draft emails for difficult workplace situations such as giving critical feedback or rejecting requests. These emails are scored by GPT-4o against scenario-specific rubrics. LLMs produce more formal and empathetic messages and pass 48-54 percent of scenarios compared with humans at 23.5 percent, yet the strongest results occur when humans draft first and LLMs rewrite the text. A sympathetic reader would care because the findings point to how AI tools might alter the texture of everyday corporate interactions if widely used for professional writing.

Core claim

In the HR Simulator game, human emails pass only 23.5 percent of scenarios while LLM-generated emails reach 48-54 percent; however, LLM revisions of human drafts exceed both, showing that LLMs excel at formal and empathetic styles but benefit from human starting points. Separate analysis of ten judge models reveals that weaker models favor direct language while stronger models prefer subtlety, and agreement among judges rises with scale.
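Agreement among judges can be quantified in several ways. The sketch below shows one common option, the mean pairwise Spearman rank correlation across judges, in plain Python. The judge names and scores are hypothetical placeholders, not data from the paper, and the helper names are illustrative.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise_spearman(scores_by_judge):
    """Average Spearman correlation over all unordered pairs of judges."""
    pairs = list(combinations(scores_by_judge.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores that three hypothetical judges give to the same five emails.
scores = {
    "judge_a": [1, 2, 3, 4, 5],
    "judge_b": [2, 1, 3, 5, 4],
    "judge_c": [5, 4, 3, 2, 1],
}
print(round(mean_pairwise_spearman(scores), 3))
```

A rank-based measure compares only the ordering of emails, so two judges that use different score ranges but rank emails identically still count as agreeing.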

What carries the argument

HR Simulator, a role-playing game where users write emails as an HR officer and receive scores from GPT-4o against scenario-specific rubrics.
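As a rough illustration of how pass rates per writing condition could be tallied once a judge has returned a pass or fail verdict for each email, here is a minimal sketch. The record layout, the condition labels, and the verdicts are assumptions made for this example; the page does not describe the paper's actual data format or aggregation.

```python
from collections import defaultdict


def pass_rates(records):
    """Map each condition to the fraction of its emails the judge passed.

    records: iterable of (condition, passed) pairs, where passed is a bool.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for condition, passed in records:
        totals[condition] += 1
        passes[condition] += int(passed)
    return {condition: passes[condition] / totals[condition] for condition in totals}


# Hypothetical judge verdicts, for illustration only.
records = [
    ("human", False), ("human", True), ("human", False), ("human", False),
    ("llm", True), ("llm", False),
    ("human+llm", True), ("human+llm", True),
]

for condition, rate in pass_rates(records).items():
    print(f"{condition}: {rate:.1%}")
```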

Load-bearing premise

GPT-4o acting as judge provides an accurate, unbiased measure of appropriate workplace communication that aligns with real human norms and expectations.

What would settle it

A controlled comparison in which actual HR professionals rate the effectiveness of the same human, LLM, and hybrid emails and find that pure human versions receive higher approval than either LLM-only or LLM-rewritten versions.

Figures

Figures reproduced from arXiv: 2603.20231 by Ari Holtzman, Chenhao Tan, Dang Nguyen, Harvey Yiyun Fu, Peter West.

Figure 1: LLMs tend to write high empathy, high formality emails, whereas human emails …
Figure 2: HR Simulator™ game interface. More details can be found in Appendix …
Figure 3: HR Simulator system. The player reads a scenario email and responds. The …
Figure 4: The hybrid advantage. Green arrows denote when the Human+LLM pass rate is …
Figure 5: Rewriting LLM emails into the low-empathy low-formality quadrant. The base …
Figure 6: Pairs of models increasingly agree on email quality as they scale up. Each point is …
Figure 7: Stronger models prefer more tactful emails while smaller models prefer less …
Figure 8: Scenario prompt for scenario 1.
Figure 9: Recipient prompt in scenario 1.
Figure 10: Judge prompt in scenario 1.
Figure 11: Scenario prompt for scenario 2.
Figure 12: Forwarded email: Mark’s complaint email to his manager.
Figure 13: Forwarded email: Emily’s complaint email to her manager.
Figure 14: Recipient prompt for the character “Mark” in scenario 2. Part 1/2.
Figure 15: Recipient prompt for the character “Mark” in scenario 2. Part 2/2.
Figure 16: Recipient prompt for the character “Emily” in scenario 2. Part 1/2.
Figure 17: Recipient prompt for the character “Emily” in scenario 2. Part 2/2.
Figure 18: Judge prompt in scenario 2.
Figure 19: Scenario prompt for scenario 3.
Figure 20: Recipient prompt in scenario 3.
Figure 21: Judge prompt in scenario 3.
Figure 22: Scenario prompt for scenario 4.
Figure 23: Recipient prompt in scenario 4.
Figure 24: Game Master prompt for scenario 4. Part 1/2.
Figure 25: Game Master prompt for scenario 4. Part 2/2.
Figure 26: Judge prompt in scenario 4.
Figure 27: Scenario prompt for scenario 5.
Figure 28: Recipient prompt in scenario 5. Part 1/2.
Figure 29: Recipient prompt in scenario 5. Part 2/2.
Figure 30: Game Master prompt for scenario 5. Part 1/2.
Figure 31: Game Master prompt for scenario 5. Part 2/2.
Figure 32: Judge prompt in scenario 5.
Figure 33: Contrasting tactfulness preferences between large and small models in levels 2 …
Figure 34: Politeness as an alternative explanation for the large–small model disagreement.
Figure 35: Length-controlled pass rates for human and LLM-only emails as judged by …
Figure 36: Integrated success rates under a GPT-5.2 judge. Hybrid gains are heterogeneous …
Figure 37: Integrated success rates under a Claude 4.5 Sonnet judge. Rewriting often …
Figure 38: Integrated success rates under a Grok 4 Fast judge. Hybrid gains vary by scenario …
Original abstract

Navigating complex social situations is an integral part of corporate life, ranging from giving critical feedback without hurting morale to rejecting requests without alienating teammates. Although large language models (LLMs) are permeating the workplace, it is unclear how well they can navigate these norms. To investigate this question, we created HR Simulator, a game where users roleplay as an HR officer and write emails to tackle challenging workplace scenarios, evaluated with GPT-4o as a judge based on scenario-specific rubrics. We analyze over 600 human and LLM emails and find systematic differences in style: LLM emails are more formal and empathetic. Furthermore, humans underperform LLMs (e.g., 23.5% vs. 48-54% scenario pass rate), but human emails rewritten by LLMs can outperform both, which indicates a hybrid advantage. On the evaluation side, judges can exhibit differences in their email preferences: an analysis of 10 judge models reveals evidence for emergent tact, where weaker models prefer direct, blunt communication but stronger models prefer more subtle messages. Judges also agree with each other more as they scale, which hints at a convergence toward shared communicative norms that may differ from humans'. Overall, our results suggest LLMs could substantially reshape communication in the workplace if they are widely adopted in professional correspondence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HR Simulator, a role-playing benchmark in which participants act as HR officers and draft emails for challenging workplace scenarios. These emails (over 600 from humans and LLMs) are scored by GPT-4o against scenario-specific rubrics. The paper reports that LLMs achieve higher pass rates than humans (48-54% vs. 23.5%), that LLM-rewritten human emails outperform both, that LLMs produce more formal and empathetic prose, and that judge models exhibit emergent tact and increasing agreement as they scale.

Significance. If the GPT-4o judgments track actual workplace norms, the hybrid advantage and the scaling behavior of judge models would indicate that LLMs can measurably improve professional communication and may converge on communicative standards distinct from current human practice. The work supplies a concrete, reproducible testbed for studying AI-mediated social norms.

major comments (2)
  1. [Evaluation section] The headline pass-rate comparison (humans 23.5% vs. LLMs 48-54%) and the hybrid-outperformance claim rest entirely on GPT-4o rubric scores. No external validation against human HR raters, real-world outcome data, or blinded human judgment is reported, leaving open the possibility that higher scores reflect stylistic alignment with the judge model rather than better adherence to workplace expectations.
  2. [Results and Methods sections] The manuscript states quantitative pass rates and style differences but supplies no details on experimental controls, statistical tests for the reported gaps, prompt-engineering protocols for the LLM conditions, inter-rater reliability of the GPT-4o judge, or potential confounds such as training-data overlap with the scenarios.
minor comments (2)
  1. [Abstract] The claim of 'over 600 human and LLM emails' would be clearer if the exact counts per condition and per scenario were stated.
  2. [Figures] Several figures comparing judge-model preferences lack error bars or sample-size annotations, making it difficult to assess the reliability of the reported convergence trend.
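To make the point about error bars and sample sizes concrete, the sketch below computes a Wilson score interval for a pass rate. The counts (47 passes out of 200 emails, which gives 23.5%) are hypothetical and chosen only to match the headline percentage; the per-condition counts are not stated on this page.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Hypothetical counts: 47 of 200 emails passing corresponds to a 23.5% pass rate.
low, high = wilson_interval(47, 200)
print(f"pass rate 23.5%, approximate 95% interval: {low:.3f} to {high:.3f}")
```

The interval narrows as the number of emails grows, which is why the same percentage is much less informative when it comes from a few dozen emails than from a few hundred.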

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our HR Simulator benchmark. We address each major comment below and outline revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Evaluation section] The headline pass-rate comparison (humans 23.5% vs. LLMs 48-54%) and the hybrid-outperformance claim rest entirely on GPT-4o rubric scores. No external validation against human HR raters, real-world outcome data, or blinded human judgment is reported, leaving open the possibility that higher scores reflect stylistic alignment with the judge model rather than better adherence to workplace expectations.

    Authors: We agree that sole reliance on GPT-4o introduces a risk of judge-model alignment bias. The scenario rubrics were derived from standard HR guidelines, yet we acknowledge the absence of human rater validation in the current version. In revision we will add a dedicated limitations subsection, report agreement rates between GPT-4o and a second model (Claude-3.5), and include a small-scale blinded human rating pilot on a subset of emails. Real-world outcome data lies outside the scope of this controlled benchmark study.

    Revision: partial

  2. Referee: [Results and Methods sections] The manuscript states quantitative pass rates and style differences but supplies no details on experimental controls, statistical tests for the reported gaps, prompt-engineering protocols for the LLM conditions, inter-rater reliability of the GPT-4o judge, or potential confounds such as training-data overlap with the scenarios.

    Authors: We accept that these methodological details were insufficiently reported. The revised Methods section will specify: fixed prompt templates and temperature settings for each LLM condition; randomization of scenario order; chi-square tests with p-values for pass-rate differences; multiple independent judge runs with reported agreement (Cohen’s kappa); and confirmation that all scenarios were newly authored for this study to minimize training-data overlap. These additions will be placed in a new “Experimental Controls” subsection.

    Revision: yes
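The two statistics named in the response above can be sketched in a few lines of plain Python: a Pearson chi-square test on a 2x2 pass/fail table and Cohen's kappa between two judge runs. All counts and labels here are hypothetical, the function names are illustrative, and the chi-square version shown omits the continuity correction, which is one of several reasonable choices.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, the upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences from two raters or runs."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail counts: humans 47 pass and 153 fail, LLMs 100 pass and 100 fail.
stat, p_value = chi_square_2x2(47, 153, 100, 100)
print(f"chi-square = {stat:.2f}, p = {p_value:.2e}")

# Hypothetical verdicts from two independent judge runs on the same eight emails.
run_1 = ["pass", "fail", "pass", "pass", "fail", "fail", "pass", "fail"]
run_2 = ["pass", "fail", "pass", "fail", "fail", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(run_1, run_2):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a simple match rate when one verdict dominates.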

standing simulated objections not resolved
  • Real-world outcome data or longitudinal workplace validation, which cannot be obtained within the current simulated benchmark design.

Circularity Check

0 steps flagged

No circularity in empirical evaluation of LLM workplace communication

full rationale

The paper is an empirical study that creates workplace scenarios, collects human and LLM-generated emails, and scores them via GPT-4o rubrics to compare pass rates and styles. No equations, fitted parameters, or first-principles derivations exist that could reduce reported results to inputs by construction. The central claims rest on direct measurements against an external judge rubric rather than any self-referential loop, self-citation chain, or renamed known result. This is the standard case of a self-contained empirical comparison with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen scenarios and GPT-4o rubric judgments faithfully represent real-world workplace norms without systematic bias toward LLM output styles.

axioms (1)
  • Domain assumption: GPT-4o serves as a reliable and unbiased judge for workplace email appropriateness.
    All quantitative results depend on this single LLM judge; no human validation or cross-check is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5540 in / 1266 out tokens · 63275 ms · 2026-05-15T14:37:30.093525+00:00 · methodology
