Recognition: no theorem link
Moral Mazes in the Era of LLMs
Pith reviewed 2026-05-15 14:37 UTC · model grok-4.3
The pith
LLM-rewritten human emails outperform both pure-human and pure-LLM emails in workplace scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the HR Simulator game, human emails pass only 23.5 percent of scenarios while LLM-generated emails pass 48-54 percent; LLM revisions of human drafts exceed both, suggesting that LLMs excel at formal, empathetic styles but still benefit from human starting points. A separate analysis of ten judge models finds that weaker models favor direct language while stronger models prefer subtlety, and that agreement among judges rises with scale.
What carries the argument
HR Simulator, a role-playing game where users write emails as an HR officer and receive scores from GPT-4o against scenario-specific rubrics.
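The rubric-judging loop described above can be sketched in a few lines. This is a minimal sketch, not the paper's actual protocol: the prompt wording, the 1-5 scale, and the pass threshold are all hypothetical stand-ins, since the review does not reproduce the exact rubrics or scoring cutoff.

```python
import re

def build_judge_prompt(email: str, rubric: str) -> str:
    """Assemble a grading prompt for an LLM judge (hypothetical wording)."""
    return (
        "You are grading a workplace email against a rubric.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Email:\n{email}\n\n"
        "Reply with a single line: SCORE: <1-5>."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge's reply; raise if absent."""
    match = re.search(r"SCORE:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1))

def passes(score: int, threshold: int = 4) -> bool:
    """Whether a score counts as a scenario pass, under an assumed threshold."""
    return score >= threshold
```

In use, `build_judge_prompt` would be sent to the judge model (GPT-4o in the paper) and the reply fed through `parse_score`; the pass rate is then the fraction of emails for which `passes` returns true.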
Load-bearing premise
GPT-4o acting as judge provides an accurate, unbiased measure of appropriate workplace communication that aligns with real human norms and expectations.
What would settle it
A controlled comparison in which actual HR professionals rate the effectiveness of the same human, LLM, and hybrid emails and find that pure human versions receive higher approval than either LLM-only or LLM-rewritten versions.
Original abstract
Navigating complex social situations is an integral part of corporate life, ranging from giving critical feedback without hurting morale to rejecting requests without alienating teammates. Although large language models (LLMs) are permeating the workplace, it is unclear how well they can navigate these norms. To investigate this question, we created HR Simulator, a game where users roleplay as an HR officer and write emails to tackle challenging workplace scenarios, evaluated with GPT-4o as a judge based on scenario-specific rubrics. We analyze over 600 human and LLM emails and find systematic differences in style: LLM emails are more formal and empathetic. Furthermore, humans underperform LLMs (e.g., 23.5% vs. 48-54% scenario pass rate), but human emails rewritten by LLMs can outperform both, which indicates a hybrid advantage. On the evaluation side, judges can exhibit differences in their email preferences: an analysis of 10 judge models reveals evidence for emergent tact, where weaker models prefer direct, blunt communication but stronger models prefer more subtle messages. Judges also agree with each other more as they scale, which hints at a convergence toward shared communicative norms that may differ from humans'. Overall, our results suggest LLMs could substantially reshape communication in the workplace if they are widely adopted in professional correspondence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HR Simulator, a role-playing benchmark in which participants act as HR officers and draft emails for challenging workplace scenarios. These emails (over 600 from humans and LLMs) are scored by GPT-4o against scenario-specific rubrics. The paper reports that LLMs achieve higher pass rates than humans (48-54% vs. 23.5%), that LLM-rewritten human emails outperform both, that LLMs produce more formal and empathetic prose, and that judge models exhibit emergent tact and increasing agreement as they scale.
Significance. If the GPT-4o judgments track actual workplace norms, the hybrid advantage and the scaling behavior of judge models would indicate that LLMs can measurably improve professional communication and may converge on communicative standards distinct from current human practice. The work supplies a concrete, reproducible testbed for studying AI-mediated social norms.
major comments (2)
- [Evaluation section] The headline pass-rate comparison (humans 23.5% vs. LLMs 48-54%) and the hybrid-outperformance claim rest entirely on GPT-4o rubric scores. No external validation against human HR raters, real-world outcome data, or blinded human judgment is reported, leaving open the possibility that higher scores reflect stylistic alignment with the judge model rather than better adherence to workplace expectations.
- [Results and Methods sections] The manuscript states quantitative pass rates and style differences but supplies no details on experimental controls, statistical tests for the reported gaps, prompt-engineering protocols for the LLM conditions, inter-rater reliability of the GPT-4o judge, or potential confounds such as training-data overlap with the scenarios.
minor comments (2)
- [Abstract] Abstract: the claim of 'over 600 human and LLM emails' would be clearer if the exact counts per condition and per scenario were stated.
- [Figures] Figure captions: several figures comparing judge-model preferences lack error bars or sample-size annotations, making it difficult to assess the reliability of the reported convergence trend.
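The missing error bars flagged in the minor comment above could be supplied with a percentile bootstrap over per-email pass/fail outcomes. A minimal sketch with hypothetical counts (the review reports rates, not raw counts; 47/200 is chosen only to match the 23.5% figure):

```python
import random

def bootstrap_pass_rate_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a pass rate.

    outcomes: list of 0/1 per-email pass indicators.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical sample: 200 human emails with a 23.5% pass rate.
human = [1] * 47 + [0] * 153
low, high = bootstrap_pass_rate_ci(human)
```

The resulting interval (roughly plus or minus a few percentage points at this sample size) is what the figure captions would need to make the judge-convergence trend assessable.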
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our HR Simulator benchmark. We address each major comment below and outline revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Evaluation section] The headline pass-rate comparison (humans 23.5% vs. LLMs 48-54%) and the hybrid-outperformance claim rest entirely on GPT-4o rubric scores. No external validation against human HR raters, real-world outcome data, or blinded human judgment is reported, leaving open the possibility that higher scores reflect stylistic alignment with the judge model rather than better adherence to workplace expectations.
Authors: We agree that sole reliance on GPT-4o introduces a risk of judge-model alignment bias. The scenario rubrics were derived from standard HR guidelines, yet we acknowledge the absence of human rater validation in the current version. In revision we will add a dedicated limitations subsection, report agreement rates between GPT-4o and a second model (Claude-3.5), and include a small-scale blinded human rating pilot on a subset of emails. Real-world outcome data lies outside the scope of this controlled benchmark study.
Revision: partial
-
Referee: [Results and Methods sections] The manuscript states quantitative pass rates and style differences but supplies no details on experimental controls, statistical tests for the reported gaps, prompt-engineering protocols for the LLM conditions, inter-rater reliability of the GPT-4o judge, or potential confounds such as training-data overlap with the scenarios.
Authors: We accept that these methodological details were insufficiently reported. The revised Methods section will specify: fixed prompt templates and temperature settings for each LLM condition; randomization of scenario order; chi-square tests with p-values for pass-rate differences; multiple independent judge runs with reported agreement (Cohen's kappa); and confirmation that all scenarios were newly authored for this study to minimize training-data overlap. These additions will be placed in a new "Experimental Controls" subsection.
Revision: yes
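The two statistics the authors commit to can be sketched directly. The counts below are hypothetical (chosen to match the reported 23.5% and ~50% pass rates over an assumed 200 emails per condition); the paper itself reports only rates.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]],
    without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def cohens_kappa(labels1, labels2):
    """Cohen's kappa for two judges' binary pass/fail labels on the same emails."""
    assert len(labels1) == len(labels2)
    n = len(labels1)
    p_obs = sum(x == y for x, y in zip(labels1, labels2)) / n
    p1 = sum(labels1) / n
    p2 = sum(labels2) / n
    p_exp = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts: 47/200 human passes vs. 100/200 LLM passes.
stat = chi2_2x2(47, 153, 100, 100)  # well above the 3.84 critical value at p=0.05
```

At these assumed sample sizes the pass-rate gap is far beyond chance; the kappa function would be applied to repeated GPT-4o judge runs on the same emails to report inter-run reliability.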
- Out of scope: real-world outcome data and longitudinal workplace validation, which cannot be obtained within the current simulated benchmark design.
Circularity Check
No circularity in empirical evaluation of LLM workplace communication
Full rationale
The paper is an empirical study that creates workplace scenarios, collects human and LLM-generated emails, and scores them via GPT-4o rubrics to compare pass rates and styles. No equations, fitted parameters, or first-principles derivations exist that could reduce reported results to inputs by construction. The central claims rest on direct measurements against an external judge rubric rather than any self-referential loop, self-citation chain, or renamed known result. This is the standard case of a self-contained empirical comparison with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: GPT-4o serves as a reliable, unbiased judge of workplace email appropriateness.