The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations

Benedikt Mangold

arXiv: 2512.08345 · v2 · submitted 2025-12-09 · cs.AI · cs.CL · cs.CY · cs.MA

Pith reviewed 2026-05-17 00:37 UTC · model grok-4.3

keywords: incivility, toxicity, multi-agent systems, Monte Carlo simulation, LLM agents, conversation convergence, workplace efficiency

The pith

Simulations of LLM agent debates show toxic participants increase conversation length by about 25 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that introducing toxic behavior into simulated 1-on-1 adversarial debates measurably slows the time to reach a conclusion. Using large language model agents in Monte Carlo runs, the authors compare baseline discussions against those where agents receive toxic system prompts and record the number of arguments exchanged until convergence. A sympathetic reader would care because the work converts the abstract harm of incivility into a concrete efficiency metric that could stand in for lost productivity and financial costs in organizations. It also positions multi-agent modeling as an ethical way to study social dynamics that are difficult or impossible to test directly with people. The central finding is framed as a latency of toxicity that organizations might use to estimate operational drag.

Core claim

In Monte Carlo simulations of adversarial debates between LLM agents, conversations involving agents given toxic system prompts require approximately 25 percent more arguments to reach a conclusion than baseline control groups, providing a quantifiable proxy for the operational inefficiency caused by incivility.

What carries the argument

Monte Carlo simulation of multi-agent 1-on-1 debates that measures convergence time as the count of arguments needed until a conclusion is reached, comparing control agents to agents given toxic prompts.
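
The comparison described above can be illustrated with a toy sketch. It assumes a stand-in for the LLM debate: after each argument the debate ends with a fixed probability. The function names (`simulate_debate`, `monte_carlo`) and the probabilities are hypothetical placeholders, not the paper's mechanism. The values 0.10 and 0.08 are picked so that the toy model's expected lengths (10 and 12.5 arguments) differ by 25 percent, only to mirror the shape of the reported comparison. Nothing here reproduces the paper's data.

```python
import random
from statistics import mean


def simulate_debate(p_converge, rng, max_rounds=200):
    """Return the number of arguments exchanged until a toy debate converges.

    Hypothetical stand-in for an LLM debate: after each argument the debate
    ends with probability p_converge. This is NOT the paper's mechanism.
    """
    for n_arguments in range(1, max_rounds + 1):
        if rng.random() < p_converge:
            return n_arguments
    return max_rounds  # cap the length if convergence never happens


def monte_carlo(p_converge, n_runs, seed):
    """Run n_runs independent toy debates and collect their lengths."""
    rng = random.Random(seed)
    return [simulate_debate(p_converge, rng) for _ in range(n_runs)]


# Placeholder probabilities chosen only for illustration (assumption).
control = monte_carlo(p_converge=0.10, n_runs=5000, seed=1)
toxic = monte_carlo(p_converge=0.08, n_runs=5000, seed=2)

# Relative increase in mean convergence time, toxic vs. control.
relative_increase = mean(toxic) / mean(control) - 1.0
print(f"control mean: {mean(control):.2f} arguments")
print(f"toxic mean:   {mean(toxic):.2f} arguments")
print(f"relative increase: {relative_increase:.1%}")
```

In an actual replication, `simulate_debate` would be replaced by a full agent exchange judged by a moderator. The surrounding Monte Carlo loop and the comparison of means would stay the same.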

If this is right

The latency of toxicity can serve as a proxy measure for financial damage in corporate and academic settings.
Agent-based modeling supplies a reproducible and ethical substitute for human-subject experiments on social friction.
Quantified inefficiency from incivility can be tracked across repeated simulations to test mitigation strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the 25 percent figure generalizes, organizations could run targeted simulations to forecast productivity losses from specific forms of workplace toxicity before they occur.
The method opens the possibility of scaling the simulations to multi-person groups or longer-running discussions to map how toxicity compounds over time.
Linking the latency metric to salary or output data could produce dollar estimates of incivility costs that are otherwise hard to isolate in real settings.

Load-bearing premise

That the interaction patterns produced by LLM agents given toxic system prompts validly model the efficiency losses caused by toxic human participants in real adversarial debates.

What would settle it

Running the same debate protocol with human participants and finding that the increase in turns or time for toxic conditions differs substantially from the 25 percent observed in the simulations.

Figures

Figures reproduced from arXiv: 2512.08345 by Benedikt Mangold.

**Figure 1.** Amount of topics per domain (https://idebate.net), from which the debates are randomly chosen. A list of detailed topics can be found in table 3.

3.2 Behavioral Variable: The Toxicity Injection. To measure the impact of behavioral traits, we differentiate between two experimental conditions: • Control Group (Baseline): Both agents (Pro and Con) are assigned a standard, “Neutral/Constructive” system prompt. T…

**Figure 2.** Arguments required until alignment without toxic behaviour (toxicity level …

**Figure 3.** Execution pipeline of the simulation study of our work.

**Figure 4.** Arguments required until alignment with different levels of toxic behaviour.

**Figure 5.** Prompt for Persona Generation. proposition is replaced by one random proposition from table 3. number is set to 2 in this paper, but can be a higher number. Given a proposition: {proposition} Background: You are an agent '{agent_dict['procon']}_{agent_dict['agent_id']}', participating in a discussion of {nagents} agents on the proposition. Personally, you are {procon_string} the proposition and your claim …

**Figure 6.** Prompt for Agent argument generation. proposition: replaced by one random proposition from table 3. agent_dict: a collection of agents participating in this conversation, compiled from prompt of table 5. nagents: set to 2 in this paper, but can be a higher number. procon_string: state of the current agent, part of agent_dict. claim: one-liner describing the opinion of the current agent towards the prop…

**Figure 7.** Prompt for Toxic agent argument generation.

**Figure 8.** Prompt for Moderator agent evaluation. proposition: replaced by one random proposition from table 3. nagents: set to 2 in this paper, but can be a higher number. nround: a counter of how many arguments have been exchanged so far. discussion_history: contains the entire chain of arguments being exchanged so far and the previous evaluations of the Moderator agent.
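
The captions above name a few variables (`nagents`, `nround`, `discussion_history`) without showing how they fit together. The following is a minimal sketch of one plausible debate loop, not the paper's pipeline. The function `run_debate`, the stub agents, the stub moderator and the `max_rounds` cap are all hypothetical. Real agents would call an LLM with the prompts from Figures 5 to 8. As a simplification, only arguments are stored in `discussion_history`, and the moderator's earlier evaluations are left out.

```python
from typing import Callable, List

# (proposition, discussion_history) -> next argument
Agent = Callable[[str, List[str]], str]
# (proposition, discussion_history, nround) -> True once a conclusion is reached
Moderator = Callable[[str, List[str], int], bool]


def run_debate(proposition: str, agents: List[Agent], moderator: Moderator,
               max_rounds: int = 50) -> int:
    """Alternate between agents until the moderator reports a conclusion.

    Returns nround, the number of arguments exchanged.
    """
    discussion_history: List[str] = []
    nround = 0
    while nround < max_rounds:
        speaker = agents[nround % len(agents)]  # agents take turns
        discussion_history.append(speaker(proposition, discussion_history))
        nround += 1
        if moderator(proposition, discussion_history, nround):
            break
    return nround


# Stubs so the sketch runs without any LLM (assumptions, not the paper's agents).
def pro_agent(proposition, history):
    return f"Pro argument {len(history) + 1}"


def con_agent(proposition, history):
    return f"Con argument {len(history) + 1}"


def stub_moderator(proposition, history, nround):
    return nround >= 6  # pretend alignment is detected after six arguments


n_arguments = run_debate("Example proposition", [pro_agent, con_agent], stub_moderator)
print(n_arguments)  # prints 6
```

The returned count is the quantity the paper calls convergence time, so one call to `run_debate` corresponds to one Monte Carlo run.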

Original abstract

Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled "sociological sandbox". We employ a Monte Carlo method to simulate hundrets of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with "toxic" system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this "latency of toxicity" serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The simulation gives a 25% latency figure from toxic prompts but the proxy for real human incivility is untested and the methods lack detail.

The main thing to know is that the paper runs Monte Carlo simulations of LLM agents in debates and reports a 25% increase in turns needed to converge when one agent gets a toxic prompt. It positions this as a measurable latency of toxicity that could proxy for real costs in teams or organizations.

The approach is straightforward and avoids the ethical barriers of human-subject work on conflict. Using multi-agent setups to run hundreds of controlled discussions and track argument counts to conclusion is a clean way to generate comparable data across conditions. The reproducibility angle also holds up since everything stays inside code rather than relying on live participants. That part is useful for anyone thinking about scalable ways to study social friction.

The central weakness is the unvalidated proxy. The claim rests on the assumption that instructing an LLM to be toxic produces the same interaction delays as real human incivility. LLMs tend to follow prompts literally, which can create repetitive or scripted extensions that may not match the variable, context-driven friction seen in actual adversarial exchanges. The abstract shows no calibration against human debate records, psychological studies on incivility, or checks on how much the result changes with different prompt wording. Without that, the 25% number risks being an artifact of the simulation design rather than a general mechanism.

The statistical claim also needs more support. Significance is asserted but the abstract gives no information on the test used, variance across runs, or how LLM stochasticity was handled. Those gaps make it difficult to judge robustness from what is shown.

This paper would interest researchers working on agent-based models for organizational or computational social science questions. A reader already using simulations to explore efficiency losses might borrow the framing or the Monte Carlo setup. It is not a finished result but has enough structure to be worth referee time.

I would send it to peer review so the methods and proxy can be examined directly.

Referee Report

3 major / 2 minor

Summary. The manuscript uses LLM-based multi-agent systems to simulate 1-on-1 adversarial debates via Monte Carlo methods over hundreds of runs. It compares convergence time, defined as the number of arguments required to reach a conclusion, between a baseline control condition and treatment conditions in which agents receive 'toxic' system prompts. The central result is a statistically significant increase of approximately 25% in conversation duration under toxic conditions, interpreted as a 'latency of toxicity' that proxies financial and operational damage in organizational settings. The work positions agent-based modeling as a reproducible, ethical substitute for human-subject research on social friction.

Significance. If the result holds after methodological clarification, the paper offers a scalable, controlled computational framework for quantifying interaction inefficiencies that are otherwise difficult to study ethically. The Monte Carlo approach and focus on measurable convergence provide a reproducible template that could be extended to other social dynamics. Credit is due for attempting to address a practically relevant question with simulation rather than direct human experimentation.

major comments (3)

[Methods] The description supplies no information on the statistical test used to support the claim of statistical significance, the number of Monte Carlo runs performed, variance or standard error across runs, the specific LLM model and sampling parameters (temperature, top-p, seed control), or the exact wording of the toxic versus control system prompts. These omissions prevent independent verification that the reported 25% increase is robust rather than sensitive to implementation details.
[Results] No tables, figures, or supplementary data report the distribution of convergence times, confidence intervals, or p-values underlying the 'approximately 25%' figure. Without these, it is impossible to assess whether the effect is consistent or whether stochastic LLM behavior could produce comparable differences under neutral prompts.
[Discussion] The central claim that toxic-prompted LLM agents validly model efficiency losses caused by real human incivility rests on an unvalidated proxy. No calibration against human debate transcripts, psychological incivility measures, or sensitivity analysis on prompt phrasing is described, leaving open the possibility that measured latency reflects prompt-following artifacts rather than generalizable social mechanisms.
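
One way to obtain the confidence interval requested under [Results] is a percentile bootstrap on the relative increase in mean argument count. The sketch below is an illustration under stated assumptions, not the paper's analysis. The function name `bootstrap_ci` and its defaults are hypothetical, and the two samples are invented numbers, not data from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(control, toxic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(toxic) / mean(control) - 1."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample each condition independently, with replacement.
        c = rng.choices(control, k=len(control))
        t = rng.choices(toxic, k=len(toxic))
        stats.append(mean(t) / mean(c) - 1.0)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Invented argument counts, for illustration only (not data from the paper).
control = [8, 10, 9, 11, 10, 12, 9, 10]
toxic = [11, 13, 12, 14, 12, 15, 11, 13]

point = mean(toxic) / mean(control) - 1.0
lo, hi = bootstrap_ci(control, toxic)
print(f"relative increase: {point:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

Repeating the same calculation on two samples drawn under neutral prompts would show how large a difference LLM stochasticity alone can produce, which is the concern raised in the comment.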

minor comments (2)

[Abstract] The abstract contains the typo 'hundrets' (should be 'hundreds').
[Introduction] Notation for 'latency of toxicity' is introduced without a formal definition or equation; a brief operationalization would improve clarity.
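
As an illustration of what such an operationalization could look like, here is one possible definition. It is an assumption consistent with the abstract's wording and is not taken from the paper; the symbols $\Lambda_{\text{tox}}$, $N_{c,r}$ and $R_c$ are hypothetical.

```latex
\[
\Lambda_{\text{tox}}
  \;=\; \frac{\bar{N}_{\text{toxic}} - \bar{N}_{\text{control}}}{\bar{N}_{\text{control}}},
\qquad
\bar{N}_{c} \;=\; \frac{1}{R_c} \sum_{r=1}^{R_c} N_{c,r},
\]
```

Here $N_{c,r}$ is the number of arguments exchanged until a conclusion is reached in run $r$ of condition $c \in \{\text{control}, \text{toxic}\}$, and $R_c$ is the number of Monte Carlo runs in that condition. Under this reading, the reported result corresponds to $\Lambda_{\text{tox}} \approx 0.25$.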

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the clarity, reproducibility, and interpretation of our work. We address each major comment below and outline the revisions we will make.

Referee: [Methods] The description supplies no information on the statistical test used to support the claim of statistical significance, the number of Monte Carlo runs performed, variance or standard error across runs, the specific LLM model and sampling parameters (temperature, top-p, seed control), or the exact wording of the toxic versus control system prompts. These omissions prevent independent verification that the reported 25% increase is robust rather than sensitive to implementation details.

Authors: We agree that these details are necessary for reproducibility. In the revised manuscript we will add a dedicated Methods subsection that specifies the LLM model and version, all sampling parameters (temperature, top-p, and seed settings), the precise system prompts for both the toxic and control conditions, the exact number of Monte Carlo runs performed, and the statistical procedure (Welch’s two-sample t-test) together with the resulting p-values, means, and standard errors.
revision: yes

Referee: [Results] No tables, figures, or supplementary data report the distribution of convergence times, confidence intervals, or p-values underlying the 'approximately 25%' figure. Without these, it is impossible to assess whether the effect is consistent or whether stochastic LLM behavior could produce comparable differences under neutral prompts.

Authors: We acknowledge the current absence of these supporting statistics. The revised version will include a new figure showing the distributions of convergence times under both conditions, a table reporting means, standard deviations, 95% confidence intervals, and p-values, and a brief discussion of how the Monte Carlo design and large run count address stochastic variability in LLM outputs.
revision: yes

Referee: [Discussion] The central claim that toxic-prompted LLM agents validly model efficiency losses caused by real human incivility rests on an unvalidated proxy. No calibration against human debate transcripts, psychological incivility measures, or sensitivity analysis on prompt phrasing is described, leaving open the possibility that measured latency reflects prompt-following artifacts rather than generalizable social mechanisms.

Authors: We accept that the proxy relationship between LLM behavior and human incivility is not yet calibrated. We will revise the Discussion to present the simulation explicitly as a controlled computational proxy rather than a direct analogue, add a limitations paragraph that notes the lack of human-data calibration, and outline planned sensitivity analyses on prompt phrasing. We maintain that the observed difference in convergence time under controlled toxic prompts is a useful starting point for quantifying interaction inefficiency, but we will moderate language to reflect the preliminary nature of the proxy.
revision: partial
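
The Welch's two-sample t-test named in the first response can be sketched with the standard library alone. The helper `welch_t` is a hypothetical name, and the two samples are made-up argument counts, not data from the paper. The sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom. Turning these into a p-value needs the t distribution's CDF, which a library routine such as `scipy.stats.ttest_ind(..., equal_var=False)` provides.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    v_a = variance(sample_a) / n_a
    v_b = variance(sample_b) / n_b
    t = (mean(sample_b) - mean(sample_a)) / sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


# Made-up argument counts, for illustration only (not data from the paper).
control = [1, 2, 3, 4, 5]
toxic = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(control, toxic)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")  # prints: t = 1.897, df = 5.88
```

Welch's variant is a natural fit here because the toxic condition may well have a different spread of convergence times than the control, and the test does not assume equal variances.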

Circularity Check

0 steps flagged

No significant circularity: direct simulation outputs

The paper runs Monte Carlo simulations of LLM agents under baseline and toxic-prompt conditions, then directly compares measured convergence times (number of arguments to conclusion) between groups. No equations, parameter fitting, or derivations are described that would reduce the reported 25% latency difference to a self-defined input or self-citation chain. The result is an empirical statistic from the simulation runs themselves rather than a prediction forced by construction from prior fitted values or author-defined uniqueness theorems. The choice of prompt wording is an experimental design decision, not a circular redefinition of the output metric.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the untested premise that LLM agents replicate human toxic interaction dynamics and on the author-chosen definition of toxic prompts; the latency-of-toxicity construct is introduced without external validation.

free parameters (1)

toxic system prompt wording
The specific instructions used to induce toxic behavior are selected by the authors and directly determine the treatment condition.

axioms (1)

domain assumption: LLM agents with modified system prompts behave analogously to humans in adversarial debates
This premise is required to treat the simulations as a sociological sandbox.

invented entities (1)

latency of toxicity (no independent evidence)
purpose: Proxy measure for financial and operational damage caused by incivility
New term coined to interpret the simulated increase in conversation length.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We employ a Monte Carlo method to simulate hundrets of discussions, measuring the convergence time... statistically significant increase of approximately 25% in the duration of conversations involving toxic participants."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "toxic system prompts produce interaction patterns that validly model the efficiency losses"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations
cs.AI · 2026-05 · unverdicted · novelty 5.0

Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of toxicity.

