EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

Camilo Chacón Sartori

arxiv: 2606.30256 · v1 · pith:TPP6OY7Qnew · submitted 2026-06-29 · 💻 cs.AI · cs.CY

Pith reviewed 2026-06-30 06:07 UTC · model grok-4.3

keywords safety evaluation · emotional support chatbots · multilingual benchmark · multi-turn conversations · auditor judge method · score inflation · run-to-run reliability · crisis handling

The pith

A strict per-criterion rubric applied to multi-turn transcripts exposes score inflation on ten of nineteen safety metrics, and a test-retest study shows that run-to-run consistency is a model-specific safety property.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark that generates full conversations by having one model act as a help-seeking user across 140 seed instructions and 34 personas, then has a second model from a different family score each complete transcript on nineteen metrics grouped into five safety dimensions. It shows that switching from the standard rubric to a strict per-criterion version reveals score inflation on ten metrics and improves the ability to distinguish models. Aggregate scores across three frontier models stay within 0.74 points of one another, yet their per-metric profiles differ by as much as six points in specific areas. A five-run test-retest further shows that even the steadiest model swings by eight points (from 2 to 10) on a crisis metric across identical re-runs, while deepseek-v4-pro returns a different conversation on every run, even at temperature zero. The pipeline, seeds, personas, and rubrics are released so the method can be applied to other systems.

Core claim

The central claim is that an auditor model generates multi-turn crisis conversations in Mexican Spanish from fixed seeds and personas, and a judge model from a separate family scores the full transcripts on nineteen metrics across crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. A strict per-criterion rubric reveals material score inflation on ten of those metrics and restores discrimination. Run-to-run tests establish that reliability itself varies by model and is not mere noise.
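To make the idea of a strict per-criterion rubric concrete, here is a minimal sketch. It assumes the judge returns a binary verdict per sub-criterion and that the metric score is a linear map of the satisfied fraction onto the 1–10 scale. The sub-criterion names and the linear mapping are illustrative assumptions, not the paper's released rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCriterion:
    name: str        # hypothetical label, not taken from the released rubrics
    satisfied: bool  # the judge's binary verdict for this sub-criterion


def strict_score(criteria: list[SubCriterion], low: int = 1, high: int = 10) -> float:
    """Map the fraction of satisfied binary sub-criteria onto [low, high].

    The linear mapping is an assumption made for illustration only.
    """
    if not criteria:
        raise ValueError("at least one sub-criterion is required")
    fraction = sum(c.satisfied for c in criteria) / len(criteria)
    return low + fraction * (high - low)


# Hypothetical verdicts for one metric on one transcript.
checks = [
    SubCriterion("acknowledges stated risk", True),
    SubCriterion("asks a direct safety question", False),
    SubCriterion("offers a locale-appropriate resource", True),
    SubCriterion("avoids minimising language", False),
]
score = strict_score(checks)  # 2 of 4 satisfied -> 1 + 0.5 * 9 = 5.5
```

Because every point must be earned by an explicit sub-criterion, a holistic "looks fine, 9/10" judgment has nowhere to hide, which is one plausible reading of why the strict rubric separates models that the standard rubric scores alike.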

What carries the argument

The auditor-judge pipeline in which one model role-plays users to produce complete transcripts and a second model scores them against calibrated per-criterion rubrics on five safety dimensions.
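The shape of that pipeline can be sketched in a few lines. This is a hedged illustration, not the released code: the `Model` type, the prompt strings, the function names, and the stub models are all assumptions, and a real run would replace the stubs with API clients from two different model families.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: a "model" is any callable from a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class Transcript:
    seed: str
    persona: str
    turns: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)


def run_conversation(auditor: Model, target: Model, seed: str,
                     persona: str, max_turns: int = 3) -> Transcript:
    """The auditor role-plays the user; the target is the chatbot under test."""
    transcript = Transcript(seed, persona)
    history = ""
    for _ in range(max_turns):
        user_msg = auditor(f"SEED: {seed}\nPERSONA: {persona}\nHISTORY:\n{history}")
        transcript.turns.append(("user", user_msg))
        history += f"user: {user_msg}\n"
        reply = target(history)
        transcript.turns.append(("assistant", reply))
        history += f"assistant: {reply}\n"
    return transcript


def judge_transcript(judge: Model, transcript: Transcript,
                     metrics: list[str]) -> dict[str, int]:
    """Score the complete transcript (not single turns) once per metric."""
    full_text = "\n".join(f"{speaker}: {text}" for speaker, text in transcript.turns)
    scores = {}
    for metric in metrics:
        raw = judge(f"METRIC: {metric}\nTRANSCRIPT:\n{full_text}\nScore 1-10:")
        scores[metric] = int(raw.strip())
    return scores


# Stub models so the sketch runs offline; all three are placeholders.
def stub_auditor(prompt: str) -> str:
    return "I have been feeling overwhelmed lately."


def stub_target(history: str) -> str:
    return "I'm sorry to hear that. Can you tell me more?"


def stub_judge(prompt: str) -> str:
    return "7"


transcript = run_conversation(stub_auditor, stub_target, "seed-001", "persona-01", max_turns=2)
scores = judge_transcript(stub_judge, transcript, ["risk_trajectory", "dependency"])
```

Keeping generation and scoring as separate functions is what allows the judge to be swapped for one from another family without regenerating transcripts.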

If this is right

Models that rank similarly on aggregate scores can still be separated by examining their distinct weak spots on individual metrics.
Safety evaluations of emotional-support systems must treat conversation-to-conversation consistency as a measurable property rather than averaging it away.
The released seeds, personas, and rubrics allow direct reuse or extension to additional languages without rebuilding the generation and scoring steps.
Cross-family judge agreement remains high under the standard rubric, with 93 percent of scores falling within one point.
Per-metric profiles provide actionable guidance on where each model fails even when overall numbers appear acceptable.
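The agreement figures quoted above (share of scores within one point, and the correlation reported in the Figure 5 caption) are simple to compute once two judges have scored the same transcripts. The sketch below uses made-up scores; the function names and sample data are illustrative assumptions.

```python
from math import sqrt


def within_tolerance(a: list[float], b: list[float], tol: float = 1) -> float:
    """Fraction of paired scores that differ by at most `tol` points."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(x - y) <= tol for x, y in zip(a, b)) / len(a)


def pearson_r(a: list[float], b: list[float]) -> float:
    """Pearson correlation coefficient between two paired score lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_a * var_b)


# Hypothetical scores from two judges on six transcripts.
judge_a = [9, 8, 10, 7, 6, 9]
judge_b = [8, 8, 9, 5, 6, 8]

agreement = within_tolerance(judge_a, judge_b)  # 5 of 6 pairs within one point
correlation = pearson_r(judge_a, judge_b)
```

High values on both measures show that two judges order transcripts similarly; as the Figure 5 caption notes, that bounds a self-preference explanation but does not certify validity.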

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks that ignore multi-run variability may systematically overstate safety for models whose outputs fluctuate sharply.
Developers could add explicit consistency checks during deployment for dimensions that show high run-to-run swing.
The method could be applied to test whether cultural adaptation metrics behave differently when the same seeds are translated and run in English.
Future work might examine whether models with high run-to-run variance on crisis metrics also show variance on therapeutic quality.
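A consistency check of the kind suggested above could look like the following sketch. The metric names, the threshold of three points, and all scores except the 2-to-10 swing mentioned in the abstract are hypothetical.

```python
from statistics import mean, pstdev


def run_to_run_summary(runs: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Per-metric mean, range, and population standard deviation across re-runs."""
    return {
        metric: {
            "mean": mean(scores),
            "range": max(scores) - min(scores),
            "stdev": pstdev(scores),
        }
        for metric, scores in runs.items()
    }


def flag_unstable(summary: dict[str, dict[str, float]], max_range: float = 3.0) -> list[str]:
    """Metrics whose score range across identical re-runs exceeds `max_range`."""
    return sorted(m for m, stats in summary.items() if stats["range"] > max_range)


# Five identical re-runs per metric. The 2 and 10 mirror the swing reported
# in the abstract; every other value is made up for illustration.
five_runs = {
    "crisis_metric": [2, 10, 9, 10, 8],
    "dependency": [8, 8, 9, 8, 8],
}

summary = run_to_run_summary(five_runs)
unstable = flag_unstable(summary)  # ["crisis_metric"]
```

Reporting the range alongside the mean keeps a single bad run visible instead of letting it be averaged away.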

Load-bearing premise

The judge model drawn from a different family can be calibrated so its scores validly reflect the five safety dimensions without introducing systematic bias that calibration cannot remove.

What would settle it

Human experts rating the same transcripts produce scores that do not align with the calibrated judge on the ten metrics previously flagged as inflated.

Figures

Figures reproduced from arXiv: 2606.30256 by Camilo Chacón Sartori.

**Figure 1.** The EMPATH pipeline. From a seed instruction and persona, the auditor roleplays a help-seeking user across a multi-turn conversation in which risk can escalate; the judge then scores the complete transcript, not individual turns, against the 19 metrics with quoted evidence. Auditor and judge are drawn from different model families.

**Figure 2.** The EMPATH metric taxonomy: five dimensions and 19 metrics. Leaf glyphs mark provenance: 8 metrics are introduced by EMPATH (stars), and the new metrics concentrate where existing benchmarks are thinnest: cultural adaptation, boundaries, and conversation-level safety (risk trajectory, sensitive-context reintroduction, dependency).

**Figure 3.** Judge calibration (S1) over the 57 grid conversations: per-metric means under the standard 1–10 rubric (circles) and the strict binary sub-criteria rubric (squares); faint marks are individual probes. The strict rubric does not merely deflate: it separates, with ten interpretive metrics dropping by 1–4 points while five consolidate upward. Separation, not shift, is what restores the instrument’s discrimination.

**Figure 4.** S2: the 19-metric × three-model grid (judge A, standard rubric, whose leniency is quantified in S1; one audited conversation per metric per model). Aggregates nearly tie (8.79/8.63/8.05) while per-metric profiles diverge by up to six points, so the aggregate conceals where each model is weak. These cells are single-draw and not significance-tested: the test-retest (Sec. 4) is what separates reproducible soft …

**Figure 5.** Judge A (Anthropic) vs. judge B (OpenAI) on the same 57 transcripts. Judge B is stricter (points sit below the diagonal), but agreement is high (93% within ±1, r = 0.84) and the standard-rubric model ranking is identical under both judges. Cross-family agreement bounds the self-preference explanation; it does not certify validity.

Original abstract

Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for safety evaluation of emotional-support chatbots. An auditor model role-plays help-seeking users, generating multi-turn conversations from 140 seed instructions and 34 personas. A judge model scores each full transcript against 19 metrics across five dimensions: crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. EMPATH is built for Mexican Spanish and US English; the studies reported here run in Mexican Spanish. Auditor and judge are drawn from different model families, and the judge is treated as an instrument to be calibrated rather than trusted. A strict per-criterion rubric reveals material score inflation on 10 of the 19 metrics and restores discrimination. We study the measurement properties of the benchmark through judge calibration and cross-family inter-judge agreement. We also illustrate EMPATH on three frontier models, one of them open-weight. Aggregate scores sit within 0.74 points of one another, but per-metric profiles diverge by up to six points in model-specific places. Under the standard rubric, both the ranking and the weak spots are stable across a second, cross-family judge: 93% of scores fall within plus or minus 1. A five-run test-retest adds a second axis: even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs, and deepseek-v4-pro returns a different conversation on every run even at temperature 0. Run-to-run reliability is therefore a per-model safety property, not noise to average away. EMPATH is system-agnostic; the pipeline, seeds, personas, and rubrics are released for reuse.
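The gap between aggregate and per-metric views can be checked with a little arithmetic. In the sketch below, the three aggregate values are the ones quoted in the Figure 4 caption; the two per-metric profiles, the model names, and the metric names are invented solely to show how a small aggregate gap can coexist with a large per-metric gap.

```python
# Aggregate scores as quoted in the Figure 4 caption.
aggregates = [8.79, 8.63, 8.05]
aggregate_spread = round(max(aggregates) - min(aggregates), 2)  # 0.74


def max_per_metric_gap(profiles: dict[str, dict[str, float]]) -> tuple[str, float]:
    """Return the metric with the largest between-model gap, and that gap."""
    metrics = next(iter(profiles.values())).keys()
    gaps = {
        m: max(p[m] for p in profiles.values()) - min(p[m] for p in profiles.values())
        for m in metrics
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Hypothetical profiles: the means differ by about 0.33, yet one metric differs by 6.
profiles = {
    "model_x": {"m1": 9, "m2": 9, "m3": 4},
    "model_y": {"m1": 9, "m2": 3, "m3": 9},
}

means = {name: sum(p.values()) / len(p) for name, p in profiles.items()}
worst_metric, worst_gap = max_per_metric_gap(profiles)  # ("m2", 6)
```

The spread of 0.74 computed from the caption's aggregates matches the figure stated in the abstract, which is a useful sanity check on the reported numbers.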

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EMPATH gives a multi-turn auditor-judge benchmark for emotional-support chatbot safety that flags score inflation and per-model run-to-run variability, with artifacts released.

EMPATH sets up an auditor model to generate multi-turn help-seeking conversations from 140 seeds and 34 personas, then has a judge from a different family score the full transcripts on 19 metrics in five dimensions. The studies are in Mexican Spanish. The central move is treating the judge as a calibratable instrument rather than an oracle, then showing that a strict per-criterion rubric cuts material inflation on 10 of the 19 metrics and restores discrimination between models.

What stands out is the explicit study of measurement properties: under cross-family inter-judge comparison, 93 percent of scores fall within one point, and a five-run test-retest shows that even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs. Releasing the pipeline, seeds, personas, and rubrics makes the work reusable. Aggregate scores on three frontier models sit within 0.74 points of one another, but the per-metric profiles diverge by up to six points in model-specific ways.

The abstract already flags the main risk—that the judge could carry uncorrectable bias—and addresses it through calibration and cross-family checks, so the weakest assumption is not hidden. Without the full methods and tables it is still hard to judge how well the calibration actually corrects bias on each dimension or how representative the 34 personas are for real crisis talk. Those are the usual empirical questions a referee would press.

The paper is for groups building or auditing emotional-support systems, especially outside English. It is concrete enough and grounded enough to deserve peer review rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper introduces EMPATH, a benchmark for safety evaluation of emotional-support chatbots in multilingual, multi-turn settings. An auditor model generates conversations from 140 seed instructions and 34 personas; the benchmark is built for Mexican Spanish and US English, and the reported studies run in Mexican Spanish. A separate judge model scores full transcripts on 19 metrics spanning crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. The author reports that a strict per-criterion rubric exposes material score inflation on 10 of 19 metrics, that 93% of scores from a second, cross-family judge fall within one point, and that run-to-run variability (e.g., 2-to-10 swings on a crisis metric) is a model-specific safety property rather than noise to average away. The pipeline, seeds, personas, and rubrics are released.

Significance. If the empirical results hold, EMPATH supplies a more ecologically valid instrument for safety assessment of emotional-support systems by moving beyond fixed single-turn prompts. The explicit treatment of the judge as a calibratable instrument, the high cross-family agreement, the release of all components, and the demonstration that reliability is per-model rather than noise constitute concrete contributions to the field.

minor comments (3)

The abstract states that the judge is 'treated as an instrument to be calibrated'; the methods section should include the exact calibration procedure and any quantitative evidence that calibration removes systematic bias on the five dimensions.
The claim that 'aggregate scores sit within 0.74 points' while 'per-metric profiles diverge by up to six points' would benefit from an explicit table or figure showing the per-model, per-metric scores for the three evaluated systems.
The five-run test-retest is described only for crisis metrics; a summary statistic (e.g., standard deviation or range) across all 19 metrics for each model would strengthen the run-to-run reliability claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of EMPATH and for the positive assessment of its significance. The recommendation of minor revision is noted; we will prepare a revised manuscript accordingly. No major comments appear in the report, so we provide no point-by-point responses.

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents an empirical benchmark construction (auditor-judge pipeline, rubrics, calibration studies) rather than any derivation chain, first-principles result, or predictive model whose outputs reduce to fitted parameters or self-citations by construction. No equations, ansatzes, or uniqueness theorems appear; the central claims (score inflation under strict rubric, run-to-run variability as a model property) rest on direct measurement and cross-judge agreement, which are externally falsifiable. Self-citation is absent from load-bearing steps. This matches the default expectation for non-derivational empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that the chosen 19 metrics and five dimensions adequately capture safety in emotional-support conversations and that cross-family model separation plus rubric calibration can produce valid scores. No free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption The 19 metrics across the five dimensions validly measure crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation.
The paper defines the scoring task using these metrics without external validation cited in the abstract.
domain assumption Using auditor and judge models from different families plus strict rubric calibration removes material bias from the evaluation.
The abstract states the judge is treated as a calibratable instrument rather than trusted outright.

pith-pipeline@v0.9.1-grok · 5879 in / 1475 out tokens · 31700 ms · 2026-06-30T06:07:31.651552+00:00 · methodology
