
arxiv: 2606.30256 · v1 · pith:TPP6OY7Qnew · submitted 2026-06-29 · 💻 cs.AI · cs.CY

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

Pith reviewed 2026-06-30 06:07 UTC · model grok-4.3

keywords: safety evaluation, emotional support chatbots, multilingual benchmark, multi-turn conversations, auditor judge method, score inflation, run-to-run reliability, crisis handling

The pith

A strict per-criterion rubric on multi-turn transcripts exposes score inflation on ten of nineteen safety metrics and treats run-to-run consistency as a model-specific safety property.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark that generates full conversations by having one model act as a help-seeking user, drawing on 140 seed instructions and 34 personas, and then has a second model from a different family score each complete transcript on nineteen metrics grouped into five safety dimensions. Switching from the standard rubric to a strict per-criterion version reveals score inflation on ten metrics and restores the ability to distinguish models. Aggregate scores across three frontier models sit within 0.74 points of one another, yet their per-metric profiles differ by as much as six points in model-specific places. A five-run test-retest shows that even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs, while deepseek-v4-pro returns a different conversation on every run even at temperature 0. The pipeline, seeds, personas, and rubrics are released so the method can be applied to other systems.

Core claim

The central claim has three parts. An auditor model generates multi-turn crisis conversations in Mexican Spanish from fixed seeds and personas, and a judge model from a separate family scores the full transcripts on nineteen metrics across crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. A strict per-criterion rubric reveals material score inflation on ten of those metrics while restoring discrimination. Run-to-run tests establish that reliability itself varies by model and is not mere noise.
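
A minimal sketch of how a strict per-criterion rubric can be scored, assuming each metric is split into binary sub-criteria (as the Figure 3 caption describes) and that the fraction of satisfied criteria is mapped linearly onto the 1–10 scale. The linear mapping, the function name `strict_score`, and the example sub-criteria are illustrative assumptions, not the paper's rubric.

```python
def strict_score(criteria_met: dict[str, bool], lo: int = 1, hi: int = 10) -> float:
    """Map the fraction of satisfied binary sub-criteria onto [lo, hi]."""
    if not criteria_met:
        raise ValueError("at least one sub-criterion is required")
    fraction = sum(criteria_met.values()) / len(criteria_met)
    return lo + fraction * (hi - lo)


# Hypothetical sub-criteria for one crisis-handling metric.
example = {
    "acknowledges_risk": True,
    "asks_about_safety": True,
    "offers_local_resource": False,
    "avoids_minimizing": True,
}
print(strict_score(example))  # 3 of 4 met -> 1 + 0.75 * 9 = 7.75
```

Because every point must be earned by a named criterion, a judge cannot award a high holistic score to a response that merely sounds supportive, which is one plausible reading of why interpretive metrics drop under the strict rubric.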

What carries the argument

The auditor-judge pipeline in which one model role-plays users to produce complete transcripts and a second model scores them against calibrated per-criterion rubrics on five safety dimensions.
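
The control flow of such a pipeline might look like the sketch below. It is not the released code: the function `run_audit`, the callable signatures, the stub models, and the metric name `crisis_detection` are all hypothetical, chosen only to show that the judge receives the complete transcript rather than single turns.

```python
from dataclasses import dataclass
from typing import Callable

Turn = tuple[str, str]  # (role, text)


@dataclass
class Audit:
    transcript: list[Turn]
    scores: dict[str, int]


def run_audit(
    seed: str,
    persona: str,
    auditor: Callable[[str, str, list[Turn]], str],
    target: Callable[[list[Turn]], str],
    judge: Callable[[list[Turn], list[str]], dict[str, int]],
    metrics: list[str],
    max_turns: int = 3,
) -> Audit:
    transcript: list[Turn] = []
    for _ in range(max_turns):
        # The auditor role-plays the help-seeking user, conditioned on history.
        transcript.append(("user", auditor(seed, persona, transcript)))
        # The system under test replies.
        transcript.append(("assistant", target(transcript)))
    # The judge scores the whole conversation once, not turn by turn.
    return Audit(transcript, judge(transcript, metrics))


# Stub models so the sketch runs offline; real model calls would replace these.
def stub_auditor(seed, persona, history):
    return f"[{persona}] turn {len(history) // 2 + 1}: {seed}"


def stub_target(history):
    return "I hear you. Are you safe right now?"


def stub_judge(history, metrics):
    return {m: 8 for m in metrics}


audit = run_audit("escalating distress", "student", stub_auditor,
                  stub_target, stub_judge, ["crisis_detection"], max_turns=2)
print(len(audit.transcript), audit.scores)
```

Passing the three roles in as callables keeps the auditor and judge swappable, which is what lets them be drawn from different model families.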

If this is right

  • Models that rank similarly on aggregate scores can still be separated by examining their distinct weak spots on individual metrics.
  • Safety evaluations of emotional-support systems must treat conversation-to-conversation consistency as a measurable property rather than averaging it away.
  • The released seeds, personas, and rubrics allow direct reuse or extension to additional languages without rebuilding the generation and scoring steps.
  • Cross-family judge agreement remains high under the standard rubric, with 93 percent of scores falling within one point.
  • Per-metric profiles provide actionable guidance on where each model fails even when overall numbers appear acceptable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that ignore multi-run variability may systematically overstate safety for models whose outputs fluctuate sharply.
  • Developers could add explicit consistency checks during deployment for dimensions that show high run-to-run swing.
  • The method could be applied to test whether cultural adaptation metrics behave differently when the same seeds are translated and run in English.
  • Future work might examine whether models with high run-to-run variance on crisis metrics also show variance on therapeutic quality.
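
A consistency check of the kind suggested above could be as simple as the sketch below, which computes the per-metric range across repeated runs and flags metrics whose spread exceeds a threshold. The threshold of 2, the metric names, and the scores are illustrative assumptions.

```python
def run_to_run_range(runs: list[dict[str, int]]) -> dict[str, int]:
    """Per-metric max minus min across repeated identical runs."""
    if not runs:
        raise ValueError("at least one run is required")
    return {
        metric: max(r[metric] for r in runs) - min(r[metric] for r in runs)
        for metric in runs[0]
    }


def unstable_metrics(runs: list[dict[str, int]], max_range: int = 2) -> list[str]:
    """Names of metrics whose run-to-run range exceeds max_range."""
    return sorted(m for m, spread in run_to_run_range(runs).items() if spread > max_range)


# Made-up scores for five repeated runs of one model on the same seed.
runs = [
    {"crisis_metric": 10, "empathy": 9},
    {"crisis_metric": 2, "empathy": 8},
    {"crisis_metric": 9, "empathy": 9},
    {"crisis_metric": 10, "empathy": 9},
    {"crisis_metric": 7, "empathy": 8},
]
print(run_to_run_range(runs))   # {'crisis_metric': 8, 'empathy': 1}
print(unstable_metrics(runs))   # ['crisis_metric']
```

The range is used here instead of the mean because a single low-scoring run on a crisis metric is the event of interest, and averaging would hide it.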

Load-bearing premise

The judge model drawn from a different family can be calibrated so its scores validly reflect the five safety dimensions without introducing systematic bias that calibration cannot remove.

What would settle it

Human experts rating the same transcripts produce scores that do not align with the calibrated judge on the ten metrics previously flagged as inflated.

Figures

Figures reproduced from arXiv: 2606.30256 by Camilo Chacón Sartori.

Figure 1: The EMPATH pipeline. From a seed instruction and persona, the auditor role-plays a help-seeking user across a multi-turn conversation in which risk can escalate; the judge then scores the complete transcript, not individual turns, against the 19 metrics with quoted evidence. Auditor and judge are drawn from different model families.
Figure 2: The EMPATH metric taxonomy: five dimensions and 19 metrics. Leaf glyphs mark provenance. 8 metrics are introduced by EMPATH (stars), and the new metrics concentrate where existing benchmarks are thinnest: cultural adaptation, boundaries, and conversation-level safety (risk trajectory, sensitive-context reintroduction, dependency).
Figure 3: Judge calibration (S1) over the 57 grid conversations: per-metric means under the standard 1–10 rubric (circles) and the strict binary sub-criteria rubric (squares); faint marks are individual probes. The strict rubric does not merely deflate, it separates: ten interpretive metrics drop by 1–4 points while five consolidate upward. Separation, not shift, is what restores the instrument's discrimination.
Figure 4: S2: the 19-metric × three-model grid (judge A, standard rubric, whose leniency is quantified in S1; one audited conversation per metric per model). Aggregates nearly tie (8.79/8.63/8.05) while per-metric profiles diverge by up to six points, so the aggregate conceals where each model is weak. These cells are single-draw and not significance-tested: the test-retest (Sec. 4) is what separates reproducible soft …
Figure 5: Judge A (Anthropic) vs. judge B (OpenAI) on the same 57 transcripts. Judge B is stricter (points sit below the diagonal), but agreement is high (93% within ±1, r = 0.84) and the standard-rubric model ranking is identical under both judges. Cross-family agreement bounds the self-preference explanation; it does not certify validity.
Abstract

Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for safety evaluation of emotional-support chatbots. An auditor model role-plays help-seeking users, generating multi-turn conversations from 140 seed instructions and 34 personas. A judge model scores each full transcript against 19 metrics across five dimensions: crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. EMPATH is built for Mexican Spanish and US English; the studies reported here run in Mexican Spanish. Auditor and judge are drawn from different model families, and the judge is treated as an instrument to be calibrated rather than trusted. A strict per-criterion rubric reveals material score inflation on 10 of the 19 metrics and restores discrimination. We study the measurement properties of the benchmark through judge calibration and cross-family inter-judge agreement. We also illustrate EMPATH on three frontier models, one of them open-weight. Aggregate scores sit within 0.74 points of one another, but per-metric profiles diverge by up to six points in model-specific places. Under the standard rubric, both the ranking and the weak spots are stable across a second, cross-family judge: 93% of scores fall within plus or minus 1. A five-run test-retest adds a second axis: even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs, and deepseek-v4-pro returns a different conversation on every run even at temperature 0. Run-to-run reliability is therefore a per-model safety property, not noise to average away. EMPATH is system-agnostic; the pipeline, seeds, personas, and rubrics are released for reuse.
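
The contrast between close aggregates and divergent per-metric profiles can be made concrete with a small sketch. The Figure 4 caption reports aggregates of 8.79, 8.63, and 8.05, whose spread is 8.79 − 8.05 = 0.74; the per-metric numbers below are invented for illustration, and the model and metric names are placeholders.

```python
from statistics import mean

Scores = dict[str, dict[str, float]]  # model -> metric -> score


def aggregate_gap(scores: Scores) -> float:
    """Spread between the highest and lowest per-model mean score."""
    means = [mean(per_metric.values()) for per_metric in scores.values()]
    return max(means) - min(means)


def largest_metric_gap(scores: Scores) -> tuple[str, float]:
    """Metric with the widest spread across models, and that spread."""
    metrics = next(iter(scores.values())).keys()
    gaps = {
        m: max(s[m] for s in scores.values()) - min(s[m] for s in scores.values())
        for m in metrics
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Invented scores: similar averages, very different weak spots.
scores = {
    "model_x": {"m1": 10, "m2": 3, "m3": 9},
    "model_y": {"m1": 5, "m2": 9, "m3": 9},
    "model_z": {"m1": 8, "m2": 8, "m3": 8},
}
print(round(aggregate_gap(scores), 2))  # under one point
print(largest_metric_gap(scores))       # ('m2', 6)
```

Reporting both numbers side by side shows directly how much of the per-metric disagreement the aggregate hides.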

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces EMPATH, a benchmark for safety evaluation of emotional-support chatbots in multilingual, multi-turn settings. An auditor model generates conversations from 140 seed instructions and 34 personas in Mexican Spanish; a separate judge model scores full transcripts on 19 metrics spanning crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. The authors report that a strict per-criterion rubric exposes material score inflation on 10 of 19 metrics, that 93% of scores from a second, cross-family judge fall within ±1 point of the first, and that run-to-run variability (e.g., 2-to-10 swings on a crisis metric) is a model-specific safety property rather than noise to average away. The pipeline, seeds, personas, and rubrics are released.

Significance. If the empirical results hold, EMPATH supplies a more ecologically valid instrument for safety assessment of emotional-support systems by moving beyond fixed single-turn prompts. The explicit treatment of the judge as a calibratable instrument, the high cross-family agreement, the release of all components, and the demonstration that reliability is per-model rather than noise constitute concrete contributions to the field.

minor comments (3)
  1. The abstract states that the judge is 'treated as an instrument to be calibrated'; the methods section should include the exact calibration procedure and any quantitative evidence that calibration removes systematic bias on the five dimensions.
  2. The claim that 'aggregate scores sit within 0.74 points' while 'per-metric profiles diverge by up to six points' would benefit from an explicit table or figure showing the per-model, per-metric scores for the three evaluated systems.
  3. The five-run test-retest is described only for crisis metrics; a summary statistic (e.g., standard deviation or range) across all 19 metrics for each model would strengthen the run-to-run reliability claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of EMPATH and for the positive assessment of its significance. The recommendation of minor revision is noted; we will prepare a revised manuscript accordingly. No major comments appear in the report, so we provide no point-by-point responses.

Circularity Check

0 steps flagged

No significant circularity identified


The paper presents an empirical benchmark construction (auditor-judge pipeline, rubrics, calibration studies) rather than any derivation chain, first-principles result, or predictive model whose outputs reduce to fitted parameters or self-citations by construction. No equations, ansatzes, or uniqueness theorems appear; the central claims (score inflation under strict rubric, run-to-run variability as a model property) rest on direct measurement and cross-judge agreement, which are externally falsifiable. Self-citation is absent from load-bearing steps. This matches the default expectation for non-derivational empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that the chosen 19 metrics and five dimensions adequately capture safety in emotional-support conversations and that cross-family model separation plus rubric calibration can produce valid scores. No free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: The 19 metrics across the five dimensions validly measure crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation.
    The paper defines the scoring task using these metrics without external validation cited in the abstract.
  • domain assumption: Using auditor and judge models from different families plus strict rubric calibration removes material bias from the evaluation.
    The abstract states the judge is treated as a calibratable instrument rather than trusted outright.


