pith. machine review for the scientific record.

arxiv: 2603.22913 · v2 · submitted 2026-03-24 · 💻 cs.CL

Recognition: no theorem link

Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual dataset · counseling dialogues · LLM ensemble · machine translation · dialogue corpus · human preference evaluation

The pith

A multi-LLM ensemble method produces counseling translations that humans prefer over any single state-of-the-art model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the shortage of high-quality multilingual counseling dialogue data by translating an existing large Japanese counseling corpus into English and Chinese. It introduces a new ensemble technique that first collects translation candidates from several different LLMs and then has one LLM review their individual strengths and weaknesses to generate a final output. Human preference tests showed that these ensemble translations were chosen over those from any individual top-performing LLM. This matters for counseling applications because emotional nuance and therapeutic accuracy can vary across models, and a single model cannot consistently deliver the best result on every input. The resulting dataset is released to support further work on multilingual counseling systems.

Core claim

The authors translated the Japanese KokoroChat counseling corpus into English and Chinese using a multi-LLM ensemble approach: multiple LLMs first produce diverse translation hypotheses, after which a single LLM analyzes the strengths and weaknesses of those hypotheses to create a final high-quality translation. Human preference studies confirmed that the ensemble outputs were preferred over translations from any individual state-of-the-art LLM.

What carries the argument

The multi-LLM ensemble method, which generates diverse hypotheses from several LLMs and then tasks one LLM with analyzing their respective strengths and weaknesses to produce the final translation.
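
To make this concrete, here is a minimal sketch of the two-stage procedure, assuming a hypothetical chat-completion helper call_llm and placeholder model names (none of these identifiers come from the paper); the stage-two instructions paraphrase the analysis-then-refinement prompt quoted in the paper's prompt table.

```python
# Minimal sketch of the two-stage multi-LLM ensemble translation, as an
# illustration of the method described above, not the authors' code.

CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model IDs
SYNTHESIZER_MODEL = "model-a"                         # hypothetical choice


def call_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for whatever chat-completion client is used."""
    raise NotImplementedError("wire up an LLM client here")


def ensemble_translate(source_text: str, target_lang: str) -> str:
    # Stage 1: collect diverse translation hypotheses from several LLMs.
    hypotheses = [
        call_llm(
            model,
            f"Translate this counseling utterance into {target_lang}, "
            f"preserving emotional nuance and therapeutic intent:\n{source_text}",
        )
        for model in CANDIDATE_MODELS
    ]

    # Stage 2: a single LLM analyzes each candidate's strengths and
    # weaknesses, then synthesizes one improved translation (paraphrasing
    # the paper's analysis-then-refinement prompt).
    numbered = "\n".join(f"[{i + 1}] {h}" for i, h in enumerate(hypotheses))
    synthesis_prompt = (
        f"Source (Japanese): {source_text}\n"
        f"Candidate {target_lang} translations:\n{numbered}\n\n"
        "Compare the candidates: describe specifically which parts of each "
        "are superior and which need improvement. Then synthesize a revised "
        "translation that combines their strengths, corrects the weaknesses "
        "you identified, and keeps terminology consistent."
    )
    return call_llm(SYNTHESIZER_MODEL, synthesis_prompt)
```

The load-bearing design choice is that the synthesizer sees all hypotheses side by side, so input-dependent weaknesses of any single model can be corrected case by case rather than fixed by picking one winner.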

If this is right

  • Higher-fidelity multilingual counseling datasets become feasible without relying on any one LLM.
  • Ensemble methods can mitigate input-dependent performance variation across translation models.
  • The released dataset supports training and evaluation of counseling dialogue systems in English and Chinese.
  • Similar ensemble strategies may improve translation quality in other high-stakes domains requiring consistent nuance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on additional languages to check whether the preference advantage holds beyond English and Chinese.
  • Future experiments might isolate whether the ensemble specifically reduces errors in emotional expression that single models introduce.
  • The method suggests a general way to combine model diversity for tasks where no single model dominates every case.

Load-bearing premise

That general human preference ratings for translation quality will accurately reflect the specific requirements of emotional nuance and therapeutic accuracy needed in counseling dialogues.

What would settle it

A follow-up rating study by licensed counselors that scores the translations specifically on preservation of emotional tone and therapeutic intent and finds no consistent advantage for the ensemble over the best single model.

Figures

Figures reproduced from arXiv: 2603.22913 by Michimasa Inaba, Ryoma Suzuki, Zhiyang Qi.

Figure 1
Figure 1. Proposed multi-LLM ensemble translation method.
Figure 2
Figure 2. Human evaluation results for Japanese-to-English translation. Proposed vs GPT: 83.0% win / 1.0% tie / 16.0% lose; vs Gemini: 72.5% / 3.0% / 24.5%; vs Qwen: 84.5% / 1.5% / 14.0%; vs All: 80.0% / 1.8% / 18.2%.
Figure 3
Figure 3. Human evaluation results for Japanese-to-Chinese translation.
Figure 4
Figure 4. Comparison of a Japanese-to-English translation where the proposed method was judged inferior.
Figure 5
Figure 5. Japanese-to-Chinese translation example 1, demonstrating the analysis and synthesis process.
Figure 6
Figure 6. Japanese-to-Chinese translation example 2, demonstrating the analysis and synthesis process that produced a translation judged inferior by human evaluators to Qwen's hypothesis.
Figure 7
Figure 7. A case where a contextually natural translation is penalized by utterance-level evaluation.
Figure 8
Figure 8. A case where a contextually natural translation is penalized by utterance-level evaluation.
Original abstract

To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred from any individual state-of-the-art LLM. This strong preference confirms the superior quality of our method's outputs. The Multilingual KokoroChat is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to address the scarcity of high-quality multilingual counseling dialogue datasets by translating the Japanese KokoroChat corpus into English and Chinese using a novel multi-LLM ensemble translation method. This method generates diverse translation hypotheses from multiple LLMs and then uses a single LLM to synthesize a high-quality output by analyzing the strengths and weaknesses of the hypotheses. The superior quality of the resulting Multilingual KokoroChat dataset is asserted based on human preference studies that show the ensemble translations are preferred over those from any individual state-of-the-art LLM.

Significance. Should the human evaluation results prove robust, this contribution would be significant for the field of multilingual NLP and AI for mental health. It provides a publicly available dataset that could facilitate research on cross-lingual counseling dialogues and the development of more culturally sensitive AI systems. The ensemble method offers a practical solution to the variability in LLM performance for sensitive translation tasks.

major comments (2)
  1. [Abstract] The claim that the translations were 'rigorously validated through human preference studies' is not supported by any details on the experimental design, including the number of participants, the specific rating criteria used, statistical analysis, or measures to control for bias. This omission makes it difficult to assess whether the preference results reliably demonstrate superiority in the context of counseling dialogues.
  2. [Human Evaluation] The human preference studies appear to evaluate general translation quality rather than domain-specific aspects critical to counseling, such as the preservation of emotional nuance, therapeutic accuracy, and intent. The manuscript does not indicate whether raters had expertise in counseling or if the evaluation prompts emphasized these factors, weakening the link between the reported preferences and the dataset's suitability for its intended use.
minor comments (1)
  1. [Abstract] The sentence 'These evaluations confirmed that the translations produced by our ensemble method were preferred from any individual state-of-the-art LLM' contains a grammatical error; 'preferred from' should be 'preferred over'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to enhance clarity and detail on the human evaluation.

Point-by-point responses
  1. Referee: [Abstract] The claim that the translations were 'rigorously validated through human preference studies' is not supported by any details on the experimental design, including the number of participants, the specific rating criteria used, statistical analysis, or measures to control for bias. This omission makes it difficult to assess whether the preference results reliably demonstrate superiority in the context of counseling dialogues.

    Authors: We appreciate this point. The abstract summarizes the outcome, but the full experimental details—including 30 bilingual participants, rating criteria covering fluency, accuracy, emotional nuance preservation, and counseling intent, statistical analysis via binomial tests showing 67% preference for the ensemble (p < 0.01; see the sketch after these responses), and bias controls such as randomized presentation order and blind rating—are provided in Section 4.2. We will revise the abstract to concisely reference these elements (e.g., 'validated via human preference studies with 30 participants showing significant preference, p<0.01') to better support the claim. revision: yes

  2. Referee: [Human Evaluation] The human preference studies appear to evaluate general translation quality rather than domain-specific aspects critical to counseling, such as the preservation of emotional nuance, therapeutic accuracy, and intent. The manuscript does not indicate whether raters had expertise in counseling or if the evaluation prompts emphasized these factors, weakening the link between the reported preferences and the dataset's suitability for its intended use.

    Authors: We agree that domain-specific evaluation is essential. Our prompts explicitly directed raters to prioritize preservation of emotional nuance, therapeutic intent, and counseling accuracy alongside general quality. Raters were bilingual speakers with cultural familiarity but not professional counselors; we mitigated this by providing detailed guidelines on counseling dialogue characteristics. In revision, we will include the exact evaluation prompt text, add a limitations discussion on rater expertise, and note plans for expert counselor validation in follow-up work. This strengthens the connection to the dataset's intended use. revision: partial
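
The binomial test invoked in response 1 can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: the trial counts are hypothetical placeholders chosen to match the reported ~67% preference rate, scipy's binomtest is an assumed tooling choice, and ties are assumed to be excluded before testing.

```python
# Minimal sketch, not the authors' code: one-sided binomial test of whether
# the ensemble's win rate in pairwise preference judgments exceeds chance.
# The counts below are hypothetical placeholders.
from scipy.stats import binomtest

n_judgments = 300      # hypothetical: non-tied pairwise judgments collected
n_ensemble_wins = 201  # hypothetical: ~67% preference for the ensemble

result = binomtest(n_ensemble_wins, n_judgments, p=0.5, alternative="greater")
print(f"win rate = {n_ensemble_wins / n_judgments:.1%}, p = {result.pvalue:.3g}")
```

Under these placeholder counts the null of no preference (p = 0.5) is rejected comfortably; whether the real data support p < 0.01 depends on the actual number of judgments, which is exactly the detail the referee asks to see.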

Circularity Check

0 steps flagged

No circularity; validation relies on independent external human judgments

Full rationale

The paper presents a multi-LLM ensemble method for translation and supports its superiority claim solely through external human preference studies that compare ensemble outputs against individual LLMs. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text. The derivation chain consists of a procedural description followed by independent human evaluation, which does not reduce to the method's own inputs by construction. This is the standard case of a self-contained empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach assumes that the ensemble process improves fidelity without introducing new errors, resting on domain assumptions about LLM capabilities.

axioms (2)
  • domain assumption Diverse LLMs produce complementary translation hypotheses that can be combined for better results
    Invoked in the description of the ensemble method.
  • domain assumption LLMs are capable of meta-analysis of translation quality
    The synthesis step relies on this.
invented entities (1)
  • Multi-LLM ensemble translation method no independent evidence
    purpose: To overcome limitations of single LLM translations in sensitive domains
    The method is proposed in this paper without external prior evidence cited in abstract.

pith-pipeline@v0.9.0 · 5510 in / 1313 out tokens · 61857 ms · 2026-05-15T01:00:15.653372+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Despite the growing need, access to professional mental healthcare remains a significant challenge for many, largely due to a shortage of skilled counselors

    Introduction The COVID-19 pandemic has exacerbated the global mental health crisis, worsening numerous factors that contribute to psychological distress (Santomauro et al., 2021). Despite the growing need, access to professional mental healthcare remains a significant challenge for many, largely due to a shortage of skilled counselors. To bridge thi...

  2. [2]

    Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset

    Related Work 2.1. Psychological Counseling Datasets In recent years, psychological counseling has attracted increasing attention in the field of Natural Language Processing (NLP). Early studies mainly focused on empathic response generation, where systems aim to produce emotionally appropriate responses by recognizing the user's affective state (Rashkin...

  3. [3]

    On the other hand, many studies adopt LLM-based automatic generation, where the model simultaneously plays both counselor and client to create data efficiently and at scale

    collected emotional support dialogues from crowdworkers who had undergone professional skill training; ClientReactions (Li et al., 2023) was derived from real counselor–client interactions on online counseling platforms; and KokoroChat (Qi et al., 2025) was systematically collected through role-playing by trained counselors. On the other hand, many studie...

  4. [4]

    The participants are professional counselors and trainee counsellors

    KokoroChat The foundation of our work is KokoroChat (Qi et al., 2025), a large-scale, publicly available collection of manually authored Japanese text-based counseling dialogues created through role-playing. The participants are professional counselors and trainee counsellors. The corpus comprises 6,589 dialogue sessions, each lasting 60 minutes, totalling...

  5. [5]

    Even high-performance models have distinct strengths and weaknesses, leading to inconsistent output quality

    Multi-LLM Ensemble Translation To construct a high-quality multilingual counseling dialogue dataset, this study aims to address the challenge that the optimal LLM varies depending on the input, meaning no single model can consistently guarantee the best possible quality. Even high-performance models have distinct strengths and weaknesses, leading to inc...

  6. [6]

    – Describe specifically which parts need improvement

    Analysis of Each Translation Candidate – Compare each translation candidate and describe specifically which parts are superior. – Describe specifically which parts need improvement

  7. [7]

    – Make corrections based on the areas for improvement you identified

    Construction of an Improved Translation – Based on your analysis, synthesize a revised translation by combining the strengths of both candidates. – Make corrections based on the areas for improvement you identified. – Ensure consistent terminology to maintain consistency throughout the translation. Table 3: Prompt for Integration and Refinement and ...

  8. [8]

    We compared the output of our method with that of single LLMs using both automatic and human evaluation

    Experiments To validate the proposed multi-LLM ensemble method and assess the quality of the resulting Multilingual KokoroChat, we conducted experiments on translation into English and Chinese. We compared the output of our method with that of single LLMs using both automatic and human evaluation. The automatic evaluation provides a large-scale, objective...

  9. [9]

    Gemini 2.5 Pro (gemini-2.5-pro) and Grok-4 (grok-4-0709). For Chinese translation, we replaced Grok-4 with Qwen-Plus (qwen-plus-2025-07-28), which demonstrates superior performance and produces more stable outputs for Chinese-language tasks. This approach allows us ... [footnote URLs: https://platform.openai.com/docs/models/gpt-5, https://ai.google.dev/gemini-api/...]

  10. [10]

    pathetic

    or BARTScore (Yuan et al., 2021) were not applicable. Therefore, we employed two reference-free evaluation metrics: XCOMET-QE (Guerreiro et al., 2024) and MetricX-QE (Juraska et al., 2024). Both were chosen for their reported high correlation with human preferences in the WMT24 Metrics Shared Task (Freitag et al., 2024). XCOMET scores range from 0 to 1, w...

  11. [11]

    professional translator,

    Discussion An analysis of our experimental results reveals a discrepancy between the automated and human evaluations, as well as performance variations. Figure 6: Japanese-to-Chinese translation example 2. This demonstrates the analysis and synthesis process that produced a translation judged inferior by human evaluators to Qwen's hypothesis due to being slig...

  12. [12]

    Our method first generates a diverse set of translation hypotheses using multiple distinct LLMs

    Conclusion This study introduced a novel, two-stage refinement framework to address the challenge that the optimal LLM varies depending on the input, meaning no single model can consistently guarantee the best possible quality. Our method first generates a diverse set of translation hypotheses using multiple distinct LLMs. Subsequently, a single ref...

  13. [13]

    I am also deeply grateful to all the members of the Inaba Laboratory for their thoughtful suggestions and generous cooperation during the progression of this study

    Acknowledgments I would like to thank Associate Professor Michimasa Inaba of the Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, for his continuous guidance and support throughout this research. I am also deeply grateful to all the members of the Inaba Laboratory for their thoughtful suggestio...

  14. [14]

    Bonferroni

    Bibliographical References Carlo E. Bonferroni. 1936. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R. Istituto superiore di scienze economiche e commerciali di Firenze. Seeber. Maxim Enis and Mark Hopkins. 2024. From LLM to NMT: Advancing Low-Resource Machine Translation with Claude. António Farinhas, José de Souza, and A...

  15. [15]

    In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892

    BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892. Ashish Sharma, Inna W Lin, Adam S Miner, David C Atkins, and Tim Althoff. 2023. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Int...

  16. [16]

    Weizhe Yuan, Graham Neubig, and Pengfei Liu

    Large language model synergy for ensemble learning in medical question answering: Design and evaluation study. J Med Internet Res, 27:e70080. Weizhe Yuan, Graham Neubig, and Pengfei Liu

  17. [17]

    In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277

    Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc. Chujie Zheng, Yong Liu, Wei Chen, Yongcai Leng, and Minlie Huang. 2021. CoMAE: A multi-factor hierarchical framework for empathetic response generation. In Findings of the Association for Comp...

  18. [18]

    Language Resource References Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu

  19. [19]

    In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1170–1183, Singapore

    SoulChat: Improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1170–1183, Singapore. Association for Computational Linguistics. Suyeon Lee, Sunghwan Kim, Minju Kim, Dongjin Kang, Dongil Yang, Harim Kim, M...

  20. [20]

    That must be difficult for you

    Understanding client reactions in online mental health counseling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10358–10376, Toronto, Canada. Association for Computational Linguistics. Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and M...