SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression
Pith reviewed 2026-05-10 04:59 UTC · model grok-4.3
The pith
Language models can self-evolve both accurate emotion recognition and consistent responses by generating and filtering their own dialogue data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SELF-EMO shows that a role-based self-play paradigm, in which the model simultaneously acts as emotion recognizer and dialogue responder, generates diverse conversational trajectories that a smoothed IoU-based reward can filter into high-quality self-supervised data; when the model is then optimized with the SELF-GRPO algorithm, which adds multi-label alignment and group-consistency rewards, the process yields state-of-the-art emotion recognition accuracy on IEMOCAP, MELD, and EmoryNLP while producing more consistent emotional expressions.
What carries the argument
The data flywheel that uses a smoothed IoU-based reward to filter self-generated role-play trajectories and returns only the highest-quality samples for continued self-improvement.
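This flywheel can be sketched minimally. The paper names only a "smoothed IoU-based reward" without giving its formula, so the `eps` smoothing term, the field names, and the 0.8 threshold below are all assumptions for illustration:

```python
def smoothed_iou(pred, expr, eps=0.1):
    # pred: label set from the recognizer role; expr: labels the responder
    # expressed. eps is a hypothetical smoothing term -- the paper only
    # names a "smoothed IoU-based reward" without specifying it.
    inter = len(pred & expr)
    union = len(pred | expr)
    return (inter + eps) / (union + eps)

def filter_trajectories(trajectories, threshold=0.8):
    # Keep only self-consistent (prediction, expression) pairs; the kept
    # subset is fed back as training data for the next self-play round.
    return [t for t in trajectories
            if smoothed_iou(t["pred"], t["expr"]) >= threshold]

batch = [
    {"pred": {"joy"}, "expr": {"joy"}},           # consistent: kept
    {"pred": {"sad"}, "expr": {"sad", "anger"}},  # partial overlap: dropped
    {"pred": {"fear"}, "expr": {"joy"}},          # disjoint: dropped
]
kept = filter_trajectories(batch)  # only the first trajectory survives
```

Note that both arguments to `smoothed_iou` come from the same model, which is exactly the circularity the editorial analysis below flags.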
If this is right
- Large language models can generate and curate their own training data for emotional tasks at arbitrary scale.
- Recognition accuracy on standard conversation datasets rises by several percentage points across model sizes.
- The same self-play and filtering loop produces measurable gains in emotional consistency without added human labels.
- The approach works for both 4B and 8B parameter models, indicating generalization beyond a single size.
- Continuous self-improvement becomes possible because each iteration supplies new filtered trajectories for the next round.
Where Pith is reading between the lines
- The same self-play structure could transfer to other dialogue skills where recognition and generation must stay aligned, such as intent tracking or sarcasm handling.
- Over long conversations the method may reduce drift in emotional tone that currently appears in extended interactions with language models.
- If the reward signal proves robust, similar autonomous loops could lower the cost of building socially aware AI systems in domains that lack large annotated corpora.
- The framework suggests a route toward models that adapt their emotional style based on ongoing self-generated experience rather than fixed training sets.
Load-bearing premise
Better emotion prediction accuracy will directly produce more consistent and appropriate emotional expressions in the model's own generated responses.
What would settle it
A side-by-side comparison in which models trained under SELF-EMO achieve the reported accuracy gains on the three benchmarks yet human raters judge their dialogue responses as no more emotionally coherent or consistent than those of the original baseline models.
Original abstract
Emotion Recognition in Conversation (ERC) has become a fundamental capability for large language models (LLMs) in human-centric interaction. Beyond accurate recognition, coherent emotional expression is also crucial, yet both are limited by the scarcity and static nature of high-quality annotated data. In this work, we propose SELF-EMO, a self-evolution framework grounded in the hypothesis that better emotion prediction leads to more consistent emotional responses. We introduce two auxiliary tasks, emotional understanding and emotional expression, and design a role-based self-play paradigm where the model acts as both an emotion recognizer and a dialogue responder. Through iterative interactions, the model generates diverse conversational trajectories, enabling scalable data generation. To ensure quality, we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision. We further develop SELF-GRPO, a reinforcement learning algorithm that stabilizes optimization with multi-label alignment rewards and group-level consistency signals. Experiments on IEMOCAP, MELD, and EmoryNLP show that SELF-EMO achieves state-of-the-art performance, improving accuracy by +6.33% on Qwen3-4B and +8.54% on Qwen3-8B, demonstrating strong effectiveness and generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SELF-EMO, a self-evolution framework for LLMs that jointly improves emotion recognition in conversation (ERC) and consistent emotional expression. It introduces auxiliary tasks for emotional understanding and expression, employs role-based self-play to generate conversational trajectories, applies a data flywheel that filters samples via a smoothed IoU-based reward, and optimizes with the SELF-GRPO reinforcement learning algorithm using multi-label alignment and group-level consistency signals. The central claim is that this closed-loop process, without external supervision, yields state-of-the-art ERC performance on IEMOCAP, MELD, and EmoryNLP, with accuracy gains of +6.33% on Qwen3-4B and +8.54% on Qwen3-8B.
Significance. If the self-reinforcement mechanism can be shown to produce genuine consistency gains rather than amplified prediction errors, the approach would provide a scalable, annotation-free route to training emotionally coherent LLMs, directly addressing data scarcity in human-centric dialogue systems.
major comments (3)
- [Experiments] Experiments section (and abstract): the reported SOTA improvements lack any description of baselines, ablation controls, run counts, error bars, or statistical significance tests, leaving the quantitative claims unverifiable and the contribution of the self-evolution components unisolated.
- [§3.2] §3.2 (Data Flywheel mechanism): the smoothed IoU-based reward is computed solely from the model's own emotion predictions and generated responses; no external label, human judgment, or held-out validation set is used to break potential circular reinforcement of systematic mislabelings (e.g., consistent confusion between “frustrated” and “sad”).
- [§4] §4 (SELF-GRPO): the multi-label alignment rewards and group-level consistency signals are introduced at a high level without explicit equations or comparison to standard GRPO variants, so it is unclear how they prevent the optimization from simply reinforcing the model's initial biases rather than improving true emotional consistency.
minor comments (1)
- [Abstract] Abstract: the claimed accuracy deltas are given without the underlying metric (e.g., weighted F1, accuracy per emotion class) or the precise comparison models beyond the Qwen3 family.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating revisions where the manuscript will be updated to improve clarity and verifiability while preserving the core self-evolution approach.
Point-by-point responses
-
Referee: [Experiments] Experiments section (and abstract): the reported SOTA improvements lack any description of baselines, ablation controls, run counts, error bars, or statistical significance tests, leaving the quantitative claims unverifiable and the contribution of the self-evolution components unisolated.
Authors: We agree that the current manuscript provides insufficient experimental detail. In the revised version we will expand the Experiments section to explicitly list all baselines (including prior SOTA methods on IEMOCAP, MELD, and EmoryNLP), present ablation studies isolating the data flywheel, SELF-GRPO, and auxiliary tasks, report the number of independent runs, include error bars, and add statistical significance tests (e.g., paired t-tests or McNemar’s test) for the reported accuracy gains. These additions will make the claims verifiable and clarify the isolated contribution of each self-evolution component. revision: yes
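The McNemar test the authors propose compares two classifiers on paired predictions using only the discordant counts. A minimal exact (binomial) version, with illustrative counts that are not from the paper:

```python
from math import comb

def mcnemar_exact(b, c):
    # b: utterances the baseline gets right and SELF-EMO gets wrong;
    # c: the reverse. Only these discordant pairs matter.
    # Exact two-sided p-value under Binomial(b + c, 0.5).
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: a strong asymmetry in the discordant pairs is
# significant at the 0.05 level, a mild one is not.
p_strong = mcnemar_exact(8, 22)
p_weak = mcnemar_exact(12, 18)
```

Because the test conditions on paired disagreements rather than aggregate accuracy, it is well suited to the per-utterance comparisons these ERC benchmarks allow.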
-
Referee: [§3.2] §3.2 (Data Flywheel mechanism): the smoothed IoU-based reward is computed solely from the model's own emotion predictions and generated responses; no external label, human judgment, or held-out validation set is used to break potential circular reinforcement of systematic mislabelings (e.g., consistent confusion between “frustrated” and “sad”).
Authors: The absence of external labels is intentional, as the framework is designed to operate without additional supervision. The smoothed IoU reward measures consistency between the model’s own emotion predictions and the responses it generates within the same trajectory; the smoothing term and iterative filtering are meant to penalize inconsistent pairs and gradually improve both recognition and expression. We acknowledge the risk of reinforcing initial label confusions. In the revision we will add a dedicated paragraph in §3.2 explaining the smoothing mechanism in more detail, include an analysis of label-distribution stability across iterations on a held-out subset of the training dialogues, and discuss failure modes such as persistent confusion between similar emotions. No external labels will be introduced, as that would alter the annotation-free premise. revision: partial
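The label-distribution stability analysis the authors promise could be as simple as tracking the total variation distance between successive iterations' predicted-label histograms, where a large jump flags collapse toward a few labels. A sketch, with an invented emotion class set and toy label lists:

```python
from collections import Counter

CLASSES = ["joy", "sad", "anger", "fear", "neutral"]  # invented class set

def label_distribution(labels):
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in CLASSES]

def total_variation(p, q):
    # TV distance in [0, 1]; a large value between successive iterations
    # signals drift, e.g. "frustrated"/"sad" collapsing into one label.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

iter1 = ["joy", "sad", "sad", "anger", "neutral", "neutral"]
iter2 = ["joy", "joy", "joy", "joy", "neutral", "neutral"]  # collapsing toward "joy"
drift = total_variation(label_distribution(iter1), label_distribution(iter2))
```

A monitoring threshold on this drift would not break the annotation-free premise, since it needs only the model's own predictions on a fixed held-out dialogue set.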
-
Referee: [§4] §4 (SELF-GRPO): the multi-label alignment rewards and group-level consistency signals are introduced at a high level without explicit equations or comparison to standard GRPO variants, so it is unclear how they prevent the optimization from simply reinforcing the model's initial biases rather than improving true emotional consistency.
Authors: We will revise §4 to include the full mathematical definitions of the multi-label alignment reward (formulated as a sum of per-emotion binary cross-entropy terms aligned across the group) and the group-level consistency signal (computed as the variance of emotion-label distributions within each GRPO group). We will also add a direct comparison table and discussion contrasting SELF-GRPO with vanilla GRPO, highlighting how the group-level term discourages collapse to the model’s initial bias by enforcing intra-group agreement on both recognition and expression. These equations and comparisons will be placed in the main text rather than the appendix. revision: yes
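From the rebuttal's verbal description alone, the two SELF-GRPO reward terms might look like the following sketch; the exact equations, the probability parameterization, and the trade-off weight `lam` are assumptions, not the paper's definitions:

```python
import math

def alignment_reward(probs, targets):
    # Negated mean per-emotion binary cross-entropy (higher is better),
    # following the rebuttal's "sum of per-emotion BCE terms" wording.
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp for numerical safety
        bce -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return -bce / len(probs)

def consistency_penalty(group_probs):
    # Mean per-emotion variance across the rollouts in one GRPO group;
    # zero when every rollout agrees on the emotion distribution.
    n, k = len(group_probs), len(group_probs[0])
    total = 0.0
    for j in range(k):
        col = [g[j] for g in group_probs]
        mean = sum(col) / n
        total += sum((x - mean) ** 2 for x in col) / n
    return total / k

def self_grpo_reward(probs, targets, group_probs, lam=0.5):
    # lam is an assumed trade-off weight, not a value from the paper.
    return alignment_reward(probs, targets) - lam * consistency_penalty(group_probs)
```

Under this sketch, a rollout is rewarded for matching the (self-generated) target labels and penalized when its group disagrees internally, which is how intra-group agreement would counteract collapse toward the model's initial bias.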
Circularity Check
Self-generated trajectories filtered by a model-derived IoU reward reduce self-improvement to internal outputs by construction
specific steps
-
self-definitional
[Abstract]
"we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision"
The reward and selection criteria are computed from the model's self-generated trajectories and emotion predictions (IoU between recognizer outputs and responder outputs), so the 'high-quality' subset fed back is defined in terms of the same model's internal quantities; the claimed self-evolution therefore reduces to its own inputs by construction.
full rationale
The core derivation chain in the abstract is a closed self-play loop: the model generates trajectories, applies its own smoothed IoU reward for filtering, and retrains on the selected samples. This is self-definitional because both the data and the quality signal are defined entirely in terms of the model's predictions and responses, with the 'improvement' therefore equivalent to reinforcing its own outputs rather than an independent external signal. Final benchmark numbers do not break the circularity in the training derivation itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: better emotion prediction leads to more consistent emotional responses.