pith. machine review for the scientific record.

arxiv: 2605.09893 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords pseudo-deliberation · value-action gap · LLM evaluation · dialogue alignment · value adherence · multi-agent auditing · language model safety

The pith

Large language models display pseudo-deliberation, where explicit reasoning about values fails to produce aligned actions in generated dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLMs exhibit a persistent value-action gap even when they engage in explicit reasoning about their values. This failure mode, called pseudo-deliberation, means models articulate principles but do not follow them in their outputs. To measure it, the authors created VALDI, a benchmark of 4,941 human-centered scenarios spanning five domains and three tasks. Testing across both proprietary and open-source LLMs showed consistent misalignment between expressed values and dialogue actions. The authors also introduce VIVALDI, a multi-agent system that audits and intervenes in the generation process.

Core claim

The central claim is that across both proprietary and open-source LLMs, there is consistent misalignment between expressed values and downstream dialogues, even under explicit reasoning, which the authors term pseudo-deliberation. This is demonstrated through systematic evaluation using the VALDI framework.

What carries the argument

VALDI, a framework comprising 4,941 human-centered scenarios across five domains; three tasks that elicit value articulation, reasoning, and action; and five metrics that quantify value adherence in generated dialogues.
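The paper's five metrics are not reproduced here, but a minimal sketch can show the kind of measurement VALDI performs: compare the values a model states against the values later detected in its dialogue, aggregated as a per-value macro F1 (the aggregation Figure 5 reports). The label set, the detection step, and all function names below are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a value-adherence score in the spirit of VALDI's
# metrics. The label set and detection step are placeholders; this is
# NOT claimed to match the paper's metric definitions.
from typing import List, Set

def macro_f1(stated: List[Set[str]], enacted: List[Set[str]],
             taxonomy: Set[str]) -> float:
    """Macro F1 of values detected in dialogue (enacted) against the
    values the model previously articulated (stated), per value label."""
    f1s = []
    for v in taxonomy:
        tp = sum(v in s and v in e for s, e in zip(stated, enacted))
        fp = sum(v not in s and v in e for s, e in zip(stated, enacted))
        fn = sum(v in s and v not in e for s, e in zip(stated, enacted))
        if tp + fp + fn == 0:
            continue  # value never occurs; skip rather than score 0
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Toy usage: stated values per scenario vs. values an (assumed) LLM
# judge detects in the generated dialogue.
stated  = [{"Benevolence"}, {"Tradition", "Security"}, {"Self-Direction"}]
enacted = [{"Benevolence"}, {"Security"},              {"Stimulation"}]
print(macro_f1(stated, enacted,
               {"Benevolence", "Tradition", "Security",
                "Self-Direction", "Stimulation"}))  # 0.4 on this toy data
```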

If this is right

  • Explicit reasoning steps do not eliminate the value-action gap in LLMs.
  • Both closed-source and open-source models exhibit similar levels of misalignment.
  • Interventions like the proposed VIVALDI multi-agent auditor can target different stages of generation to improve alignment.
  • The gap appears in dialogues across five domains of human-centered scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that training for value alignment may need to target behavioral consistency rather than verbal statements alone.
  • Applications relying on LLMs for ethical or value-sensitive decisions could be unreliable without additional safeguards.
  • Extending VALDI to more scenarios or real-world interactions could test the robustness of the observed misalignment.
  • The multi-agent approach in VIVALDI suggests a path toward modular value monitoring in AI systems; a sketch of that control flow follows below.
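Figure 4 describes VIVALDI-D as a staged pipeline: a ValueExtractor flags distorted values, a plan corrects the output, VIVALDI-R&D repairs at the reasoning level, and dialogue-level repair kicks in when alignment fails to propagate. A hedged skeleton of that control flow might look like the following; every name, signature, and the round limit are invented for illustration, not taken from the paper.

```python
# Illustrative skeleton of a staged value auditor in the spirit of
# VIVALDI (see Figure 4). All names and the stopping rule are
# hypothetical; the paper's agents and prompts are not reproduced.
from dataclasses import dataclass
from typing import Callable, Set, Tuple

@dataclass
class Audit:
    aligned: bool
    distorted: Set[str]  # values the auditor flags as violated

def staged_repair(
    reasoning: str,
    dialogue: str,
    judge: Callable[[str, str], Audit],            # ValueExtractor-style agent
    fix_reasoning: Callable[[str, Set[str]], str],
    regenerate: Callable[[str], str],              # dialogue from reasoning
    fix_dialogue: Callable[[str, Set[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, str]:
    # Stage 1: iterative reasoning-level repair (the VIVALDI-R&D idea).
    for _ in range(max_rounds):
        audit = judge(reasoning, dialogue)
        if audit.aligned:
            return reasoning, dialogue
        reasoning = fix_reasoning(reasoning, audit.distorted)
        dialogue = regenerate(reasoning)
    # Stage 2: dialogue-level repair when alignment fails to propagate.
    audit = judge(reasoning, dialogue)
    if not audit.aligned:
        dialogue = fix_dialogue(dialogue, audit.distorted)
    return reasoning, dialogue
```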

Load-bearing premise

The specific set of 4,941 scenarios and the five chosen metrics accurately reflect true value adherence in LLMs without bias introduced by scenario selection or task design.

What would settle it

Observing that LLMs generate dialogues aligned with their previously articulated values at a high rate across the VALDI scenarios would falsify the claim of persistent pseudo-deliberation.
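Concretely, such a test could estimate the per-scenario alignment rate with a bootstrap confidence interval (Figure 2 reports 95% bootstrap intervals) and compare it against a pre-registered bar. A minimal sketch, assuming binary per-scenario alignment judgments and an illustrative 0.9 threshold that does not come from the paper:

```python
# Sketch of the settling test: percentile-bootstrap a 95% CI on the
# alignment rate over VALDI scenarios. The 0.9 bar is an assumption
# for illustration, not a number from the paper.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of binary outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_boot))
    return (means[int(alpha / 2 * n_boot)],
            means[int((1 - alpha / 2) * n_boot) - 1])

# outcomes[i] = 1 if the dialogue for scenario i adhered to the model's
# previously stated values (per the VALDI metrics), else 0. Toy data:
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0]
lo, hi = bootstrap_ci(outcomes)
print(f"alignment rate 95% CI: [{lo:.2f}, {hi:.2f}]")
# Persistent pseudo-deliberation would be challenged if `lo` cleared a
# high pre-registered bar (say 0.9) across models and domains.
```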

Figures

Figures reproduced from arXiv: 2605.09893 by Hanwen Zhang, Hua Shen, Sushrita Rakshit.

Figure 1
Figure 1. Overview of the VALDI framework: generate the DAISY dataset, generate Fast and Slow dialogue, then evaluate the model's dialogue and reasoning via the alignment metrics. view at source ↗
Figure 2
Figure 2. Comparison of value–action alignment across Fast and Slow thinking with 95% bootstrap… view at source ↗
Figure 3
Figure 3. Reasoning decomposition metrics across values, showing rates of survival, suppression, and… view at source ↗
Figure 4
Figure 4. Overview of VIVALDI-D's multi-agent dialogue auditor: a ValueExtractor evaluates distorted values, rubricates them, and plans a correction of the output; VIVALDI-R&D performs iterative reasoning-level repair to improve alignment, with dialogue-level repair applied when alignment fails to propagate to the final output. view at source ↗
Figure 5
Figure 5. Per-value macro F1 alignment across Fast (T2), Slow (T3), and… view at source ↗
Figure 6
Figure 6. Prompt for LLM paraphrasing, where we provide GPT-4o with a scenario and request… view at source ↗
Figure 7
Figure 7. Full prompts used in VIVALDI-D for dialogue-level planning and rewriting. view at source ↗
Figure 8
Figure 8. Per-value value–action alignment across models and intervention variants. Error bars… view at source ↗
read the original abstract

Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo-Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs exhibit a 'Pseudo-Deliberation' failure mode in which explicit reasoning about values fails to produce aligned actions in downstream dialogue generation. To demonstrate this, the authors introduce VALDI, a benchmark comprising 4,941 human-centered scenarios across five domains, three tasks (value articulation, reasoning, and action), and five quantitative metrics of value adherence. Empirical evaluation across proprietary and open-source models reports consistent misalignment between expressed values and generated actions. The work also proposes VIVALDI, a multi-agent value auditor, as an intervention strategy.

Significance. If the empirical observations are robust, the result would indicate a systematic limitation in current LLMs' capacity for value-consistent deliberation, with direct relevance to AI safety, ethical deployment, and alignment research. The VALDI framework offers a structured, multi-task evaluation protocol that goes beyond single-prompt value elicitation, and the introduction of VIVALDI provides a concrete starting point for mitigation studies.

major comments (3)
  1. [VALDI framework (methods)] The central claim of consistent value-action misalignment across models rests on the five adherence metrics in VALDI. The abstract and methods description provide no information on how these metrics are formally defined, whether they were validated against human raters, or what controls were used for prompt sensitivity and inter-metric correlation; without such validation, it is unclear whether the reported misalignment reflects model behavior or metric artifacts. (A minimal correlation check of this kind is sketched after the minor comments.)
  2. [Scenario construction] The 4,941 scenarios are described as 'human-centered' across five domains, yet no details are given on scenario curation, potential selection biases, or inter-annotator agreement for scenario construction. If scenarios were chosen to surface conflicts already prevalent in training data, the observed misalignment could be an evaluation artifact rather than evidence of Pseudo-Deliberation.
  3. [Task design] The three tasks (articulation, reasoning, action) appear to be elicited via separate prompts. The paper does not specify whether the action-generation task receives the preceding reasoning as context or is run independently; if the latter, the decoupling between reasoning and action is built into the experimental design and does not demonstrate failure of deliberation.
minor comments (2)
  1. [Results] The abstract states 'consistent misalignment' without reporting effect sizes, confidence intervals, or statistical tests; these should be added to the results section.
  2. [Introduction] The acronym VIVALDI is introduced without expanding it on first use or clarifying its relationship to VALDI.
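For the inter-metric correlation the report asks about (major comment 1) and the missing statistics (minor comment 1), the check need not be elaborate. A minimal version is sketched below with placeholder metric scores; nothing here comes from the paper.

```python
# Minimal sketch of the requested inter-metric correlation check:
# Spearman rank correlation between two of VALDI's five adherence
# metrics over the same scenarios. All scores are placeholders.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r  # ties left unadjusted for brevity

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

metric_a = [0.8, 0.6, 0.9, 0.40, 0.7]  # e.g., consistency-style scores
metric_b = [0.7, 0.5, 0.8, 0.45, 0.6]  # e.g., similarity-style scores
print(spearman(metric_a, metric_b))    # near 1.0 => largely redundant metrics
```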

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where methodological details can be clarified and expanded, which we will address in the revision. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [VALDI framework (methods)] The central claim of consistent value-action misalignment across models rests on the five adherence metrics in VALDI. The abstract and methods description provide no information on how these metrics are formally defined, whether they were validated against human raters, or what controls were used for prompt sensitivity and inter-metric correlation; without such validation, it is unclear whether the reported misalignment reflects model behavior or metric artifacts.

    Authors: We agree that the methods section would benefit from greater detail on the metrics. We will revise to include formal definitions of each of the five adherence metrics (using mathematical notation for scores such as consistency and similarity measures), describe the prompt sensitivity controls (multiple template variations were tested with stable misalignment patterns), and report inter-metric correlations. Human rater validation was not performed, as the metrics are designed as automated quantitative measures; we will explicitly note this choice and its rationale in the revised text. revision: yes

  2. Referee: [Scenario construction] The 4,941 scenarios are described as 'human-centered' across five domains, yet no details are given on scenario curation, potential selection biases, or inter-annotator agreement for scenario construction. If scenarios were chosen to surface conflicts already prevalent in training data, the observed misalignment could be an evaluation artifact rather than evidence of Pseudo-Deliberation.

    Authors: We will expand the scenario construction subsection to detail the curation process, including adaptation of dilemmas from ethics and psychology literature into dialogue formats across the five domains. Potential selection biases will be discussed, along with how the scale and diversity of the 4,941 scenarios mitigate them. Inter-annotator agreement is not available because scenarios were developed internally by the authors using structured templates rather than independent annotators; we will acknowledge this limitation directly. revision: partial

  3. Referee: [Task design] The three tasks (articulation, reasoning, action) appear to be elicited via separate prompts. The paper does not specify whether the action-generation task receives the preceding reasoning as context or is run independently; if the latter, the decoupling between reasoning and action is built into the experimental design and does not demonstrate failure of deliberation.

    Authors: The action-generation task is provided with the preceding reasoning as context in the prompt template (e.g., 'Given the value articulation and reasoning below, generate the dialogue action...'). This setup is intended to test whether explicit reasoning produces aligned actions. We will revise the task design section to state this explicitly and include the complete prompt templates in the appendix to eliminate ambiguity. revision: yes
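For readers who want the shape of that chained setup, here is a hedged sketch: the action prompt receives the articulation and reasoning as context, paraphrasing the template quoted in the response. `call_llm` and all field names are placeholders, not the paper's actual templates (which the revision promises in the appendix).

```python
# Sketch of the chained elicitation described in response 3: the
# action task sees the stated values and reasoning in its prompt.
# The wording paraphrases the rebuttal's quoted example; `call_llm`
# stands in for any chat-completion client.
ACTION_TEMPLATE = """Given the value articulation and reasoning below, \
generate the dialogue action for the scenario.

Scenario: {scenario}
Stated values: {values}
Reasoning: {reasoning}

Dialogue action:"""

def elicit_action(call_llm, scenario: str, values: str, reasoning: str) -> str:
    prompt = ACTION_TEMPLATE.format(
        scenario=scenario, values=values, reasoning=reasoning)
    return call_llm(prompt)
```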

Circularity Check

0 steps flagged

No circularity in empirical measurement framework

full rationale

The paper is an empirical study that introduces VALDI as a new benchmark with 4,941 scenarios, three tasks, and five metrics to quantify value-action misalignment in LLMs, then reports observed inconsistencies across models and proposes VIVALDI as an intervention. No equations, derivations, parameter fittings, or self-referential definitions appear in the provided text. The central claim rests on direct application of the newly defined metrics to model outputs rather than any reduction to fitted inputs, self-citations, or ansatzes. The framework is self-contained as a measurement protocol without load-bearing reliance on prior author work or uniqueness theorems, satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on the assumption that the introduced frameworks and metrics validly measure alignment, plus the empirical observation from testing multiple LLMs.

axioms (2)
  • domain assumption Human-centered scenarios can reliably elicit and test model values
    Invoked in the design of the 4,941 scenarios across five domains.
  • domain assumption Stated values, reasoning traces, and generated dialogue can be compared via quantitative metrics
    Basis for the five metrics in VALDI.
invented entities (3)
  • Pseudo-Deliberation no independent evidence
    purpose: To label the observed failure mode of apparent reasoning without value-aligned action
    New conceptual term introduced to distinguish this from simple value-action gap.
  • VALDI no independent evidence
    purpose: Framework for systematic measurement of value adherence in dialogue
    New evaluation suite with scenarios, tasks, and metrics.
  • VIVALDI no independent evidence
    purpose: Multi-agent auditor to intervene during generation for better alignment
    Proposed intervention method.

pith-pipeline@v0.9.0 · 5452 in / 1342 out tokens · 53387 ms · 2026-05-12T05:04:22.438786+00:00 · methodology

discussion (0)

