pith. machine review for the scientific record.

arxiv: 2605.09893 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords pseudo-deliberation · value-action gap · LLM evaluation · dialogue alignment · value adherence · multi-agent auditing · language model safety

The pith

Large language models display pseudo-deliberation, where explicit reasoning about values fails to produce aligned actions in generated dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLMs exhibit a persistent value-action gap even when they engage in explicit reasoning about their values. This failure mode, called pseudo-deliberation, means models articulate principles but do not follow them in their outputs. To measure it, the authors created VALDI, a benchmark of 4,941 human-centered scenarios spanning five domains and three tasks. Testing across both proprietary and open-source LLMs showed consistent misalignment between expressed values and dialogue actions. The authors also introduce VIVALDI, a multi-agent system that audits and intervenes in the generation process.

Core claim

The central claim is that across both proprietary and open-source LLMs, there is consistent misalignment between expressed values and downstream dialogues, even under explicit reasoning, which the authors term pseudo-deliberation. This is demonstrated through systematic evaluation using the VALDI framework.

What carries the argument

VALDI, a framework comprising 4,941 human-centered scenarios across five domains; three tasks that elicit value articulation, reasoning, and action; and five metrics that quantify value adherence in generated dialogues.
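The paper's five metrics are not reproduced here, but a minimal sketch can show the kind of measurement VALDI performs: compare the values a model states against the values later detected in its dialogue, aggregated as a per-value macro F1 (the aggregation Figure 5 reports). The label set, the detection step, and all function names below are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a value-adherence score in the spirit of VALDI's
# metrics. The label set and detection step are placeholders; this is
# NOT claimed to match the paper's metric definitions.
from typing import List, Set

def macro_f1(stated: List[Set[str]], enacted: List[Set[str]],
             taxonomy: Set[str]) -> float:
    """Macro F1 of values detected in dialogue (enacted) against the
    values the model previously articulated (stated), per value label."""
    f1s = []
    for v in taxonomy:
        tp = sum(v in s and v in e for s, e in zip(stated, enacted))
        fp = sum(v not in s and v in e for s, e in zip(stated, enacted))
        fn = sum(v in s and v not in e for s, e in zip(stated, enacted))
        if tp + fp + fn == 0:
            continue  # value never occurs; skip rather than score 0
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Toy usage: stated values per scenario vs. values an (assumed) LLM
# judge detects in the generated dialogue.
stated  = [{"Benevolence"}, {"Tradition", "Security"}, {"Self-Direction"}]
enacted = [{"Benevolence"}, {"Security"},              {"Stimulation"}]
print(macro_f1(stated, enacted,
               {"Benevolence", "Tradition", "Security",
                "Self-Direction", "Stimulation"}))  # 0.4 on this toy data
```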

If this is right

  • Explicit reasoning steps do not eliminate the value-action gap in LLMs.
  • Both closed-source and open-source models exhibit similar levels of misalignment.
  • Interventions like the proposed VIVALDI multi-agent auditor can target different stages of generation to improve alignment.
  • The gap appears in dialogues across five domains of human-centered scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that training for value alignment may need to target behavioral consistency rather than verbal statements alone.
  • Applications relying on LLMs for ethical or value-sensitive decisions could be unreliable without additional safeguards.
  • Extending VALDI to more scenarios or real-world interactions could test the robustness of the observed misalignment.
  • The multi-agent approach in VIVALDI suggests a path toward modular value monitoring in AI systems; a sketch of that control flow follows below.
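Figure 4 describes VIVALDI-D as a staged pipeline: a ValueExtractor flags distorted values, a plan corrects the output, VIVALDI-R&D repairs at the reasoning level, and dialogue-level repair kicks in when alignment fails to propagate. A hedged skeleton of that control flow might look like the following; every name, signature, and the round limit are invented for illustration, not taken from the paper.

```python
# Illustrative skeleton of a staged value auditor in the spirit of
# VIVALDI (see Figure 4). All names and the stopping rule are
# hypothetical; the paper's agents and prompts are not reproduced.
from dataclasses import dataclass
from typing import Callable, Set, Tuple

@dataclass
class Audit:
    aligned: bool
    distorted: Set[str]  # values the auditor flags as violated

def staged_repair(
    reasoning: str,
    dialogue: str,
    judge: Callable[[str, str], Audit],            # ValueExtractor-style agent
    fix_reasoning: Callable[[str, Set[str]], str],
    regenerate: Callable[[str], str],              # dialogue from reasoning
    fix_dialogue: Callable[[str, Set[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, str]:
    # Stage 1: iterative reasoning-level repair (the VIVALDI-R&D idea).
    for _ in range(max_rounds):
        audit = judge(reasoning, dialogue)
        if audit.aligned:
            return reasoning, dialogue
        reasoning = fix_reasoning(reasoning, audit.distorted)
        dialogue = regenerate(reasoning)
    # Stage 2: dialogue-level repair when alignment fails to propagate.
    audit = judge(reasoning, dialogue)
    if not audit.aligned:
        dialogue = fix_dialogue(dialogue, audit.distorted)
    return reasoning, dialogue
```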

Load-bearing premise

The specific set of 4,941 scenarios and the five chosen metrics accurately reflect true value adherence in LLMs without bias introduced by scenario selection or task design.

What would settle it

Observing that LLMs generate dialogues aligned with their previously articulated values at a high rate across the VALDI scenarios would falsify the claim of persistent pseudo-deliberation.
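Concretely, such a test could estimate the per-scenario alignment rate with a bootstrap confidence interval (Figure 2 reports 95% bootstrap intervals) and compare it against a pre-registered bar. A minimal sketch, assuming binary per-scenario alignment judgments and an illustrative 0.9 threshold that does not come from the paper:

```python
# Sketch of the settling test: percentile-bootstrap a 95% CI on the
# alignment rate over VALDI scenarios. The 0.9 bar is an assumption
# for illustration, not a number from the paper.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of binary outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_boot))
    return (means[int(alpha / 2 * n_boot)],
            means[int((1 - alpha / 2) * n_boot) - 1])

# outcomes[i] = 1 if the dialogue for scenario i adhered to the model's
# previously stated values (per the VALDI metrics), else 0. Toy data:
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0]
lo, hi = bootstrap_ci(outcomes)
print(f"alignment rate 95% CI: [{lo:.2f}, {hi:.2f}]")
# Persistent pseudo-deliberation would be challenged if `lo` cleared a
# high pre-registered bar (say 0.9) across models and domains.
```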

Figures

Figures reproduced from arXiv: 2605.09893 by Hanwen Zhang, Hua Shen, Sushrita Rakshit.

Figure 1
Figure 1. Overview of the VALDI framework: generate the DAISY dataset, generate Fast and Slow dialogue, then evaluate the model's dialogue and reasoning via the alignment metrics. view at source ↗
Figure 2
Figure 2. Comparison of value–action alignment across Fast and Slow thinking with 95% bootstrap… view at source ↗
Figure 3
Figure 3. Reasoning decomposition metrics across values, showing rates of survival, suppression, and… view at source ↗
Figure 4
Figure 4. Overview of VIVALDI-D's multi-agent dialogue auditor: a ValueExtractor evaluates distorted values, rubricates them, and plans a correction of the output; VIVALDI-R&D performs iterative reasoning-level repair to improve alignment, with dialogue-level repair applied when alignment fails to propagate to the final output. view at source ↗
Figure 5
Figure 5. Per-value macro F1 alignment across Fast (T2), Slow (T3), and… view at source ↗
Figure 6
Figure 6. Prompt for LLM paraphrasing, where we provide GPT-4o with a scenario and request… view at source ↗
Figure 7
Figure 7. Full prompts used in VIVALDI-D for dialogue-level planning and rewriting. view at source ↗
Figure 8
Figure 8. Per-value value–action alignment across models and intervention variants. Error bars… view at source ↗
read the original abstract

Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo-Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs exhibit a 'Pseudo-Deliberation' failure mode in which explicit reasoning about values fails to produce aligned actions in downstream dialogue generation. To demonstrate this, the authors introduce VALDI, a benchmark comprising 4,941 human-centered scenarios across five domains, three tasks (value articulation, reasoning, and action), and five quantitative metrics of value adherence. Empirical evaluation across proprietary and open-source models reports consistent misalignment between expressed values and generated actions. The work also proposes VIVALDI, a multi-agent value auditor, as an intervention strategy.

Significance. If the empirical observations are robust, the result would indicate a systematic limitation in current LLMs' capacity for value-consistent deliberation, with direct relevance to AI safety, ethical deployment, and alignment research. The VALDI framework offers a structured, multi-task evaluation protocol that goes beyond single-prompt value elicitation, and the introduction of VIVALDI provides a concrete starting point for mitigation studies.

major comments (3)
  1. [VALDI framework (methods)] The central claim of consistent value-action misalignment across models rests on the five adherence metrics in VALDI. The abstract and methods description provide no information on how these metrics are formally defined, whether they were validated against human raters, or what controls were used for prompt sensitivity and inter-metric correlation; without such validation, it is unclear whether the reported misalignment reflects model behavior or metric artifacts. (A minimal correlation check of this kind is sketched after the minor comments.)
  2. [Scenario construction] The 4,941 scenarios are described as 'human-centered' across five domains, yet no details are given on scenario curation, potential selection biases, or inter-annotator agreement for scenario construction. If scenarios were chosen to surface conflicts already prevalent in training data, the observed misalignment could be an evaluation artifact rather than evidence of Pseudo-Deliberation.
  3. [Task design] The three tasks (articulation, reasoning, action) appear to be elicited via separate prompts. The paper does not specify whether the action-generation task receives the preceding reasoning as context or is run independently; if the latter, the decoupling between reasoning and action is built into the experimental design and does not demonstrate failure of deliberation.
minor comments (2)
  1. [Results] The abstract states 'consistent misalignment' without reporting effect sizes, confidence intervals, or statistical tests; these should be added to the results section.
  2. [Introduction] The acronym VIVALDI is introduced without expanding it on first use or clarifying its relationship to VALDI.
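For the inter-metric correlation the report asks about (major comment 1) and the missing statistics (minor comment 1), the check need not be elaborate. A minimal version is sketched below with placeholder metric scores; nothing here comes from the paper.

```python
# Minimal sketch of the requested inter-metric correlation check:
# Spearman rank correlation between two of VALDI's five adherence
# metrics over the same scenarios. All scores are placeholders.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r  # ties left unadjusted for brevity

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

metric_a = [0.8, 0.6, 0.9, 0.40, 0.7]  # e.g., consistency-style scores
metric_b = [0.7, 0.5, 0.8, 0.45, 0.6]  # e.g., similarity-style scores
print(spearman(metric_a, metric_b))    # near 1.0 => largely redundant metrics
```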

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where methodological details can be clarified and expanded, which we will address in the revision. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [VALDI framework (methods)] The central claim of consistent value-action misalignment across models rests on the five adherence metrics in VALDI. The abstract and methods description provide no information on how these metrics are formally defined, whether they were validated against human raters, or what controls were used for prompt sensitivity and inter-metric correlation; without such validation, it is unclear whether the reported misalignment reflects model behavior or metric artifacts.

    Authors: We agree that the methods section would benefit from greater detail on the metrics. We will revise to include formal definitions of each of the five adherence metrics (using mathematical notation for scores such as consistency and similarity measures), describe the prompt sensitivity controls (multiple template variations were tested with stable misalignment patterns), and report inter-metric correlations. Human rater validation was not performed, as the metrics are designed as automated quantitative measures; we will explicitly note this choice and its rationale in the revised text. revision: yes

  2. Referee: [Scenario construction] The 4,941 scenarios are described as 'human-centered' across five domains, yet no details are given on scenario curation, potential selection biases, or inter-annotator agreement for scenario construction. If scenarios were chosen to surface conflicts already prevalent in training data, the observed misalignment could be an evaluation artifact rather than evidence of Pseudo-Deliberation.

    Authors: We will expand the scenario construction subsection to detail the curation process, including adaptation of dilemmas from ethics and psychology literature into dialogue formats across the five domains. Potential selection biases will be discussed, along with how the scale and diversity of the 4,941 scenarios mitigate them. Inter-annotator agreement is not available because scenarios were developed internally by the authors using structured templates rather than independent annotators; we will acknowledge this limitation directly. revision: partial

  3. Referee: [Task design] The three tasks (articulation, reasoning, action) appear to be elicited via separate prompts. The paper does not specify whether the action-generation task receives the preceding reasoning as context or is run independently; if the latter, the decoupling between reasoning and action is built into the experimental design and does not demonstrate failure of deliberation.

    Authors: The action-generation task is provided with the preceding reasoning as context in the prompt template (e.g., 'Given the value articulation and reasoning below, generate the dialogue action...'). This setup is intended to test whether explicit reasoning produces aligned actions. We will revise the task design section to state this explicitly and include the complete prompt templates in the appendix to eliminate ambiguity. revision: yes
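For readers who want the shape of that chained setup, here is a hedged sketch: the action prompt receives the articulation and reasoning as context, paraphrasing the template quoted in the response. `call_llm` and all field names are placeholders, not the paper's actual templates (which the revision promises in the appendix).

```python
# Sketch of the chained elicitation described in response 3: the
# action task sees the stated values and reasoning in its prompt.
# The wording paraphrases the rebuttal's quoted example; `call_llm`
# stands in for any chat-completion client.
ACTION_TEMPLATE = """Given the value articulation and reasoning below, \
generate the dialogue action for the scenario.

Scenario: {scenario}
Stated values: {values}
Reasoning: {reasoning}

Dialogue action:"""

def elicit_action(call_llm, scenario: str, values: str, reasoning: str) -> str:
    prompt = ACTION_TEMPLATE.format(
        scenario=scenario, values=values, reasoning=reasoning)
    return call_llm(prompt)
```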

Circularity Check

0 steps flagged

No circularity in empirical measurement framework

full rationale

The paper is an empirical study that introduces VALDI as a new benchmark with 4,941 scenarios, three tasks, and five metrics to quantify value-action misalignment in LLMs, then reports observed inconsistencies across models and proposes VIVALDI as an intervention. No equations, derivations, parameter fittings, or self-referential definitions appear in the provided text. The central claim rests on direct application of the newly defined metrics to model outputs rather than any reduction to fitted inputs, self-citations, or ansatzes. The framework is self-contained as a measurement protocol without load-bearing reliance on prior author work or uniqueness theorems, satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on the assumption that the introduced frameworks and metrics validly measure alignment, plus the empirical observation from testing multiple LLMs.

axioms (2)
  • domain assumption Human-centered scenarios can reliably elicit and test model values
    Invoked in the design of the 4,941 scenarios across five domains.
  • domain assumption Stated values, reasoning traces, and generated dialogue can be compared via quantitative metrics
    Basis for the five metrics in VALDI.
invented entities (3)
  • Pseudo-Deliberation no independent evidence
    purpose: To label the observed failure mode of apparent reasoning without value-aligned action
    New conceptual term introduced to distinguish this from simple value-action gap.
  • VALDI no independent evidence
    purpose: Framework for systematic measurement of value adherence in dialogue
    New evaluation suite with scenarios, tasks, and metrics.
  • VIVALDI no independent evidence
    purpose: Multi-agent auditor to intervene during generation for better alignment
    Proposed intervention method.

pith-pipeline@v0.9.0 · 5452 in / 1342 out tokens · 53387 ms · 2026-05-12T05:04:22.438786+00:00 · methodology

discussion (0)

