
arxiv: 2606.30116 · v1 · pith:OZMR6PP7new · submitted 2026-06-29 · 💻 cs.AI

Open Problems in Constitutional Preference Reconstruction

Pith reviewed 2026-06-30 06:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords constitutional AI · preference reconstruction · pairwise preferences · LLM judges · interpretability · RLHF

The pith

A constitution of principles is not an executable decision rule until it is paired with a specific executor and model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats pairwise preference data as a testbed for constitutional methods, which compress recorded choices into short lists of natural-language principles. It identifies three gaps. First, coverage and accuracy are useful but incomplete proxies for principle quality. Second, principle composition is ambiguous: given the same principles, different executors (LLM judge versus majority vote) agree only 73 percent of the time. Third, constitutions are model-dependent: cross-model vote agreement is 73 percent, against 81 percent intra-model. Refinement via ICAI+ raises inter-executor agreement to 78 percent and lets transparent executors reach 66 percent accuracy, against 67 percent for an LLM judge. The central argument is that a constitution must be evaluated as part of a full constitution-executor system.

Core claim

Holding principles fixed, different executors agree only 73 percent of the time, and vote agreement is 73 percent across models versus 81 percent within a model. Principle refinement raises executor agreement to 78 percent and lets transparent executors nearly match LLM-judge accuracy (66 percent versus 67 percent).

What carries the argument

The constitution-executor system, in which a flat list of natural-language principles is combined with a concrete decision procedure such as an LLM judge or majority vote.
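
To make the executor idea concrete, here is a minimal sketch of a majority-vote executor and of the agreement rate between two executors. It is an illustration under stated assumptions, not the paper's implementation: the names `majority_vote` and `agreement_rate`, the tie-breaking rule, and the toy votes and judge decisions are all hypothetical.

```python
from collections import Counter
from typing import List, Optional

# A vote is "A", "B", or None, where None stands for N/A (principle not applicable).
Vote = Optional[str]


def majority_vote(votes: List[Vote], tie_break: str = "A") -> str:
    """Return the response that most principles vote for, ignoring N/A votes.

    Assumption: ties (including the all-N/A case) fall back to `tie_break`.
    The source does not say how ties are resolved.
    """
    counts = Counter(v for v in votes if v is not None)
    if counts["A"] == counts["B"]:
        return tie_break
    return "A" if counts["A"] > counts["B"] else "B"


def agreement_rate(decisions_1: List[str], decisions_2: List[str]) -> float:
    """Fraction of pairs on which two executors pick the same response."""
    if len(decisions_1) != len(decisions_2) or not decisions_1:
        raise ValueError("decision lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(decisions_1, decisions_2))
    return matches / len(decisions_1)


# Toy data: one row per preference pair, one column per principle.
toy_votes = [
    ["A", "A", None],
    ["B", None, "B"],
    ["A", "B", None],   # tie, resolved by tie_break
    [None, None, "B"],
]
majority_decisions = [majority_vote(row) for row in toy_votes]
judge_decisions = ["A", "B", "B", "A"]  # hypothetical LLM-judge outputs

print(majority_decisions)                                   # ['A', 'B', 'A', 'B']
print(agreement_rate(majority_decisions, judge_decisions))  # 0.5
```

Even in this toy case the tie-breaking rule changes the outcome on the third pair, which is one small instance of the composition choices a flat principle list leaves open.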

If this is right

  • Constitutions must be tested together with their chosen executor rather than as standalone lists.
  • Principle refinement can measurably reduce executor disagreement.
  • Transparent executors can reach accuracy comparable to LLM judges once principles are refined.
  • Constitutions produced for one model do not transfer directly to another.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same composition ambiguities are likely to appear when constitutions guide multi-turn conversations or open-ended generation.
  • Standardizing the executor step could reduce model-to-model inconsistency in LLM-as-judge applications.
  • The observed gaps suggest that interpretability gains from constitutions are currently limited by execution details rather than principle content alone.

Load-bearing premise

The pairwise setting is sufficient to reveal the composition ambiguities that would appear in richer preference or generation tasks.

What would settle it

Measure whether executor agreement and cross-model agreement rates remain below 80 percent when the same constitutions are applied to full generation or multi-turn preference data instead of isolated pairwise choices.
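
As a rough sketch of how such a check might be set up, the snippet below computes vote-level agreement between two annotators and compares it with the 80 percent threshold named above. The function name `vote_agreement`, the choice to count N/A as an ordinary vote value, the use of repeated runs as the intra-model baseline, and the toy matrices are all assumptions; the source does not specify how the paper computes these rates.

```python
from typing import List, Optional

Vote = Optional[str]            # "A", "B", or None for N/A
VoteMatrix = List[List[Vote]]   # rows are preference pairs, columns are principles


def vote_agreement(votes_x: VoteMatrix, votes_y: VoteMatrix) -> float:
    """Fraction of (pair, principle) cells on which two annotators cast the same vote.

    Assumption: N/A is treated as a vote value, so N/A versus N/A counts as agreement.
    """
    if len(votes_x) != len(votes_y):
        raise ValueError("vote matrices must have the same number of rows")
    total = 0
    same = 0
    for row_x, row_y in zip(votes_x, votes_y):
        if len(row_x) != len(row_y):
            raise ValueError("rows must have the same number of principles")
        for vx, vy in zip(row_x, row_y):
            total += 1
            same += int(vx == vy)
    if total == 0:
        raise ValueError("vote matrices must not be empty")
    return same / total


# Toy votes: two runs of one hypothetical model and one run of another.
model_x_run1 = [["A", "B", None], ["B", "B", "A"]]
model_x_run2 = [["A", "B", None], ["B", "A", "A"]]
model_y_run1 = [["A", None, None], ["A", "A", "A"]]

intra_model = vote_agreement(model_x_run1, model_x_run2)
cross_model = vote_agreement(model_x_run1, model_y_run1)

THRESHOLD = 0.80  # the 80 percent mark from the text above
print(intra_model, intra_model < THRESHOLD)
print(cross_model, cross_model < THRESHOLD)
```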

Figures

Figures reproduced from arXiv: 2606.30116 by Aaron Zhao, Arduin Findeis, Eleanor Clifford, Michael Amir, Robert Mullins.

Figure 1: Preference reconstruction as a discoverer–annotator–executor stack. Each datapoint is (A_i, B_i, y_i), where A_i and B_i are the two candidate responses and y_i ∈ {A, B} denotes the preferred response. The annotator produces votes V_ij ∈ {A, B, N/A} for principle P_j on pair i (N/A = not applicable). Constitution selection and executor fitting occur using training votes and may depend on the executor class (e.g., …
Figure 2: K-means clustering of generated principles, between methods, on AlpacaEval using …
Figure 3: Coverage and accuracy of generated principles on AlpacaEval using DeepSeek v3.1 Chat.
Figure 4: Performance of 10-principle majority vote executor with only refined and only unrefined …
Figure 5: Performance of majority vote executor against number of principles in the constitution.
Figure 6: First 10 principles selected by Algorithm 2 from generation on AlpacaEval using DeepSeek v3.1 Chat.
Figure 7: First 10 principles selected by ICAI's default principle quality metric (correct votes minus incorrect votes) from generation on AlpacaEval using DeepSeek v3.1 Chat.
Figure 8: First 10 principles selected by Algorithm 2 from naive ICAI on PRISM using two different …
Original abstract

Pairwise preference data is widely used for training and evaluating language models (e.g., RLHF), but each datapoint records a choice, not the rationale behind it. Methods such as Inverse Constitutional AI (ICAI) attempt to improve interpretability by compressing datasets into short "constitutions" of natural-language principles. We argue this framing is under-specified: a flat list of principles is not yet an executable decision rule because it leaves principle composition implicit. We use the pairwise setting as a testbed to empirically characterize three open problems in constitutional methods. First, principle quality is hard to measure: coverage and accuracy are useful but incomplete proxies for end-to-end reconstruction. Second, composition is ambiguous: holding principles fixed, different executors (LLM judge versus majority vote) agree only 73% of the time. Third, constitutions differ between LLMs: cross-model vote agreement is 73%, whereas intra-model agreement is 81%. Across PRISM, AlpacaEval, and Chatbot Arena, we show that principle refinement (ICAI+) may be a first step towards ameliorating these problems: inter-executor agreement rises to 78%, and transparent executors match LLM judge accuracy (66% vs. 67%). Our results highlight that constitutions should be evaluated as constitution-executor systems, with implications for LLMs-as-a-judge broadly.
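
The per-principle quantities mentioned in the abstract can be sketched as follows. This is a hedged illustration: the definitions of coverage (share of pairs with a non-N/A vote) and accuracy (share of applicable votes that match the label) are assumptions, since this page does not define them. The "correct votes minus incorrect votes" score follows the Figure 7 caption. The function name `principle_stats` and the toy data are hypothetical.

```python
from typing import Dict, List, Optional


def principle_stats(votes: List[Optional[str]], labels: List[str]) -> Dict[str, float]:
    """Summarize one principle's votes against the recorded preferences.

    votes[i] is "A", "B", or None (N/A) for pair i; labels[i] is the preferred response.
    """
    if len(votes) != len(labels) or not labels:
        raise ValueError("votes and labels must be non-empty and of equal length")
    # Keep only the pairs on which the principle actually casts a vote.
    applicable = [(v, y) for v, y in zip(votes, labels) if v is not None]
    correct = sum(v == y for v, y in applicable)
    incorrect = len(applicable) - correct
    return {
        # Assumed definition: share of pairs with a non-N/A vote.
        "coverage": len(applicable) / len(labels),
        # Assumed definition: share of applicable votes matching the label.
        "accuracy": correct / len(applicable) if applicable else 0.0,
        # Score described in the Figure 7 caption: correct minus incorrect votes.
        "net_correct": float(correct - incorrect),
    }


toy_labels = ["A", "B", "A", "B", "A"]
toy_principle_votes = ["A", None, "B", "B", None]
stats = principle_stats(toy_principle_votes, toy_labels)
print(stats)
```

A principle can score well on each of these numbers and still contribute little once combined with others, which is why the abstract calls them incomplete proxies for end-to-end reconstruction.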

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that methods like Inverse Constitutional AI for reconstructing preferences from pairwise data are under-specified because they treat constitutions as flat lists of principles without specifying composition rules. Using pairwise preference data from PRISM, AlpacaEval, and Chatbot Arena, it empirically identifies three open problems: (1) principle quality is hard to measure via coverage/accuracy alone, (2) composition is ambiguous (different executors agree only 73% of the time), and (3) constitutions are model-specific (cross-model vote agreement 73% vs. intra-model 81%). It reports that an ICAI+ refinement raises inter-executor agreement to 78% and allows transparent executors to match LLM-judge accuracy (66% vs. 67%), concluding that constitutions must be evaluated as constitution-executor systems with implications for LLMs-as-a-judge.

Significance. If the empirical characterizations hold, the work provides a useful framing of under-specification in constitutional methods and supplies direct agreement counts that demonstrate the three problems across three datasets. The reliance on reproducible empirical counts (no fitted parameters) and the concrete ICAI+ proposal strengthen the contribution as an exploratory identification of open problems rather than a closed solution.

major comments (2)
  1. [Abstract] The central claim that pairwise data suffices as a testbed to characterize the three open problems (and thus supports evaluating constitutions as constitution-executor systems) is load-bearing for the broader implications, yet the manuscript provides no direct comparison or argument showing that the observed 73% inter-executor and cross-model gaps persist under open-ended generation, multi-criteria scoring, or non-pairwise judgments.
  2. [Abstract and results sections] The reported figures (73% inter-executor agreement, 81% intra-model, 66% vs. 67% accuracy) are presented without error bars, dataset sizes per split, or details on how data splits and executor prompts were fixed in advance; this directly affects the reliability of the evidence offered for the existence and severity of the three problems.
minor comments (1)
  1. [Abstract] The abstract states that 'transparent executors match LLM judge accuracy' but does not define the transparent executors or their implementation in sufficient detail for readers to replicate the 66%/67% comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below. We agree that additional methodological details are required and will incorporate them. On the scope of the pairwise testbed, we will revise to avoid overclaiming generality while preserving the exploratory contribution.

point-by-point responses
  1. Referee: [Abstract] The central claim that pairwise data suffices as a testbed to characterize the three open problems (and thus supports evaluating constitutions as constitution-executor systems) is load-bearing for the broader implications, yet the manuscript provides no direct comparison or argument showing that the observed 73% inter-executor and cross-model gaps persist under open-ended generation, multi-criteria scoring, or non-pairwise judgments.

    Authors: The manuscript explicitly frames pairwise preferences as a controlled testbed because this format is standard in preference modeling and RLHF pipelines. The three problems are characterized empirically within this setting, and the conclusion that constitutions should be evaluated as constitution-executor systems is drawn from the observed ambiguities in that setting. We do not provide, and do not claim to provide, direct evidence that the precise agreement gaps persist in open-ended generation or multi-criteria scoring. We will revise the abstract and discussion sections to state more precisely that the characterization applies to pairwise data and that extension to other judgment formats remains an open question. revision: partial

  2. Referee: [Abstract and results sections] The reported figures (73% inter-executor agreement, 81% intra-model, 66% vs. 67% accuracy) are presented without error bars, dataset sizes per split, or details on how data splits and executor prompts were fixed in advance; this directly affects the reliability of the evidence offered for the existence and severity of the three problems.

    Authors: We agree that the absence of error bars, per-split dataset sizes, and explicit details on prompt fixation and data splits reduces the reliability assessment of the reported percentages. In the revised version we will add standard errors (or bootstrap confidence intervals) for all agreement and accuracy figures, report the exact number of preference pairs used per dataset and per split, and include a dedicated experimental-setup subsection describing how splits were constructed and how executor prompts were written and held fixed. revision: yes
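
One way the promised bootstrap confidence intervals could be computed is sketched below: a percentile bootstrap over per-pair agreement indicators. This is an assumption about method, not a description of what the authors will do. The function name `bootstrap_ci`, the resample count, the seed, and the 100-pair toy sample are hypothetical; the toy sample is not data from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    matches: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for an agreement (or accuracy) rate.

    matches[i] is 1 if the two executors agree on pair i, else 0.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    n = len(matches)
    if n == 0:
        raise ValueError("matches must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    point = sum(matches) / n
    resampled_means = []
    for _ in range(n_resamples):
        sample = rng.choices(matches, k=n)  # resample pairs with replacement
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, resampled_means[lower_index], resampled_means[upper_index]


# Made-up sample: 73 agreements out of 100 pairs.
toy_matches = [1] * 73 + [0] * 27
point, lower, upper = bootstrap_ci(toy_matches)
print(f"agreement = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With only 100 pairs the interval spans several percentage points, which illustrates why the referee asks for dataset sizes alongside the reported percentages.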

Circularity Check

0 steps flagged

No circularity; empirical counts are independent measurements


The paper reports direct empirical agreement percentages (73% inter-executor, 81% intra-model, 78% after ICAI+) between fixed executors on extracted constitutions from pairwise data across three datasets. These are straightforward observational counts with no equations, fitted parameters, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The use of the pairwise setting is explicitly framed as a testbed for characterization rather than a self-defining premise, and the recommendation to evaluate constitution-executor systems follows from these independent measurements without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities; the work is an empirical characterization rather than a derivation. The only background assumptions are standard statistical uses of agreement percentages.

axioms (1)
  • domain assumption: Agreement percentage between two decision procedures is a meaningful proxy for composition ambiguity.
    Used to quantify the second open problem in the abstract.


