Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

Benjamin O. Ladd; Brian Borsari; Guangzeng Han; James G. Murphy; Xiaolei Huang

arxiv: 2605.12987 · v2 · pith:WUMKX3E3new · submitted 2026-05-13 · 💻 cs.CL

Pith reviewed 2026-05-20 21:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords: motivational interviewing, automatic coding, multimodal self-consistency, audio-language models, alcohol use reduction, prosody analysis, majority voting, utterance-level reasoning

The pith

Multimodal self-consistency across verbal and acoustic cues outperforms single-pass baselines for automatic coding of motivational interviewing sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an automatic coding method for motivational interviewing sessions that support alcohol use reduction by feeding raw audio into language models. Four complementary prompts examine verbal content, prosody and acoustic features, evidence strength, and contrasts between utterances, with three stochastic samples drawn from each to create twelve reasoning trajectories per utterance. Majority voting across those trajectories produces the final code. On five de-identified sessions the full method reached 52.56 percent accuracy, 54.03 percent precision, 47.45 percent recall, and 46.40 percent macro-F1, beating baseline single-pass approaches. Ablation tests that removed any one prompt type produced consistent drops on all primary metrics, indicating that the combination of what clients say and how they say it improves robustness.

Core claim

The central claim is that an audio-language model supplied with four analytic prompts (one for verbal cues, one for prosody, one for evidence scoring, and one for comparative reasoning) can generate twelve independent reasoning trajectories per utterance, and that the majority vote over those trajectories yields higher accuracy, precision, recall, and macro-F1 than any single-pass baseline on the same five recorded MI sessions. Systematically removing any one prompt module degrades every reported metric.

What carries the argument

Multimodal self-consistency obtained by majority voting over twelve stochastic reasoning trajectories produced from four complementary analytic prompts applied to raw audio input.
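The aggregation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt identifiers, the example labels, and the tie-breaking rule are all assumptions, since the source does not say how ties among twelve votes are resolved. The label set CT, ST, FN is taken from the Figure 1 text.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical identifiers for the four prompting strategies named in the paper.
PROMPTS = ("analytic", "prosody", "evidence", "comparative")
SAMPLES_PER_PROMPT = 3  # three stochastic samples per prompt, per the abstract


def majority_vote(trajectories: Dict[str, List[str]]) -> str:
    """Return the most frequent label across all trajectories.

    Assumption: ties go to the label seen first, because Counter.most_common
    keeps first-encountered order for equal counts. The paper's tie rule is
    not stated in the source.
    """
    votes = [label for name in PROMPTS for label in trajectories[name]]
    return Counter(votes).most_common(1)[0][0]


# Made-up labels for one utterance: 4 prompts x 3 samples = 12 trajectories.
example = {
    "analytic": ["CT", "CT", "FN"],
    "prosody": ["CT", "ST", "CT"],
    "evidence": ["FN", "CT", "CT"],
    "comparative": ["ST", "CT", "FN"],
}
n_trajectories = sum(len(labels) for labels in example.values())
final_label = majority_vote(example)  # CT receives 7 of the 12 votes
```

With twelve votes over three classes, exact ties such as 6-6-0 or 4-4-4 are possible, so any reimplementation needs an explicit tie rule.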

Load-bearing premise

The five de-identified recorded MI sessions are representative enough that the performance numbers and ablation results generalize beyond this small set.
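One inexpensive way to probe this premise is to report the metric separately for each session rather than pooled over all utterances. The sketch below assumes predictions are stored as (session id, gold label, predicted label) tuples; the record layout and the toy values are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (session_id, gold_label, predicted_label)


def per_session_accuracy(rows: Iterable[Record]) -> Dict[str, float]:
    """Accuracy computed independently within each session."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for session_id, gold, pred in rows:
        totals[session_id] += 1
        hits[session_id] += int(gold == pred)
    return {sid: hits[sid] / totals[sid] for sid in totals}


# Toy records only; these are not data from the study.
records = [
    ("s1", "CT", "CT"), ("s1", "ST", "FN"),
    ("s2", "FN", "FN"), ("s2", "CT", "CT"),
    ("s3", "ST", "CT"), ("s3", "FN", "CT"),
]
by_session = per_session_accuracy(records)
```

A wide spread across sessions would suggest that the pooled figure is driven by one or two recordings.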

What would settle it

Evaluating the identical pipeline on a new collection of at least twenty additional MI sessions drawn from different counselors or client groups and finding that accuracy falls below 45 percent or that the ablation degradations disappear.

Figures

Figures reproduced from arXiv: 2605.12987 by Benjamin O. Ladd, Brian Borsari, Guangzeng Han, James G. Murphy, Xiaolei Huang.

**Figure 1.** Overview of the proposed method, which applies four prompting strategies to an audio-language model and aggregates predictions through self-consistency. Adjacent text from Section 3.1 (Overview of MM-SC): "We address utterance-level motivational interviewing (MI) coding by developing an end-to-end multimodal method that predicts CT, ST, or FN directly from raw audio input".

Original abstract

BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals.

OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness.

METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories.

RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics.

CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
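For readers unfamiliar with the four reported metrics, the following sketch computes them for a three-class problem. It assumes macro averaging means the unweighted mean of per-class scores, which is a common convention; the abstract does not say which averaging the authors used for precision and recall. The labels and predictions are toy values.

```python
from typing import Dict, Sequence

LABELS = ("CT", "ST", "FN")  # label set named in the Figure 1 text


def macro_scores(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """Accuracy plus unweighted per-class averages of precision, recall, F1."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy data only; not drawn from the study.
gold = ["CT", "CT", "ST", "FN", "FN", "FN"]
pred = ["CT", "FN", "ST", "FN", "FN", "CT"]
scores = macro_scores(gold, pred)
```

Under this convention macro-F1 is an average of per-class F1 values, not the harmonic mean of macro precision and macro recall, so it can fall below both of them.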

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This applies self-consistency and four prompt types to raw MI audio for utterance coding and shows ablation drops, but everything rests on five sessions with no reported splits or reliability checks.

The paper puts together an audio-language model with four prompts—one for verbal content, one for prosody, one for evidence scoring, and one for comparisons—then draws three stochastic samples per prompt and votes on the final label for each utterance. On their five de-identified sessions this reaches 52.56% accuracy and beats the single-pass baselines they tried. The ablations that remove one prompt type at a time and watch the scores fall give some indication that the different reasoning angles are contributing rather than just adding noise.
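The leave-one-prompt-out ablation mentioned in the letter can be illustrated as re-running the vote with one prompt's samples removed. This is a sketch under assumptions: the prompt identifiers and labels are invented, and ties fall to the first-seen label, which the source does not specify.

```python
from collections import Counter
from typing import Dict, Iterable, List


def vote(trajectories: Dict[str, List[str]], keep: Iterable[str]) -> str:
    """Majority vote restricted to the prompts listed in `keep`."""
    votes = [label for name in keep for label in trajectories[name]]
    return Counter(votes).most_common(1)[0][0]


# Made-up trajectories for a single utterance (three samples per prompt).
trajectories = {
    "analytic": ["ST", "ST", "CT"],
    "prosody": ["CT", "CT", "CT"],
    "evidence": ["ST", "CT", "ST"],
    "comparative": ["CT", "ST", "FN"],
}
all_prompts = tuple(trajectories)

full = vote(trajectories, all_prompts)
ablated = {
    dropped: vote(trajectories, [p for p in all_prompts if p != dropped])
    for dropped in all_prompts
}
```

In this toy case only the removal of the prosody prompt flips the final label (from CT to ST), which is the kind of per-utterance change an ablation table summarizes in aggregate.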

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an automatic coding method for Motivational Interviewing (MI) sessions focused on alcohol use reduction. It uses audio-language models with four complementary prompts (analytic for verbal cues, prosody-aware for acoustic cues, evidence-scoring for hypothesis testing, and comparative for contrastive reasoning). Three stochastic samples per prompt yield 12 trajectories per utterance, with majority voting to produce final utterance-level predictions. On five de-identified recorded MI sessions, the multimodal self-consistency method reports 52.56% accuracy, 54.03% precision, 47.45% recall, and 46.40% macro-F1, outperforming baselines; systematic ablations removing individual modules degrade these metrics.

Significance. If the performance gains hold on larger, more diverse data, the work could reduce labor costs for MI coding and support scalable analysis of client behaviors in alcohol interventions. The systematic ablation experiments, which demonstrate consistent degradation when modules are removed, provide concrete evidence for the contribution of each reasoning component and strengthen the case for multimodal self-consistency over single-pass prompting.

major comments (2)

[Methods] Methods: All reported metrics and ablation results are derived from only five de-identified MI sessions with no session-level cross-validation, leave-one-out testing, or controls for session length, client demographics, or MI fidelity scores. Given n=5, majority voting across 12 trajectories can still yield unstable estimates; the central claim that multimodal self-consistency outperforms baselines and that ablations confirm module value rests on this limited sample.
[Results] Results: No inter-rater reliability statistics are provided for the human-coded ground-truth labels, and no statistical significance tests (e.g., McNemar or bootstrap) are reported for the performance deltas versus baselines. These omissions make it difficult to interpret the 52.56% accuracy and 46.40% macro-F1 as robust evidence rather than potentially idiosyncratic to the five sessions.
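The McNemar test suggested by the referee compares two classifiers on the same items using only the discordant pairs. Below is a minimal exact (binomial) version in plain Python; the predictions are toy values, and the test treats utterances as independent, which is itself an assumption that utterances nested within five sessions may not satisfy.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(gold: Sequence[str], pred_a: Sequence[str],
                  pred_b: Sequence[str]) -> float:
    """Two-sided exact McNemar p-value for paired correctness."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(a == g and o != g for g, a, o in zip(gold, pred_a, pred_b))
    c = sum(a != g and o == g for g, a, o in zip(gold, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree on correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data only: 5 items where only A is right, 1 where only B is right.
gold = ["CT"] * 8
pred_a = ["CT", "CT", "CT", "CT", "CT", "ST", "CT", "CT"]
pred_b = ["ST", "ST", "ST", "ST", "ST", "CT", "CT", "CT"]
p_value = mcnemar_exact(gold, pred_a, pred_b)  # 2 * (1 + 6) / 64 = 0.21875
```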

minor comments (2)

[Abstract] Abstract and Methods: The description of stochastic sampling does not specify the temperature, top-p, or other generation parameters used for the three samples per prompt, which would aid reproducibility.
[Methods] Methods: Clarify whether any of the five sessions were used for prompt engineering or few-shot example selection, or whether the evaluation is entirely zero-shot on held-out material.
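The reproducibility point about generation parameters could be met by logging one record per trajectory. The sketch below shows one possible layout; the field names, the temperature and top-p values, and the use of a per-sample seed are placeholders chosen for illustration and are not the settings used in the paper.

```python
from dataclasses import asdict, dataclass
from itertools import product


@dataclass(frozen=True)
class SamplingConfig:
    """Generation settings to report for each stochastic sample."""
    temperature: float
    top_p: float
    seed: int


# Hypothetical prompt identifiers and placeholder values (NOT from the paper).
PROMPTS = ("analytic", "prosody", "evidence", "comparative")
BASE = {"temperature": 0.7, "top_p": 0.9}

# One logged record per (prompt, sample) pair: 4 prompts x 3 samples.
plan = [
    {"prompt": prompt, **asdict(SamplingConfig(seed=seed, **BASE))}
    for prompt, seed in product(PROMPTS, range(3))
]
```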

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major concerns regarding sample size, validation procedures, inter-rater reliability, and statistical testing below. We outline the revisions we will make to strengthen the paper.

Referee: [Methods] Methods: All reported metrics and ablation results are derived from only five de-identified MI sessions with no session-level cross-validation, leave-one-out testing, or controls for session length, client demographics, or MI fidelity scores. Given n=5, majority voting across 12 trajectories can still yield unstable estimates; the central claim that multimodal self-consistency outperforms baselines and that ablations confirm module value rests on this limited sample.

Authors: We acknowledge the limitation of using only five sessions, which stems from the scarcity of publicly available, de-identified MI audio recordings with expert annotations. While we agree that larger-scale validation with cross-validation would be ideal, the current study is positioned as an initial exploration of multimodal self-consistency for this task. The systematic ablations showing consistent performance degradation provide supporting evidence for the contribution of each component, even within this small sample. In the revised manuscript, we will add a dedicated limitations section discussing the small sample size, the absence of cross-validation, and the need for future work on larger, more diverse datasets including controls for demographics and fidelity scores. We will also clarify that the results should be interpreted as preliminary.
Revision: partial

Referee: [Results] Results: No inter-rater reliability statistics are provided for the human-coded ground-truth labels, and no statistical significance tests (e.g., McNemar or bootstrap) are reported for the performance deltas versus baselines. These omissions make it difficult to interpret the 52.56% accuracy and 46.40% macro-F1 as robust evidence rather than potentially idiosyncratic to the five sessions.

Authors: We agree that reporting inter-rater reliability would enhance the credibility of the ground-truth labels. However, the annotations were provided by a single trained MI coder as part of the de-identified dataset, and we do not have access to multiple independent coders to compute such statistics. We will explicitly note this as a limitation in the revised manuscript. For statistical significance, we will implement bootstrap confidence intervals or McNemar's test for the performance differences in the revision to provide quantitative assessment of the observed improvements over baselines.
Revision: partial
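A paired percentile bootstrap of the kind the simulated response proposes might look like the sketch below. It resamples utterances with replacement and reports an interval for the accuracy difference between two systems. The function name, resample count, and toy labels are assumptions; with only five sessions, resampling whole sessions rather than individual utterances may be the more defensible variant.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(gold: Sequence[str], pred_a: Sequence[str],
                        pred_b: Sequence[str], n_resamples: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile interval for accuracy(A) - accuracy(B) on paired items."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        acc_a = sum(pred_a[i] == gold[i] for i in idx) / n
        acc_b = sum(pred_b[i] == gold[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Toy labels only; not data from the study.
gold = ["CT", "ST", "FN", "CT", "FN", "ST", "CT", "FN"]
pred_voted = ["CT", "ST", "FN", "CT", "CT", "ST", "ST", "FN"]
pred_single = ["CT", "FN", "FN", "ST", "CT", "ST", "ST", "CT"]
low, high = paired_bootstrap_ci(gold, pred_voted, pred_single)
```

An interval that excludes zero would support the claimed improvement; one that straddles zero would indicate the gap could be sampling noise.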

standing simulated objections not resolved

We cannot compute inter-rater reliability statistics for the ground-truth labels, as the dataset was annotated by only one expert coder and additional annotations are unavailable.

Circularity Check

0 steps flagged

No circularity: empirical metrics computed directly on provided sessions without self-referential fitting or derivations.

The paper reports standard accuracy, precision, recall, and macro-F1 on utterances from five de-identified MI sessions using ALM prompting and majority voting across trajectories. No equations, parameter fitting, or derivations are described that reduce to the input data by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The evaluation pipeline is self-contained as direct application of existing multimodal reasoning techniques to the MI coding task, with results presented as observed performance rather than derived predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that majority voting across prompt variants improves robustness and that the chosen prompts capture the clinically relevant verbal and acoustic cues; no new physical entities or mathematical constants are introduced.

free parameters (1)

Number of stochastic samples per prompt
Set to three to produce twelve trajectories for majority voting; value chosen by authors to balance compute and diversity.

axioms (1)

Domain assumption: majority voting across multiple reasoning trajectories increases coding robustness
Invoked in the self-consistency procedure described in METHODS.

pith-pipeline@v0.9.0 · 5820 in / 1419 out tokens · 63715 ms · 2026-05-20T21:57:05.809725+00:00 · methodology
