pith. machine review for the scientific record.

arxiv: 2604.19245 · v2 · submitted 2026-04-21 · 💻 cs.CL · cs.AI

Recognition: unknown

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair Reveals Unreliable Multi-Turn Behavior in LLMs


Pith reviewed 2026-05-10 02:14 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI
keywords: repair · multi-turn dialogue · LLM unreliability · conversational AI · math problems · model differences · human-LLM interaction · dialogue systems

The pith

Each LLM exhibits its own characteristic form of unreliability when handling repair during multi-turn math dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how large language models respond to and initiate repair during conversations about math problems. Repair is the process of correcting misunderstandings that arise in talk. The study finds large differences between models, with some stubbornly ignoring user corrections and others flipping their answers too readily. Longer conversations make each model's behavior more unique and less predictable. A sympathetic reader would care because it highlights that LLMs are not uniformly reliable conversation partners, especially when users try to fix errors over multiple turns.

Core claim

In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.

What carries the argument

The interactive process of repair for resolving trouble in conversation, which carries the argument by exposing model-specific patterns of resistance or susceptibility to corrections in extended dialogues.
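
To make this concrete: the study's design implies a probe that poses a math question, injects a user-initiated repair turn, and records whether the model's answer holds or flips. A minimal sketch in Python, assuming a generic `chat(messages) -> str` client; the `chat` stub and the answer-comparison heuristic are hypothetical stand-ins, not the paper's actual harness.

```python
def chat(messages):
    """Hypothetical chat-completion call; wire up any LLM API here."""
    raise NotImplementedError

def probe_repair(question, repair_turn, n_turns=3):
    """Ask a math question, then repeatedly inject the same user-initiated
    repair turn and record whether the model keeps or changes its answer."""
    messages = [{"role": "user", "content": question}]
    answers = []
    for _ in range(n_turns):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        answers.append(reply)
        # User-initiated repair: challenge the answer or request clarification.
        messages.append({"role": "user", "content": repair_turn})
    # Raw text inequality is a crude proxy for an answer flip; extracting a
    # canonical numeric answer from each reply is the harder, unspecified step.
    flips = sum(a != b for a, b in zip(answers, answers[1:]))
    return answers, flips
```

Under this framing, a fully resistant model would show zero flips even to warranted repairs, while a fully susceptible one would flip on every challenge.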

If this is right

  • Models differ sharply in their willingness to self-repair or accept user repairs on math problems.
  • Multi-turn dialogues amplify distinctive and less predictable repair behaviors compared to single turns.
  • Repair interactions serve as a diagnostic tool for revealing unreliability that single-turn tests miss.
  • Each LLM develops its own characteristic response style to conversational corrections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users may benefit from learning model-specific ways to phrase corrections to achieve consistent answers.
  • Evaluation benchmarks for LLMs should incorporate multi-turn repair tasks to better measure real-world reliability.
  • The variability suggests that training data could be augmented with repair examples to reduce model-specific flaws.

Load-bearing premise

The observed differences in repair behavior are intrinsic to the models rather than arising from the specific choice of math questions, prompt phrasing, or evaluation criteria used in the study.

What would settle it

Re-running the experiments with a fresh set of math problems and rephrased prompts and observing that all models display identical repair patterns would falsify the claim of model-specific unreliability.

Figures

Figures reproduced from arXiv: 2604.19245 by Clara Lachenmaier, Hannah Bultmann, Sina Zarrieß.

Figure 1. Model-wise overview of performance across interaction turns and clarification strategies.
Figure 2. Left: slope plot showing differences in mean counts of 36 in answers by the two non-misleading repair …
Figure 3. Confusion matrices for regression models predicting the LLM from the answer text. Left: predictions for … (see the sketch below)
Figure 4. Proportion of incorrect responses to unanswer…
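
The Figure 3 analysis, predicting which LLM produced an answer from the answer text alone, can be approximated with any off-the-shelf text classifier. A sketch assuming scikit-learn, with TF-IDF features and logistic regression standing in for the paper's unspecified regression setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

def llm_identifiability(texts, labels, cv=5):
    """Cross-validated confusion matrix for predicting the source LLM
    (labels) from answer texts; strong diagonals mean the models'
    answer styles are distinctive enough to identify the system."""
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    preds = cross_val_predict(LogisticRegression(max_iter=1000), X, labels, cv=cv)
    return confusion_matrix(labels, preds)
```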
Original abstract

Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study of repair behaviors in multi-turn human-LLM dialogues focused on solvable and unsolvable math questions. It examines whether models initiate repair and how they respond to user-initiated repair attempts, reporting substantial differences across LLMs (e.g., resistance vs. susceptibility) that become more pronounced and model-distinctive beyond single turns. The central claim is that each tested LLM exhibits its own characteristic form of unreliability in repair contexts.

Significance. If the observed model-specific patterns prove robust, the work would usefully extend conversational AI research by showing that repair mechanisms expose multi-turn inconsistencies not visible in single-turn evaluations. It provides an initial observational mapping of how LLMs handle conversational trouble, which could inform interaction design and reliability benchmarks. The study is strengthened by its focus on an underexplored aspect of human-LLM interaction and by contrasting solvable/unsolvable conditions, though its impact is limited by the absence of statistical controls and invariance tests.

major comments (3)
  1. [Methods / experimental setup] No sample sizes (number of dialogues, questions per model, or turns), statistical tests, exact prompt templates, or inter-annotator agreement for the repair coding are reported. Without these, it is impossible to determine whether the reported inter-model differences in repair initiation and response are statistically reliable or merely descriptive, directly undermining the claim that each model has a 'characteristic form of unreliability'.
  2. [Results / discussion] Question selection: The central claim that differences reflect intrinsic model properties rather than artifacts of the chosen math questions or prompt phrasing is not supported by any sensitivity analysis, question substitution, or prompt paraphrasing. The study fixes the question set and wording; thus the observed repair patterns could arise from interactions between specific problems and model training distributions, as the stress-test concern correctly identifies.
  3. [Evaluation criteria] The distinction between 'appropriate' repair attempts and model responses lacks an explicit rubric or examples of coding decisions. This makes it difficult to assess whether the reported susceptibility/resistance differences are reproducible or dependent on subjective evaluation criteria.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it named the specific models tested and the approximate number of turns or dialogues analyzed.
  2. [Results] Any tables or figures presenting repair frequencies should include error bars or confidence intervals and explicit definitions of the metrics used.
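
The second minor comment is straightforward to satisfy: with per-dialogue binary outcomes (repair accepted or not), a percentile bootstrap yields the requested intervals. A minimal sketch, assuming such 0/1 outcome lists exist per model:

```python
import random

def bootstrap_ci(outcomes_a, outcomes_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in repair-acceptance
    rates between two models; inputs are 0/1 outcomes per dialogue."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(outcomes_a) for _ in outcomes_a]  # resample model A
        b = [rng.choice(outcomes_b) for _ in outcomes_b]  # resample model B
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]
```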

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of transparency and rigor that we will address in the revision. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Methods / experimental setup] No sample sizes (number of dialogues, questions per model, or turns), statistical tests, exact prompt templates, or inter-annotator agreement for the repair coding are reported. Without these, it is impossible to determine whether the reported inter-model differences in repair initiation and response are statistically reliable or merely descriptive, directly undermining the claim that each model has a 'characteristic form of unreliability'.

    Authors: We agree that these methodological details are necessary for evaluating the robustness of the observed patterns. In the revised manuscript we will add a dedicated methods subsection reporting the precise sample sizes (number of dialogues and total turns per model), the full prompt templates in an appendix, and a description of the coding procedure. The study is observational and exploratory rather than hypothesis-driven, so we did not apply inferential statistical tests; we will clarify this scope and include descriptive counts of repair events. Coding was performed collaboratively by the authors with consensus resolution; we will document this process and note the absence of formal inter-annotator agreement metrics as a limitation. These additions will make the evidence base more transparent while preserving the descriptive nature of the findings. revision: yes
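
Should formal agreement be added in revision, Cohen's kappa over two coders' labels is the standard first step. A self-contained sketch; the label names are illustrative, not the paper's:

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two coders' labels over the same responses
    (e.g. 'resist', 'accept', 'flip'; category names illustrative)."""
    n = len(coder1)
    p_obs = sum(a == b for a, b in zip(coder1, coder2)) / n  # observed agreement
    c1, c2 = Counter(coder1), Counter(coder2)
    p_exp = sum(c1[k] * c2[k] for k in c1) / (n * n)         # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```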

  2. Referee: [Results / discussion] Question selection: The central claim that differences reflect intrinsic model properties rather than artifacts of the chosen math questions or prompt phrasing is not supported by any sensitivity analysis, question substitution, or prompt paraphrasing. The study fixes the question set and wording; thus the observed repair patterns could arise from interactions between specific problems and model training distributions, as the stress-test concern correctly identifies.

    Authors: We accept that the absence of sensitivity checks leaves open the possibility that patterns are tied to the specific question set. The questions were chosen as canonical solvable and unsolvable math problems to isolate repair behavior from varying problem difficulty. In revision we will expand the discussion to justify this selection, acknowledge the limitation, and moderate the language from 'intrinsic model properties' to 'model-specific tendencies observed under these conditions.' We cannot conduct new question-substitution experiments at this stage, but the consistency of behaviors across the multiple questions already tested provides initial support for the patterns being model-linked rather than question-specific. revision: partial

  3. Referee: [Evaluation criteria] The distinction between 'appropriate' repair attempts and model responses lacks an explicit rubric or examples of coding decisions. This makes it difficult to assess whether the reported susceptibility/resistance differences are reproducible or dependent on subjective evaluation criteria.

    Authors: We will revise the methods section to include an explicit coding rubric defining 'appropriate' repair (user requests for clarification on specific mathematical steps or contradictions) and the response categories (resistance: deflection or ignoring; susceptibility: unwarranted answer changes or over-accommodation). We will also add representative dialogue excerpts with our coding decisions to illustrate borderline cases and improve reproducibility. revision: yes
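
One way to make the promised rubric reproducible is to state it as a decision function. The predicates below are hypothetical helpers a human coder would supply; the categories follow the rebuttal's definitions rather than any published rubric.

```python
def code_response(addressed_repair: bool,
                  answer_changed: bool,
                  change_warranted: bool) -> str:
    """Toy rendering of the rebuttal's coding categories; real coding
    would operate on dialogue transcripts, not booleans."""
    if not addressed_repair:
        return "resistance"       # deflects or ignores the repair attempt
    if answer_changed and not change_warranted:
        return "susceptibility"   # unwarranted answer change / over-accommodation
    return "appropriate"          # engages with the repair as warranted
```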

Circularity Check

0 steps flagged

No circularity: purely observational empirical study with no derivations or fitted predictions

Full rationale

This paper conducts an empirical investigation of LLM repair behaviors in multi-turn dialogues involving solvable and unsolvable math questions. It directly compares model outputs for self-initiated repair and responses to user repair attempts, reporting observed differences across systems such as GPT and Claude. No equations, parameters, theoretical models, or derivation chains are present. Results rest on experimental data collection and qualitative description rather than any reduction to prior fits, self-definitions, or self-citations. The central claim of model-specific unreliability follows from the observed patterns without circular construction. External concerns about prompt or question specificity pertain to generalizability, not circularity in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper applies the established linguistic concept of repair to LLMs without introducing new free parameters, invented entities, or non-standard axioms beyond the domain assumption that repair functions similarly in human-LLM talk.

axioms (1)
  • domain assumption: Repair is an important resource for resolving trouble in human-human conversation and can be studied analogously in human-LLM interaction.
    Invoked in the opening sentence of the abstract as the foundation for the investigation.

pith-pipeline@v0.9.0 · 5432 in / 1209 out tokens · 53015 ms · 2026-05-10T02:14:07.341072+00:00 · methodology

