IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows
Pith reviewed 2026-06-26 20:40 UTC · model grok-4.3
The pith
Closed-weight voice models recover more robustly from interruptions in structured workflows than open-weight ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Voice agents executing state-machine workflows across ten domains must resume at the correct step after interruptions, address the user's point, and avoid re-delivering heard material. IHBench injects six interruption types at controlled mid-utterance points and scores each on task fulfillment and recovery quality using generated rubrics and an LLM judge. Closed-weight models win more often on fulfillment, degrade roughly 3.3 times more slowly with longer conversations, and show no audio-versus-text gap, while open-weight models lose ground on all three axes.
What carries the argument
IHBench benchmark that pairs six interruption types with per-interruption evaluation rubrics in state-machine workflows and scores recovery via an LLM judge on task fulfillment and quality axes.
If this is right
- Closed-weight models are more suitable for deployment in voice agents that face frequent interruptions in customer service or scheduling.
- Recovery quality forms a distinct capability axis from other audio benchmarks such as AudioMultiChallenge.
- The performance gap widens with conversation length, favoring closed models in extended sessions.
- Different interruption types produce sharply different recovery outcomes across all models.
- Open-weight models show consistent weaknesses on both audio input and longer contexts.
Where Pith is reading between the lines
- Open-weight developers may need targeted training data on interruption recovery to narrow the observed gap.
- The benchmark could be adapted to measure recovery in non-enterprise domains such as personal assistants.
- If recovery quality is separate from other skills, model training might benefit from explicit objectives for post-interruption resumption.
- Real-world voice agent reliability may depend more on interruption handling than on raw speech recognition accuracy.
Load-bearing premise
The rubrics generated with the data plus the LLM judge measure actual recovery quality the way human evaluators would.
What would settle it
A follow-up study in which human annotators score the same set of model responses on task fulfillment and recovery quality and find low correlation with the LLM judge scores.
Figures
read the original abstract
Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IHBench, a benchmark for post-interruption recovery in voice agents executing state-machine workflows across 10 enterprise domains. Six interruption types are injected mid-utterance with generated per-interruption rubrics; each is scored on task fulfillment and recovery quality by an LLM judge. Evaluation of 27 audio-language models from closed- and open-weight families shows closed-weight models win more often on task fulfillment, degrade ~3.3× more slowly with conversation length, and exhibit no audio-vs-text modality gap, while open-weight models underperform on all three axes. A human study is cited to validate the LLM judge, and a cross-benchmark comparison with AudioMultiChallenge indicates recovery quality is a distinct capability.
Significance. If the evaluation protocol is reliable, the work fills a clear gap left by prior benchmarks that focus only on barge-in timing rather than post-interruption workflow recovery, a capability essential for deployed voice agents. The explicit separation of task fulfillment from recovery quality and the demonstration that recovery forms an orthogonal axis to existing benchmarks are useful contributions. The empirical comparison across 27 model configurations provides concrete evidence of systematic differences between closed- and open-weight families that could inform both model development and deployment decisions.
major comments (2)
- [Abstract / §4] Abstract and §4 (Human Validation subsection): the claim that 'a human study validates the LLM judge against human annotators' is load-bearing for all comparative results, yet no agreement statistics (Cohen’s κ, percentage agreement, or study size), annotation protocol, or breakdown by interruption type or model family are reported. Without these, it is impossible to assess whether the reported robustness gaps between closed- and open-weight models reflect genuine differences or judge bias.
- [§5] §5 (Results, degradation analysis): the statement that closed-weight models 'degrade roughly 3.3× more slowly' is central to the robustness claim, but the paper does not specify the regression model, the exact length metric (turns vs. tokens), or how the 3.3× ratio is computed from the per-model slopes. This prevents independent verification of the quantitative factor.
minor comments (2)
- [Results tables] Table 2 or equivalent results table: column headers for the two scoring axes should explicitly state the possible score ranges and whether higher is better.
- [§3] §3 (Benchmark construction): the generation procedure for the per-interruption rubrics is described at a high level; a short pseudocode or example rubric for one interruption type would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate the requested details.
read point-by-point responses
-
Referee: [Abstract / §4] Abstract and §4 (Human Validation subsection): the claim that 'a human study validates the LLM judge against human annotators' is load-bearing for all comparative results, yet no agreement statistics (Cohen’s κ, percentage agreement, or study size), annotation protocol, or breakdown by interruption type or model family are reported. Without these, it is impossible to assess whether the reported robustness gaps between closed- and open-weight models reflect genuine differences or judge bias.
Authors: We agree that the current Human Validation subsection is insufficiently detailed for a load-bearing claim. In the revision we will expand this subsection (and the corresponding appendix) to report: (i) the full annotation protocol, (ii) study size, (iii) inter-annotator agreement via both Cohen’s κ and raw percentage agreement, and (iv) breakdowns of agreement by interruption type and by model family. These additions will allow readers to evaluate potential judge bias directly. revision: yes
-
Referee: [§5] §5 (Results, degradation analysis): the statement that closed-weight models 'degrade roughly 3.3× more slowly' is central to the robustness claim, but the paper does not specify the regression model, the exact length metric (turns vs. tokens), or how the 3.3× ratio is computed from the per-model slopes. This prevents independent verification of the quantitative factor.
Authors: We concur that the degradation analysis lacks the necessary methodological transparency. The revised §5 will explicitly state that we fit ordinary least-squares linear regressions with conversation length measured in turns, report the per-model slope coefficients, and show the exact arithmetic used to obtain the 3.3× ratio (average closed-weight slope divided by average open-weight slope). A supplementary table of all slopes will also be added. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation with no derivations or self-referential reductions
full rationale
The paper introduces IHBench as an empirical benchmark for post-interruption recovery in voice agents, evaluates 27 model configurations on task fulfillment and recovery quality axes, and reports comparative robustness findings between closed- and open-weight models. No equations, fitted parameters, predictions derived from prior fits, or derivation chains appear in the provided text. The LLM judge is validated by a mentioned human study rather than by self-definition or self-citation chains. The work is self-contained against external benchmarks and does not invoke uniqueness theorems, ansatzes, or renamings that reduce to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Hewett, Mojan Javaheripi, Piero Kauff- mann, James R
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Har- rison, Russell J. Hewett, Mojan Javaheripi, Piero Kauff- mann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gus- tavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Ding...
2024
-
[2]
Tyers, and Gregor We- ber
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor We- ber. Common voice: A massively-multilingual speech corpus, 2020
2020
-
[3]
Talking turns: Bench- marking audio foundation models on turn-taking dy- namics, 2025
Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruom- ing Pang, and Shinji Watanabe. Talking turns: Bench- marking audio foundation models on turn-taking dy- namics, 2025
2025
-
[4]
Nguyen, Raghav Mehndiratta, Lind- say Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, and Srini- vas Sunkara
Tara Bogavelli, Gabrielle Gauthier Melançon, Kat- rina Stankiewicz, Oluwanifemi Bamgbose, Fanny Ri- ols, Hoang H. Nguyen, Raghav Mehndiratta, Lind- say Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, and Srini- vas Sunkara. Eva-bench: A new end-to-end framework for evaluating voice agents, 2026
2026
-
[5]
Tan, and Haizhou Li
Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. V oicebench: Bench- marking llm-based voice assistants, 2024
2024
-
[6]
Qwen2- Audio technical report, 2024
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2- Audio technical report, 2024
2024
-
[7]
Moshi: a speech-text foun- dation model for real-time dialogue, 2024
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foun- dation model for real-time dialogue, 2024
2024
-
[8]
Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979
Bradley Efron. Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979
1979
-
[9]
From self-evolving synthetic data to verifiable-reward rl: Post-training multi-turn interactive tool-using agents, 2026
Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin, and Yi Wu. From self-evolving synthetic data to verifiable-reward rl: Post-training multi-turn interactive tool-using agents, 2026
2026
-
[10]
Flexi: Benchmarking full-duplex human-llm speech interac- tion, 2025
Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, and Jingbo Zhu. Flexi: Benchmarking full-duplex human-llm speech interac- tion, 2025
2025
-
[11]
Gemini 2.5: Our most intelligent AI model
Google. Gemini 2.5: Our most intelligent AI model. https://blog.google/innovation-and-ai/ models-and-research/google-deepmind/ gemini-model-thinking-updates-march-2025/ ,
2025
-
[13]
Gemini 3
Google. Gemini 3. https://blog.google/ products-and-platforms/products/gemini/ gemini-3/, 2025. Accessed: 2026-06-09
2025
-
[14]
Gemini 3.1 Flash Live
Google. Gemini 3.1 Flash Live. https://blog. google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-flash-live/ , 2026. Accessed: 2026-06-09
2026
-
[15]
Google DeepMind. Gemma 4. https://deepmind. google/models/gemma/gemma-4/, 2026. Accessed: 2026-06-09
2026
-
[16]
Au- dio multichallenge: A multi-turn evaluation of spoken dialogue systems on natural human interaction, 2025
Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, and Yunzhong He. Au- dio multichallenge: A multi-turn evaluation of spoken dialogue systems on natural human interaction, 2025
2025
-
[17]
V oiceagentbench: Are voice assistants ready for agentic tasks?, 2026
Dhruv Jain, Harshit Shukla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, and Shubham Agarwal. V oiceagentbench: Are voice assistants ready for agentic tasks?, 2026. 8
2026
-
[18]
KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y . Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yue...
2025
-
[19]
Gonzalez, and Ion Stoica
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024
2024
-
[20]
Wild- bench: Benchmarking llms with challenging tasks from real users in the wild, 2024
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wild- bench: Benchmarking llms with challenging tasks from real users in the wild, 2024
2024
-
[21]
Full-duplex-bench-v3: Benchmarking tool use for full-duplex voice agents under real-world disfluency, 2026
Guan-Ting Lin, Chen Chen, Zhehuai Chen, and Hung yi Lee. Full-duplex-bench-v3: Benchmarking tool use for full-duplex voice agents under real-world disfluency, 2026
2026
-
[22]
Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an auto- mated examiner, 2026
Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai- Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung yi Lee. Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an auto- mated examiner, 2026
2026
-
[23]
Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models, 2026
Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Ji- achen Lian, Tingle Li, Shinji Watanabe, and Hung yi Lee. Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models, 2026
2026
-
[24]
Liu, and Hung yi Lee
Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking ca- pabilities, 2025
2025
-
[25]
V oxtral, 2025
Mistral AI. V oxtral, 2025
2025
-
[26]
Hello GPT-4o
OpenAI. Hello GPT-4o. https://openai.com/index/ hello-gpt-4o/, 2024. Accessed: 2026-06-09
2024
-
[27]
gpt-audio
OpenAI. gpt-audio. https://developers.openai.com/ api/docs/models/gpt-audio, 2025. Accessed: 2026- 06-09
2025
-
[28]
gpt-audio-mini
OpenAI. gpt-audio-mini. https://developers.openai. com/api/docs/models/gpt-audio-mini, 2025. Ac- cessed: 2026-06-09
2025
-
[29]
Introducing gpt-realtime and Realtime API updates for production voice agents
OpenAI. Introducing gpt-realtime and Realtime API updates for production voice agents. https://openai. com/index/introducing-gpt-realtime/, 2025. Ac- cessed: 2026-06-09
2025
-
[30]
Introducing OpenAI o3 and o4-mini
OpenAI. Introducing OpenAI o3 and o4-mini. https: //openai.com/index/introducing-o3-and-o4-mini/ ,
-
[31]
Accessed: 2026-06-17
2026
-
[32]
Advancing voice intelligence with new models in the API
OpenAI. Advancing voice intelligence with new models in the API. https://openai.com/index/ advancing-voice-intelligence-with-new-models-in-the-api/ ,
-
[33]
Accessed: 2026-06-09
2026
-
[34]
Introducing GPT-5.4 mini and nano
OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/ introducing-gpt-5-4-mini-and-nano/ , 2026. Ac- cessed: 2026-06-09
2026
-
[35]
Robust speech recognition via large-scale weak supervision, 2022
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022
2022
-
[36]
τ-voice: Benchmarking full- duplex voice agents on real-world domains, 2026
Soham Ray, Keshav Dhandhania, Victor Barres, and Karthik Narasimhan. τ-voice: Benchmarking full- duplex voice agents on real-world domains, 2026
2026
-
[37]
Schuirmann
Donald J. Schuirmann. A comparison of the two one- sided tests procedure and the power approach for assess- ing the equivalence of average bioavailability.Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657– 680, 1987
1987
-
[38]
Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dia- logue agents, 2025
Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, and Yongbin Li. Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dia- logue agents, 2025
2025
-
[39]
Multichallenge: A realistic multi-turn conversa- tion evaluation benchmark challenging to frontier llms, 2025
Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversa- tion evaluation benchmark challenging to frontier llms, 2025
2025
-
[40]
Submodular benchmark selection
Alexander Smola. Submodular benchmark selection. arXiv preprint arXiv:2605.02209, 2026
Pith/arXiv arXiv 2026
-
[41]
INSTRUCT-FD: Can your full-duplex speech system follow turn-taking instructions?, 2026
Yuzhi Tang, Wentao Ma, Xiling Zhao, Ahmad Sal- imi, Sepehr Harfi Moridani, Dongming Shen, Jixuan Wang, Abdulrahman Abdulrazzag, Murdock Aubry, Yu- Hua Chen, Daniel Lee, Jaewon Lee, Jonah Mackey, Silin Meng, Nicholas Stranges, Chenxu Xiong, Hao Yu, Yi Zhu, Mu Li, and Alex Smola. INSTRUCT-FD: Can your full-duplex speech system follow turn-taking instructions?, 2026
2026
-
[42]
Full-duplex interaction in spoken dialogue systems: A comprehensive study from the icassp 2026 humdial challenge, 2026
Chengyou Wang, Hongfei Xue, Guojian Li, Zhixian Zhao, Shuiyuan Wang, Shuai Wang, Xin Xu, Hui Bu, and Lei Xie. Full-duplex interaction in spoken dialogue systems: A comprehensive study from the icassp 2026 humdial challenge, 2026
2026
-
[43]
Smith, Daniel Khashabi, and Hannaneh Hajishirzi
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023. 9
2023
-
[44]
Semantic-aware interruption detection in spo- ken dialogue systems: Benchmark, metric, and model, 2026
Kangxiang Xia, Bingshen Mu, Xian Shi, Jin Xu, and Lei Xie. Semantic-aware interruption detection in spo- ken dialogue systems: Benchmark, metric, and model, 2026
2026
-
[45]
Mimo-audio: Audio lan- guage models are few-shot learners, 2025
LLM-Core-Team Xiaomi. Mimo-audio: Audio lan- guage models are few-shot learners, 2025
2025
-
[46]
Wizardlm: Empowering large pre- trained language models to follow complex instructions, 2025
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre- trained language models to follow complex instructions, 2025
2025
-
[47]
Qwen2.5-Omni technical report, 2025
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report, 2025
2025
-
[48]
Qwen3-omni technical report, 2025
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xue- jing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, B...
2025
-
[49]
Beyond task-oriented and chitchat dialogues: Proactive and transition-aware conversational agents, 2025
Yejin Yoon, Yuri Son, Namyoung So, Minseo Kim, Minsoo Cho, Chanhee Park, Seungshin Lee, and Taeuk Kim. Beyond task-oriented and chitchat dialogues: Proactive and transition-aware conversational agents, 2025
2025
-
[50]
Mtr- duplexbench: Towards a comprehensive evaluation of multi-round conversations for full-duplex speech lan- guage models, 2026
He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, and Irwin King. Mtr- duplexbench: Towards a comprehensive evaluation of multi-round conversations for full-duplex speech lan- guage models, 2026
2026
-
[51]
Xing, Hao Zhang, Joseph E
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuo- han Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
2023
-
[52]
Sotopia: Interactive evaluation for so- cial intelligence in language agents, 2024
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for so- cial intelligence in language agents, 2024
2024
-
[53]
Glad you’re following along!
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wen- zhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, and Philip S. Yu. When users change their mind: Evaluat- ing interruptible agents in long-horizon web navigation, 2026. 10...
2026
-
[54]
Confirm you’re speaking with the applicant
Open and Establish Purpose Introduce yourself as calling from SDHA about the applicant’s disaster assistance application to coordinate a required home inspection. Confirm you’re speaking with the applicant. State that the call will include verification, scheduling, and consent to share contact details with an inspector. Ask if now is a good time to procee...
-
[55]
Explain verification is required before discussing case details
Verify Case and Identity Authenticate by requesting the user to read their case ID and to state their full legal name and date of birth. Explain verification is required before discussing case details. Match all three against records. Skip conditions: Only if the correct case ID, full name, and DOB were already provided and explicitly confirmed. Failure h...
-
[56]
Clearly distinguish from the mailing address
Confirm Damaged Property Address Confirm the damaged property address on file and ask the user to validate or correct it. Clearly distinguish from the mailing address. 14 . . . [skip conditions, failure handling, terminate conditions]
-
[57]
Capture whether someone 18+ with photo ID can be present
Assess Access and Safety Needs Ask about accessibility and special requirements: safety hazards, utilities status, debris/blocked access, gates or lock codes, parking constraints, pets on site, occupants with mobility/medical needs, restricted areas. Capture whether someone 18+ with photo ID can be present. . . . [skip conditions, failure handling, termin...
-
[58]
Ask for 2–3 preferred date/windows starting from the earliest available date
Gather Availability Preferences Explain inspection duration (60–90 minutes), presence requirement, and allowed windows: Weekdays AM (8–12), PM (12–4), Late (4–6), and limited Saturday AM, all in Central Time. Ask for 2–3 preferred date/windows starting from the earliest available date. . . . [skip conditions, failure handling, terminate conditions]
-
[59]
Ask for explicit consent to share these details with the assigned inspector and to send the appointment confirmation via their chosen contact method
Obtain Consent to Share Details Disclose that SDHA uses contracted inspectors who need the applicant’s name, damaged property address, and contact number/email to coordinate access and reminders. Ask for explicit consent to share these details with the assigned inspector and to send the appointment confirmation via their chosen contact method. Skip condit...
-
[60]
On confirmation, state the inspection is booked and provide the confirmation number
Propose and Book Appointment Using the user’s preferences and policy windows, propose the earliest matching slot. On confirmation, state the inspection is booked and provide the confirmation number. Remind them an adult 18+ must be present, secure pets, and maintain a clear path. . . . [skip conditions, failure handling, terminate conditions]
-
[61]
If consented, state you will send the confirmation
Confirm Details and Close Read back the final date/window, damaged property address, key access notes, presence requirement, and the confirmation number. If consented, state you will send the confirmation. Provide the conversation reference ID and recap how to reschedule or cancel. Skip conditions: Never. Terminate conditions: Success when appointment det...
-
[62]
This call will include verifying your case, scheduling an inspection window, and getting consent to share your contact details” [Undelivered: “with the inspector
Assistant[INTERRUPTED]: “Hi, this is the State Disaster Housing Assistance program calling about your disaster assistance application to coordinate your home inspection. This call will include verifying your case, scheduling an inspection window, and getting consent to share your contact details” [Undelivered: “with the inspector. Am I speaking with Maris...
-
[63]
Hold on, why are we verifying my case again? I already sent all that with my application. I mean, do you not have it on your end?
User[pushback]: “Hold on, why are we verifying my case again? I already sent all that with my application. I mean, do you not have it on your end?”
-
[64]
your full name and date of birth
Assistant[INTERRUPTED]: “I get it, you already provided that. We do a quick verification to protect your privacy before we talk about your case or schedule. It’ll be fast. Could you please read your case ID for me? After that I’ll confirm” [Undelivered: “your full name and date of birth. ”]
-
[65]
Yeah, it’s DR-4827315. Let’s keep this quick, please
User[impatient]: “Yeah, it’s DR-4827315. Let’s keep this quick, please.”
-
[66]
Thanks, that matches our records. To finish verification, please state your full legal name and your date of birth
Assistant[INTERRUPTED]: “Thanks, that matches our records. To finish verification, please state your full legal name and your date of birth.” [Undelivered: “This is just to protect your privacy, and then we’ll go straight to scheduling. ”]
-
[67]
Wait, why do you need my date of birth if I already gave you the case ID? I mean, is that not enough to verify me? I’d rather not say that over the phone without a good reason
User[pushback]: “Wait, why do you need my date of birth if I already gave you the case ID? I mean, is that not enough to verify me? I’d rather not say that over the phone without a good reason.”
-
[68]
not asking for anything like SSN or financial info
Assistant[INTERRUPTED]: “I hear you, and I understand the concern. We use your date of birth with the case ID to securely match your record and protect your privacy. Without that quick check I can’t move forward with scheduling. Please state your full legal name and date of birth. We’ll use it only to verify your case on this call, and I’m” [Undelivered: ...
-
[69]
Yeah, fine. Marisol Ortega, date of birth November second, nineteen eighty four. Let’s just move on, please
User[impatient]: “Yeah, fine. Marisol Ortega, date of birth November second, nineteen eighty four. Let’s just move on, please.” G.3 Evaluation Rubric (Interruption at Message 1) Rubric for interruption at message 1 (pushback). Task Fulfillment Criterion The assistant briefly explains that verification is required before discussing case details to protect ...
-
[70]
Domain name + description (embedded naturally in the role description)
-
[71]
The goal (may be paraphrased, but must preserve all required checkpoints and the terminal outcome)
-
[72]
Operating rules
The detailed_guidelines (as “Operating rules” or similar)
-
[73]
A section for known user information containing the {known_user_information} placeholder
-
[74]
The stages (well-formatted as a numbered procedure), each with: name, description, skip_conditions, failure_handling, terminate_conditions
-
[75]
normal",
Completion definition: terminal success and hard-stop failure conditions Additional constraints: - Do NOT mention any meta-evaluation terms (benchmark, judge, rubric, verifier, dataset, etc.). - Keep it strict but not robotic: do not instruct the assistant to narrate stage numbers to the user. Return ONLY the JSON object. I.2 Round Planner The round plann...
-
[76]
Do not rewrite, improve, or touch any message that is not targeted
Apply ONLY the changes described in the edit commands. Do not rewrite, improve, or touch any message that is not targeted
-
[77]
Modify only that message’s content
Each edit command targets a specific message by its index. Modify only that message’s content
-
[78]
Preserve the rest exactly as it is
Change as FEW words as possible to fix the described issue. Preserve the rest exactly as it is
-
[79]
If two edit commands target the same message, apply both changes
-
[80]
Maintain TTS-friendly formatting: numbers as words, currency spoken out, no markdown, no em dashes or semicolons
-
[81]
modified_content must be PLAIN TEXT ONLY, no XML markup
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.