Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
Pith reviewed 2026-05-24 21:26 UTC · model grok-4.3
The pith
In an oracle setting, one well-chosen clarifying question improves retrieval P@1 by over 170% in open-domain conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper formulates the task of asking clarifying questions in open-domain information-seeking conversations. It releases the Qulac dataset built on TREC Web Track 2009-2012 topics and shows via an oracle model that one well-chosen clarifying question produces over 170% relative gain in P@1. The authors further present a retrieval framework whose question-selection component conditions on both the initial query and previous question-answer turns, yielding statistically significant gains over competitive baselines.
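To make the arithmetic behind the headline figure concrete, the following is a minimal sketch of P@1 and relative gain. The function names and the numeric values are illustrative assumptions and are not taken from the paper or from Qulac.

```python
def precision_at_1(ranked_doc_ids, relevant_doc_ids):
    """Return 1.0 if the top-ranked document is relevant, else 0.0."""
    if not ranked_doc_ids:
        return 0.0
    return 1.0 if ranked_doc_ids[0] in relevant_doc_ids else 0.0


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical mean P@1 values, chosen only to illustrate the formula.
baseline_mean_p1 = 0.10
oracle_mean_p1 = 0.28
gain = relative_gain(baseline_mean_p1, oracle_mean_p1)
print(f"{gain:.0f}% relative gain")
```

Read this way, a relative gain above 170% means the mean P@1 with the clarifying question is more than 2.7 times the mean P@1 without it.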
What carries the argument
The question selection model that scores candidate clarifying questions using the original query together with the history of prior question-answer exchanges.
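The idea of scoring candidates against both the query and the conversation history can be sketched with a toy scorer. This is an assumption-laden illustration only: the paper's selection model is learned, whereas the sketch below swaps in plain token overlap, and every name in it (`tokenize`, `score_question`, `select_next_question`) is hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace, dropping question marks."""
    return set(text.lower().replace("?", " ").split())


def score_question(candidate, query, history):
    """Toy score: share of candidate tokens found in the query or prior answers."""
    candidate_tokens = tokenize(candidate)
    context = tokenize(query)
    for _question, answer in history:
        context |= tokenize(answer)
    return len(candidate_tokens & context) / max(len(candidate_tokens), 1)


def select_next_question(candidates, query, history):
    """Pick the best-scoring candidate that has not been asked yet."""
    asked = {question for question, _answer in history}
    remaining = [c for c in candidates if c not in asked]
    if not remaining:
        return None
    return max(remaining, key=lambda c: score_question(c, query, history))


# Hypothetical conversation, not drawn from Qulac.
query = "dinosaurs"
candidates = [
    "Are you looking for dinosaurs pictures?",
    "Do you want dinosaurs games for kids?",
    "Would you like fossils museum locations?",
]
history = [(candidates[0], "no I want games for my kids")]
print(select_next_question(candidates, query, history))
```

With an empty history the scorer falls back to the query alone; once an answer is available, its tokens shift the choice toward the question about games, which is the conditioning behaviour the section describes.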
If this is right
- Conversational systems limited to one result per turn gain substantial accuracy by asking even a single clarifying question.
- Question selection improves when the model explicitly conditions on both the initial query and accumulated conversation history.
- The Qulac dataset supplies an offline testbed that enables repeatable comparison of clarifying-question strategies.
- Releasing the dataset and evaluation methodology supports community progress on the formulated task.
Where Pith is reading between the lines
- The reported gains assume users will answer the system's questions; real deployments must handle non-responses or off-topic replies.
- Extending the framework to generate rather than retrieve questions could increase coverage beyond the collected facets.
- The 170% figure is an oracle upper bound; practical systems will need robust question-ranking methods to approach it.
- The same selection logic could be tested on multi-turn clarification sequences rather than single questions.
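The point that the oracle figure is an upper bound can be illustrated with a short sketch. It assumes a table of P@1 values obtained after asking each candidate question for each facet; the table, the facet and question labels, and the fixed-question comparison policy are all hypothetical.

```python
from statistics import mean

# Hypothetical P@1 after asking each candidate question, per facet (not Qulac data).
p1_after_question = {
    "facet_a": {"q1": 0.0, "q2": 1.0, "q3": 0.0},
    "facet_b": {"q1": 1.0, "q2": 0.0, "q3": 0.0},
    "facet_c": {"q1": 0.0, "q2": 0.0, "q3": 0.0},
}


def oracle_question(scores):
    """Pick the question with the highest resulting P@1.

    This peeks at relevance judgments, so it is an upper bound,
    not a policy a deployed system could follow.
    """
    return max(scores, key=scores.get)


# Oracle: best question per facet. Fixed policy: always ask q1.
oracle_p1 = mean(s[oracle_question(s)] for s in p1_after_question.values())
fixed_p1 = mean(s["q1"] for s in p1_after_question.values())
print(f"oracle mean P@1 = {oracle_p1:.3f}, always-q1 mean P@1 = {fixed_p1:.3f}")
```

The gap between the two means is the headroom that a practical question-ranking method would need to close.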
Load-bearing premise
The crowdsourced questions and answers in Qulac accurately capture real-world user clarifying needs and interactions in open-domain conversations.
What would settle it
A live user study comparing task-completion rates and satisfaction when a system asks questions chosen by the proposed model versus a no-question baseline on the same TREC topics.
Original abstract
Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates the task of asking clarifying questions in open-domain information-seeking conversational systems. It introduces an offline evaluation methodology and releases the Qulac dataset of over 10K crowdsourced question-answer pairs built on 198 TREC Web Track 2009-2012 topics with 762 facets. An oracle model that selects one good question reports over 170% improvement in P@1 retrieval performance. The authors further propose a three-component framework (question retrieval, question selection accounting for prior interactions, and document retrieval) whose question selection component outperforms competitive baselines.
Significance. If the results hold, the work would be significant for highlighting the potential value of proactive clarification in conversational IR and for releasing a public dataset that can support further research. The oracle result quantifies a large potential upside, and the public Qulac release is a clear strength for reproducibility.
major comments (2)
- [Abstract / experimental section] Abstract and experimental results: the central oracle claim of >170% P@1 improvement is presented without error bars, statistical significance tests, details on data exclusion criteria, or variance across topics/facets. This directly affects assessment of whether the reported gain reliably supports the 'potential impact' conclusion.
- [Dataset section] Dataset construction (Qulac): the crowdsourcing protocol on predefined TREC facets is described, but no validation against real user logs or naturally occurring clarifying questions is provided. This assumption is load-bearing for interpreting the oracle gains as indicative of practical value in open-domain conversations.
minor comments (1)
- [Model section] The description of the question selection model could clarify how previous QA pairs are encoded and whether the model is trained end-to-end or in stages.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our paper. We are pleased that the significance of the work and the release of the Qulac dataset are recognized. We address the major comments point-by-point below, and will incorporate revisions as indicated.
Point-by-point responses
- Referee: [Abstract / experimental section] Abstract and experimental results: the central oracle claim of >170% P@1 improvement is presented without error bars, statistical significance tests, details on data exclusion criteria, or variance across topics/facets. This directly affects assessment of whether the reported gain reliably supports the 'potential impact' conclusion.
  Authors: We agree that additional statistical details would improve the robustness of the oracle claim. The reported 170% improvement is the average relative gain in P@1 when using an oracle to select one clarifying question versus no question, computed over the entire Qulac dataset derived from 198 TREC topics. Data exclusion was limited to the TREC Web Track 2009-2012 topics that have multiple facets. In the revised version, we will report the standard deviation of the improvement across topics, include error bars in the relevant figure or table, and perform a paired statistical significance test (e.g., Wilcoxon signed-rank test) to confirm the gain is reliable. This addresses the concern about variance and supports the potential impact conclusion more rigorously.
  Revision: yes
- Referee: [Dataset section] Dataset construction (Qulac): the crowdsourcing protocol on predefined TREC facets is described, but no validation against real user logs or naturally occurring clarifying questions is provided. This assumption is load-bearing for interpreting the oracle gains as indicative of practical value in open-domain conversations.
  Authors: We recognize that direct validation against real user logs would provide stronger evidence for practical applicability. The Qulac dataset leverages TREC topics and facets, which were created to represent diverse user interpretations of ambiguous queries, and the crowdsourcing process generates questions that help distinguish between these facets. This setup allows for controlled offline evaluation of the task. We do not have access to real conversational logs for validation in this work. In the revision, we will expand the discussion section to explicitly note this as a limitation and explain why TREC-based facets serve as a reasonable proxy for studying clarifying questions in open-domain settings. We believe this maintains the value of the dataset as a benchmark while being transparent about its construction.
  Revision: partial
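The statistics promised in the first response (spread of the gain across topics and a paired Wilcoxon signed-rank test) could look roughly like the sketch below. It is a standard-library illustration under stated assumptions: the per-topic P@1 values are invented, the helper names are hypothetical, and the exact p-value is computed by enumerating all sign assignments, which is only feasible for a handful of topics. For a collection the size of Qulac, a normal approximation or a library routine such as `scipy.stats.wilcoxon` would be the practical choice.

```python
from itertools import product
from statistics import mean, stdev


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(baseline, treated):
    """Return (W, exact two-sided p-value) for paired samples."""
    diffs = [t - b for b, t in zip(baseline, treated) if t != b]  # drop zero differences
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Exact null distribution: every rank is equally likely to carry either sign.
    hits = 0
    for signs in product((False, True), repeat=len(ranks)):
        w = sum(r for keep, r in zip(signs, ranks) if keep)
        if min(w, total - w) <= observed:
            hits += 1
    return observed, hits / 2 ** len(ranks)


# Hypothetical per-topic P@1 without and with one oracle-selected question.
baseline_p1 = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
oracle_p1 = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
gains = [o - b for b, o in zip(baseline_p1, oracle_p1)]
w_stat, p_value = wilcoxon_signed_rank(baseline_p1, oracle_p1)
print(f"mean gain = {mean(gains):.2f}, std = {stdev(gains):.2f}, W = {w_stat}, p = {p_value}")
```

In this toy data only four topics change, so the smallest two-sided p-value the test can produce is 2/16 = 0.125, even though every change is an improvement. That is one reason reporting the number of affected topics alongside the test is useful.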
Circularity Check
No circularity: empirical results from new crowdsourced dataset and external baselines
The paper's core claims rest on collecting a new dataset (Qulac) via crowdsourcing over TREC Web topics/facets, then running an oracle experiment and a three-component retrieval framework that is compared to competitive baselines. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. The 170% P@1 gain is an observed experimental outcome on the collected data, not a self-defined or self-cited tautology. Self-citations, if any, are not load-bearing for the central empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard IR metrics such as P@1 are suitable for evaluating the impact of clarifying questions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: oracle model ... over 170% retrieval performance improvement in terms of P@1
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.