
arxiv: 1907.06554 · v1 · pith:2HQFJ24Knew · submitted 2019-07-15 · 💻 cs.CL · cs.AI · cs.IR

Asking Clarifying Questions in Open-Domain Information-Seeking Conversations

Pith reviewed 2026-05-24 21:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords clarifying questions · open-domain conversations · information-seeking · conversational search · Qulac dataset · question selection · retrieval performance

The pith

In an oracle setting, one well-chosen clarifying question improves retrieval P@1 by over 170% in open-domain conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Users often struggle to state complex needs in one query, which forces them to scan results or reformulate. The paper shows that open-domain conversational systems can instead ask clarifying questions to resolve ambiguity before retrieving documents. The authors build the Qulac dataset of more than 10,000 crowdsourced question-answer pairs over 198 TREC topics and 762 facets. An oracle experiment shows that asking one well-chosen question raises P@1 by more than 170% in relative terms. A three-component retrieval framework, whose question selection step conditions on the original query and prior question-answer turns, significantly outperforms competitive baselines.

Core claim

The paper formulates the task of asking clarifying questions in open-domain information-seeking conversations. It releases the Qulac dataset built on TREC Web Track 2009-2012 topics and shows via an oracle model that one well-chosen clarifying question produces over 170% relative gain in P@1. The authors further present a retrieval framework whose question-selection component conditions on both the initial query and previous question-answer turns, yielding statistically significant gains over competitive baselines.
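
To make the oracle figure concrete, the following sketch computes P@1 with and without an oracle-selected question on a toy collection and reports the relative gain. Everything in it is an illustrative assumption: the document ids, the per-facet rankings, and the function names (`precision_at_1`, `oracle_p_at_1`, `relative_gain`) are hypothetical and are not the paper's code or data. The toy gain of 200% is not the reported figure. For reference, a relative gain of 170% means the improved P@1 is 2.7 times the baseline.

```python
def precision_at_1(ranking, relevant):
    """P@1 is 1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0


def oracle_p_at_1(rankings_by_question, relevant):
    """Best P@1 reachable by asking exactly one of the candidate questions."""
    return max(precision_at_1(r, relevant) for r in rankings_by_question.values())


def relative_gain(baseline, improved):
    """Relative improvement in percent; requires a non-zero baseline."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical toy facets: each has a relevant set, a ranking for the bare
# query, and one ranking per candidate clarifying question.
facets = [
    {"relevant": {"d1"}, "baseline": ["d1", "d2"],
     "by_question": {"q1": ["d1", "d3"], "q2": ["d2", "d1"]}},
    {"relevant": {"d4"}, "baseline": ["d2", "d4"],
     "by_question": {"q1": ["d4", "d2"], "q2": ["d2", "d4"]}},
    {"relevant": {"d5"}, "baseline": ["d6", "d5"],
     "by_question": {"q1": ["d6", "d5"], "q2": ["d5", "d6"]}},
    {"relevant": {"d7"}, "baseline": ["d8", "d7"],
     "by_question": {"q1": ["d8", "d7"], "q2": ["d9", "d8"]}},
]

baseline = sum(precision_at_1(f["baseline"], f["relevant"]) for f in facets) / len(facets)
oracle = sum(oracle_p_at_1(f["by_question"], f["relevant"]) for f in facets) / len(facets)
gain = relative_gain(baseline, oracle)

print(f"baseline P@1={baseline:.2f}, oracle P@1={oracle:.2f}, gain={gain:.0f}%")
```

Because the oracle takes the maximum over candidate questions using the relevance judgments themselves, the number it produces is an upper bound on what any question-selection model could reach with the same candidates.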

What carries the argument

The question selection model that scores candidate clarifying questions using the original query together with the history of prior question-answer exchanges.
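
As a rough illustration of what "conditioning on the query and the conversation history" means for selection, the sketch below scores each unasked candidate by its token overlap with the query plus all prior question-answer turns. The overlap scorer is a deliberately simple stand-in chosen for this example, not the paper's model, and the names `tokens` and `select_next_question` as well as the example conversation are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_next_question(query, history, candidates):
    """Return the unasked candidate with the largest overlap with the context.

    history is a list of (question, answer) pairs from earlier turns.
    Returns None when every candidate has already been asked.
    """
    asked = {question for question, _ in history}
    context = tokens(query)
    for question, answer in history:
        context |= tokens(question) | tokens(answer)
    remaining = [c for c in candidates if c not in asked]
    if not remaining:
        return None
    # max() keeps the first candidate on ties, so input order is the tie-breaker.
    return max(remaining, key=lambda c: len(tokens(c) & context))


# Made-up example conversation.
candidates = [
    "Are you looking for dinosaur pictures?",
    "Do you want to know where fossils were found?",
    "Would you like to buy dinosaur toys?",
]
history = [("Are you looking for dinosaur pictures?", "No, I want to know about fossils")]

first_pick = select_next_question("dinosaurs", [], candidates)
second_pick = select_next_question("dinosaurs", history, candidates)
```

With an empty history all three candidates tie and the first is returned; once the user's answer mentions fossils, the second candidate wins. A learned model would replace the overlap count with a trained scoring function over the same inputs.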

If this is right

  • Conversational systems limited to one result per turn gain substantial accuracy by asking even a single clarifying question.
  • Question selection improves when the model explicitly conditions on both the initial query and accumulated conversation history.
  • The Qulac dataset supplies an offline testbed that enables repeatable comparison of clarifying-question strategies.
  • Releasing the dataset and evaluation methodology supports community progress on the formulated task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported gains assume users will answer the system's questions; real deployments must handle non-responses or off-topic replies.
  • Extending the framework to generate rather than retrieve questions could increase coverage beyond the collected facets.
  • The 170% figure is an oracle upper bound; practical systems will need robust question-ranking methods to approach it.
  • The same selection logic could be tested on multi-turn clarification sequences rather than single questions.

Load-bearing premise

The crowdsourced questions and answers in Qulac accurately capture real-world user clarifying needs and interactions in open-domain conversations.

What would settle it

A live user study comparing task-completion rates and satisfaction when a system asks questions chosen by the proposed model versus a no-question baseline on the same TREC topics.

Figures

Figures reproduced from arXiv: 1907.06554 by Fabio Crestani, Hamed Zamani, Mohammad Aliannejadi, W. Bruce Croft.

Figure 1: Example conversations with clarifying questions
Figure 2: A workflow for asking clarifying questions in an …
Figure 3: Impact of topic type, facet type, and query length
Original abstract

Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.
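
The three components named in the abstract (question retrieval, question selection, document retrieval) can be wired together as a simple loop. The sketch below shows only that control flow under stated assumptions: the function `clarify_then_retrieve`, its parameters, and the lambda stand-ins are hypothetical, and each component is passed in as a callable because the abstract does not specify their implementations.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs


def clarify_then_retrieve(
    query: str,
    retrieve_questions: Callable[[str], List[str]],
    select_question: Callable[[str, History, List[str]], str],
    ask_user: Callable[[str], str],
    retrieve_documents: Callable[[str, History], List[str]],
    max_turns: int = 1,
) -> Tuple[List[str], History]:
    """Ask up to max_turns clarifying questions, then retrieve documents."""
    history: History = []
    candidates = retrieve_questions(query)  # component 1: question retrieval
    for _ in range(max_turns):
        asked = {question for question, _ in history}
        remaining = [c for c in candidates if c not in asked]
        if not remaining:
            break
        # component 2: selection sees the query and all earlier turns
        question = select_question(query, history, remaining)
        history.append((question, ask_user(question)))
    # component 3: document retrieval uses the query plus the collected answers
    return retrieve_documents(query, history), history


# Toy stand-ins, for illustration only.
docs, history = clarify_then_retrieve(
    query="jaguar",
    retrieve_questions=lambda q: ["Do you mean the animal?", "Do you mean the car?"],
    select_question=lambda q, h, cands: cands[0],
    ask_user=lambda question: "yes",
    retrieve_documents=lambda q, h: [f"{q} | {' '.join(a for _, a in h)}"],
    max_turns=2,
)
```

Passing the components as callables keeps the loop independent of any particular retrieval or ranking model, which is convenient for offline comparison of selection strategies.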

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formulates the task of asking clarifying questions in open-domain information-seeking conversational systems. It introduces an offline evaluation methodology and releases the Qulac dataset of over 10K crowdsourced question-answer pairs built on 198 TREC Web Track 2009-2012 topics with 762 facets. An oracle model that selects one good question reports over 170% improvement in P@1 retrieval performance. The authors further propose a three-component framework (question retrieval, question selection accounting for prior interactions, and document retrieval) whose question selection component outperforms competitive baselines.

Significance. If the results hold, the work would be significant for highlighting the potential value of proactive clarification in conversational IR and for releasing a public dataset that can support further research. The oracle result quantifies a large potential upside, and the public Qulac release is a clear strength for reproducibility.

major comments (2)
  1. [Abstract / experimental section] Abstract and experimental results: the central oracle claim of >170% P@1 improvement is presented without error bars, statistical significance tests, details on data exclusion criteria, or variance across topics/facets. This directly affects assessment of whether the reported gain reliably supports the 'potential impact' conclusion.
  2. [Dataset section] Dataset construction (Qulac): the crowdsourcing protocol on predefined TREC facets is described, but no validation against real user logs or naturally occurring clarifying questions is provided. This assumption is load-bearing for interpreting the oracle gains as indicative of practical value in open-domain conversations.
minor comments (1)
  1. [Model section] The description of the question selection model could clarify how previous QA pairs are encoded and whether the model is trained end-to-end or in stages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our paper. We are pleased that the significance of the work and the release of the Qulac dataset are recognized. We address the major comments point-by-point below, and will incorporate revisions as indicated.

Point-by-point responses
  1. Referee: [Abstract / experimental section] Abstract and experimental results: the central oracle claim of >170% P@1 improvement is presented without error bars, statistical significance tests, details on data exclusion criteria, or variance across topics/facets. This directly affects assessment of whether the reported gain reliably supports the 'potential impact' conclusion.

    Authors: We agree that additional statistical details would improve the robustness of the oracle claim. The reported 170% improvement is the average relative gain in P@1 when using an oracle to select one clarifying question versus no question, computed over the entire Qulac dataset derived from 198 TREC topics. Data exclusion was limited to the TREC Web Track 2009-2012 topics that have multiple facets. In the revised version, we will report the standard deviation of the improvement across topics, include error bars in the relevant figure or table, and perform a paired statistical significance test (e.g., Wilcoxon signed-rank test) to confirm the gain is reliable. This addresses the concern about variance and supports the potential impact conclusion more rigorously. revision: yes

  2. Referee: [Dataset section] Dataset construction (Qulac): the crowdsourcing protocol on predefined TREC facets is described, but no validation against real user logs or naturally occurring clarifying questions is provided. This assumption is load-bearing for interpreting the oracle gains as indicative of practical value in open-domain conversations.

    Authors: We recognize that direct validation against real user logs would provide stronger evidence for practical applicability. The Qulac dataset leverages TREC topics and facets, which were created to represent diverse user interpretations of ambiguous queries, and the crowdsourcing process generates questions that help distinguish between these facets. This setup allows for controlled offline evaluation of the task. We do not have access to real conversational logs for validation in this work. In the revision, we will expand the discussion section to explicitly note this as a limitation and explain why TREC-based facets serve as a reasonable proxy for studying clarifying questions in open-domain settings. We believe this maintains the value of the dataset as a benchmark while being transparent about its construction. revision: partial
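
The statistics promised in the first response above could be computed along these lines. This is a minimal sketch under stated assumptions: the per-topic P@1 values are made up, the function name `sign_flip_p_value` is hypothetical, and an exact paired sign-flip randomization test stands in for the Wilcoxon signed-rank test named in the rebuttal so that the example needs only the standard library.

```python
from itertools import product
from statistics import mean, stdev


def sign_flip_p_value(diffs):
    """Exact two-sided paired sign-flip randomization test.

    Under the null hypothesis each paired difference is equally likely to
    have either sign, so we enumerate all sign assignments and count how
    often the absolute sum is at least as large as the observed one.
    Enumeration is exponential in len(diffs); fine for a toy example only.
    """
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total


# Made-up per-topic P@1 values (0 or 1), not taken from the paper.
baseline_p1 = [0, 0, 1, 0, 0, 1, 0, 0]
oracle_p1 = [1, 0, 1, 1, 0, 1, 1, 1]

diffs = [o - b for b, o in zip(baseline_p1, oracle_p1)]
mean_diff = mean(diffs)    # average per-topic improvement
std_diff = stdev(diffs)    # sample standard deviation across topics
p_value = sign_flip_p_value(diffs)

print(f"mean diff={mean_diff:.3f}, std={std_diff:.3f}, p={p_value:.3f}")
```

For a collection the size of Qulac, exhaustive enumeration is infeasible; a sampled version of the same test or a library implementation of the Wilcoxon signed-rank test would be the practical choice.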

Circularity Check

0 steps flagged

No circularity: empirical results from new crowdsourced dataset and external baselines

Full rationale

The paper's core claims rest on collecting a new dataset (Qulac) via crowdsourcing over TREC Web topics/facets, then running an oracle experiment and a three-component retrieval framework that is compared to competitive baselines. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. The 170% P@1 gain is an observed experimental outcome on the collected data, not a self-defined or self-cited tautology. Self-citations, if any, are not load-bearing for the central empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions from information retrieval evaluation and crowdsourcing practices without introducing new free parameters or invented entities.

axioms (1)
  • Domain assumption: standard IR metrics such as P@1 are suitable for evaluating the impact of clarifying questions.
    Invoked when reporting the 170% improvement without additional justification for the metric choice in the new task.

pith-pipeline@v0.9.0 · 5792 in / 1085 out tokens · 44744 ms · 2026-05-24T21:26:41.967252+00:00 · methodology


