Asking Clarifying Questions in Open-Domain Information-Seeking Conversations

Fabio Crestani; Hamed Zamani; Mohammad Aliannejadi; W. Bruce Croft

arXiv: 1907.06554 · v1 · pith:2HQFJ24Knew · submitted 2019-07-15 · 💻 cs.CL · cs.AI · cs.IR

Pith reviewed 2026-05-24 21:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords: clarifying questions · open-domain conversations · information-seeking · conversational search · Qulac dataset · question selection · retrieval performance

The pith

In an oracle setting, one well-chosen clarifying question improves retrieval P@1 by over 170% in open-domain conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Users struggle to state complex needs in one query, forcing them to scan results or reformulate. The paper shows that open-domain conversational systems can instead ask clarifying questions to resolve ambiguity before retrieving documents. The authors build the Qulac dataset of more than 10,000 crowdsourced question-answer pairs over 198 TREC topics and 762 facets. An oracle experiment shows that selecting one effective question raises top-result precision (P@1) by over 170% relative to asking no question. A three-part retrieval framework, whose question-selection step conditions on the original query and prior answers, outperforms baselines that ignore conversation history.

Core claim

The paper formulates the task of asking clarifying questions in open-domain information-seeking conversations. It releases the Qulac dataset built on TREC Web Track 2009-2012 topics and shows via an oracle model that one well-chosen clarifying question produces over 170% relative gain in P@1. The authors further present a retrieval framework whose question-selection component conditions on both the initial query and previous question-answer turns, yielding statistically significant gains over competitive baselines.
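To make the headline number concrete: a relative gain of 170% means that P@1 after clarification is 2.7 times P@1 before it. The sketch below only illustrates that arithmetic. The baseline value of 0.10 and the helper name `relative_gain` are illustrative assumptions, not figures or code from the paper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a percentage."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before * 100.0


# Hypothetical baseline P@1 (a placeholder, not a value from the paper).
baseline_p1 = 0.10
# A 170% relative gain multiplies the baseline by 1 + 1.70 = 2.7.
clarified_p1 = baseline_p1 * 2.7

gain = relative_gain(baseline_p1, clarified_p1)
print(f"{gain:.0f}%")
```

Because the gain is relative, the same 170% corresponds to a small absolute change when the baseline is low, which is one reason the absolute P@1 values matter when reading the result.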

What carries the argument

The question selection model that scores candidate clarifying questions using the original query together with the history of prior question-answer exchanges.
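As a rough illustration of what "conditioning on the query and prior turns" can mean, the sketch below scores candidate questions with a toy term-overlap heuristic. It is not the authors' model: the scoring rule, the function names, and the example conversation are all assumptions made up for this illustration.

```python
from typing import List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs asked so far


def tokens(text: str) -> set:
    """Lowercased whitespace tokens; a stand-in for real text processing."""
    return set(text.lower().split())


def score_question(query: str, history: History, candidate: str) -> float:
    """Toy score: reward overlap with the query and prior answers,
    penalize overlap with questions that were already asked."""
    cand = tokens(candidate)
    if not cand:
        return 0.0
    context = tokens(query)
    asked = set()
    for question, answer in history:
        context |= tokens(answer)
        asked |= tokens(question)
    relevance = len(cand & context) / len(cand)
    redundancy = len(cand & asked) / len(cand)
    return relevance - redundancy


def select_next_question(query: str, history: History, candidates: List[str]) -> str:
    """Pick the highest-scoring candidate given the query and the history."""
    return max(candidates, key=lambda c: score_question(query, history, c))


# Invented example conversation (not drawn from Qulac).
query = "dinosaurs"
history = [("are you looking for dinosaur pictures", "no i want fossils")]
candidates = ["do you want dinosaur pictures", "which fossils interest you"]
chosen = select_next_question(query, history, candidates)
print(chosen)
```

The point of the sketch is the function signature rather than the heuristic: a selector that takes only `query` cannot avoid re-asking a question the user has already rejected, while one that also takes `history` can.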

If this is right

Conversational systems limited to one result per turn gain substantial accuracy by asking even a single clarifying question.
Question selection improves when the model explicitly conditions on both the initial query and accumulated conversation history.
The Qulac dataset supplies an offline testbed that enables repeatable comparison of clarifying-question strategies.
Releasing the dataset and evaluation methodology supports community progress on the formulated task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported gains assume users will answer the system's questions; real deployments must handle non-responses or off-topic replies.
Extending the framework to generate rather than retrieve questions could increase coverage beyond the collected facets.
The 170% figure is an oracle upper bound; practical systems will need robust question-ranking methods to approach it.
The same selection logic could be tested on multi-turn clarification sequences rather than single questions.
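The "oracle upper bound" point can be sketched as follows. An oracle looks up, per facet, which candidate question leads to the best P@1 once answered, so no fixed or learned policy evaluated on the same data can score higher. The per-facet outcomes below are invented for illustration and the names are hypothetical.

```python
from statistics import mean

# Hypothetical per-facet outcomes, invented for illustration only:
# P@1 (0 or 1) with no question, and P@1 after each candidate question is answered.
facets = [
    {"no_question": 0, "after": {"q1": 1, "q2": 0}},
    {"no_question": 0, "after": {"q1": 0, "q2": 0}},
    {"no_question": 1, "after": {"q1": 1, "q2": 1}},
    {"no_question": 0, "after": {"q1": 0, "q2": 1}},
]


def oracle_p1(after: dict) -> int:
    """The oracle picks whichever candidate question gives the best P@1."""
    return max(after.values())


baseline = mean(f["no_question"] for f in facets)      # never ask
always_q1 = mean(f["after"]["q1"] for f in facets)     # fixed policy: always ask q1
oracle = mean(oracle_p1(f["after"]) for f in facets)   # best question per facet

print(baseline, always_q1, oracle)
```

In this toy data the fixed policy sits between the no-question baseline and the oracle, which is the gap a practical question-ranking method would need to close.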

Load-bearing premise

The crowdsourced questions and answers in Qulac accurately capture real-world user clarifying needs and interactions in open-domain conversations.

What would settle it

A live user study comparing task-completion rates and satisfaction when a system asks questions chosen by the proposed model versus a no-question baseline on the same TREC topics.

Figures

Figures reproduced from arXiv: 1907.06554 by Fabio Crestani, Hamed Zamani, Mohammad Aliannejadi, W. Bruce Croft.

**Figure 2.** A workflow for asking clarifying questions in an …

**Figure 3.** Impact of topic type, facet type, and query length …

Original abstract

Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Qulac gives a usable new dataset and task setup for clarifying questions, with big oracle gains that still need real-user validation.

The paper defines the task of asking clarifying questions in open-domain conversational search and releases Qulac, a dataset of over 10K crowdsourced QA pairs on 198 TREC Web topics with 762 facets. That resource and the offline evaluation setup are the concrete additions beyond prior work on facets and clarification in IR. Their oracle experiment shows that feeding one good question-answer pair into retrieval lifts P@1 by more than 170 percent on the TREC topics, which is a clear signal that the direction has headroom. The three-component framework (question retrieval, selection that uses prior turns, and document retrieval) also beats the baselines they report. Those pieces are straightforward and reproducible enough to be useful starting points for others.

The soft spot is the gap between crowdsourced questions on fixed facets and what actual users would ask or answer in a live system. The large oracle number assumes the collected questions are representative and that users would respond in the expected way; the paper does not test that assumption directly. Minor issues include limited detail on variance across runs and how facets were sampled, but nothing that breaks the central empirical claim.

Readers working on conversational retrieval or dialogue systems will get immediate value from the dataset and the task framing. The work is grounded enough and the contribution timely enough that it should go to peer review rather than desk rejection, even if revisions will be needed on the evaluation realism.

Referee Report

2 major / 1 minor

Summary. The paper formulates the task of asking clarifying questions in open-domain information-seeking conversational systems. It introduces an offline evaluation methodology and releases the Qulac dataset of over 10K crowdsourced question-answer pairs built on 198 TREC Web Track 2009-2012 topics with 762 facets. An oracle model that selects one good question reports over 170% improvement in P@1 retrieval performance. The authors further propose a three-component framework (question retrieval, question selection accounting for prior interactions, and document retrieval) whose question selection component outperforms competitive baselines.

Significance. If the results hold, the work would be significant for highlighting the potential value of proactive clarification in conversational IR and for releasing a public dataset that can support further research. The oracle result quantifies a large potential upside, and the public Qulac release is a clear strength for reproducibility.

major comments (2)

[Abstract / experimental section] Abstract and experimental results: the central oracle claim of >170% P@1 improvement is presented without error bars, statistical significance tests, details on data exclusion criteria, or variance across topics/facets. This directly affects assessment of whether the reported gain reliably supports the 'potential impact' conclusion.
[Dataset section] Dataset construction (Qulac): the crowdsourcing protocol on predefined TREC facets is described, but no validation against real user logs or naturally occurring clarifying questions is provided. This assumption is load-bearing for interpreting the oracle gains as indicative of practical value in open-domain conversations.

minor comments (1)

[Model section] The description of the question selection model could clarify how previous QA pairs are encoded and whether the model is trained end-to-end or in stages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our paper. We are pleased that the significance of the work and the release of the Qulac dataset are recognized. We address the major comments point-by-point below, and will incorporate revisions as indicated.

Referee: [Abstract / experimental section] Abstract and experimental results: the central oracle claim of >170% P@1 improvement is presented without error bars, statistical significance tests, details on data exclusion criteria, or variance across topics/facets. This directly affects assessment of whether the reported gain reliably supports the 'potential impact' conclusion.

Authors: We agree that additional statistical details would improve the robustness of the oracle claim. The reported 170% improvement is the average relative gain in P@1 when using an oracle to select one clarifying question versus no question, computed over the entire Qulac dataset derived from 198 TREC topics. Data exclusion was limited to the TREC Web Track 2009-2012 topics that have multiple facets. In the revised version, we will report the standard deviation of the improvement across topics, include error bars in the relevant figure or table, and perform a paired statistical significance test (e.g., Wilcoxon signed-rank test) to confirm the gain is reliable. This addresses the concern about variance and supports the potential impact conclusion more rigorously.

revision: yes
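The paired analysis proposed in this response could look roughly like the sketch below, which reports the spread of per-topic gains and runs a Wilcoxon signed-rank test on paired P@1 values. The per-topic numbers are invented, the function name is hypothetical, and the test is a plain-Python normal approximation without tie or continuity correction, so the p-value is only approximate. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
import math
from statistics import mean, stdev


def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped; ties get average ranks.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Hypothetical per-topic P@1 without and with one clarifying question.
p1_no_question = [0.25, 0.0, 0.5, 0.125, 0.25, 0.0, 0.375, 0.5]
p1_one_question = [0.75, 0.25, 0.5, 0.5, 0.125, 0.5, 0.75, 0.625]

gains = [a - b for b, a in zip(p1_no_question, p1_one_question)]
w, p = wilcoxon_signed_rank(p1_no_question, p1_one_question)
print(f"mean gain={mean(gains):.3f} sd={stdev(gains):.3f} W={w} p={p:.3f}")
```

With only a handful of topics the normal approximation is crude; over the full set of topics it becomes more reasonable, and an exact test remains an option for small samples.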
Referee: [Dataset section] Dataset construction (Qulac): the crowdsourcing protocol on predefined TREC facets is described, but no validation against real user logs or naturally occurring clarifying questions is provided. This assumption is load-bearing for interpreting the oracle gains as indicative of practical value in open-domain conversations.

Authors: We recognize that direct validation against real user logs would provide stronger evidence for practical applicability. The Qulac dataset leverages TREC topics and facets, which were created to represent diverse user interpretations of ambiguous queries, and the crowdsourcing process generates questions that help distinguish between these facets. This setup allows for controlled offline evaluation of the task. We do not have access to real conversational logs for validation in this work. In the revision, we will expand the discussion section to explicitly note this as a limitation and explain why TREC-based facets serve as a reasonable proxy for studying clarifying questions in open-domain settings. We believe this maintains the value of the dataset as a benchmark while being transparent about its construction.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results from new crowdsourced dataset and external baselines

The paper's core claims rest on collecting a new dataset (Qulac) via crowdsourcing over TREC Web topics/facets, then running an oracle experiment and a three-component retrieval framework that is compared to competitive baselines. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. The 170% P@1 gain is an observed experimental outcome on the collected data, not a self-defined or self-cited tautology. Self-citations, if any, are not load-bearing for the central empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions from information retrieval evaluation and crowdsourcing practices without introducing new free parameters or invented entities.

axioms (1)

domain assumption: Standard IR metrics such as P@1 are suitable for evaluating the impact of clarifying questions.
Invoked when reporting the 170% improvement without additional justification for the metric choice in the new task.
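For readers less familiar with the metric, P@1 is the k = 1 case of precision at k: the fraction of the top-k ranked documents that are relevant. A minimal sketch follows; the document identifiers and the function name are made up for illustration.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=1):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k


# Hypothetical ranking and relevance judgments.
ranking = ["d3", "d1", "d7"]
relevant = {"d1", "d7"}

p_at_1 = precision_at_k(ranking, relevant, k=1)
p_at_3 = precision_at_k(ranking, relevant, k=3)
print(p_at_1, p_at_3)
```

For a single ranking P@1 is either 0 or 1, so a reported P@1 is an average over many queries or facets. That is why it suits a setting where the system returns one result per turn, and also why it ignores everything below the first rank.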

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: oracle model ... over 170% retrieval performance improvement in terms of P@1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Mohammad Aliannejadi, Masoud Kiaeeha, Shahram Khadivi, and Saeed Shiry Ghidary. 2014. Graph-Based Semi-Supervised Conditional Random Fields For Spoken Language Understanding Using Unaligned Data. In ALTA. 98–103

work page 2014

[2] [3]

In Situ and Context-Aware Target Apps Selection for Unified Mobile Search. In CIKM. 1383–1392

work page

[3] [4]

Bruce Croft

Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft

work page

[4] [5]

In SIGIR

Target Apps Selection: Towards a Unified Search Framework for Mobile Devices. In SIGIR. 215–224

work page

[5] [6]

Omar Alonso and Maria Stone. 2014. Building a Query Log via Crowdsourcing. In SIGIR. 939–942

work page 2014

[6] [7]

Harald Aust, Martin Oerder, Frank Seide, and Volker Steinbiss. 1995. The Philips automatic train timetable information system. Speech Communication 17, 3-4 (1995), 249–262

work page 1995

[7] [8]

Seyed Ali Bahrainian and Fabio Crestani. 2018. Augmentation of Human Memory: Anticipating Topics that Continue in the Next Meeting. In CHIIR. 150–159

work page 2018

[8] [9]

Nicholas J Belkin, Colleen Cool, Adelheit Stein, and Ulrich Thiel. 1995. Cases, scripts, and information-seeking strategies: On the design of interactive informa- tion retrieval systems. Expert systems with applications 9, 3 (1995), 379–395

work page 1995

[9] [10]

Benetka, Krisztian Balog, and Kjetil Nørvåg

Jan R. Benetka, Krisztian Balog, and Kjetil Nørvåg. 2017. Anticipating Information Needs Based on Check-in Activity. In WSDM. 41–50

work page 2017

[10] [11]

Pavel Braslavski, Denis Savenkov, Eugene Agichtein, and Alina Dubatovka. 2017. What Do You Mean Exactly?: Analyzing Clarification Questions in CQA. InCHIIR. 345–348

work page 2017

[11] [12]

Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. 2005. Learning to rank using gradient descent. In ICML. 89–96

work page 2005

[12] [13]

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question Answering in Context. In EMNLP. 2174–2184

work page 2018

[13] [14]

Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards Conversational Recommender Systems. In KDD. 815–824

work page 2016

[14] [15]

Charles L. A. Clarke, Nick Craswell, and Ian Soboroff. 2009. Overview of the TREC 2009 Web Track. In TREC

work page 2009

[15] [16]

Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. 2011. Overview of the TREC 2011 Web Track. In TREC

work page 2011

[16] [17]

Charles L. A. Clarke, Nick Craswell, and Ellen M. Voorhees. 2012. Overview of the TREC 2012 Web Track. In TREC

work page 2012

[17] [18]

Bruce Croft and R

W. Bruce Croft and R. H. Thompson. 1987. I3R: A new approach to the design of document retrieval systems. JASIS 38, 6 (1987), 389–404

work page 1987

[18] [19]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[19] [20]

Yulan He and Steve J. Young. 2005. Semantic processing using the Hidden Vector State model. Computer Speech & Language 19, 1 (2005), 85–106

work page 2005

[20] [21]

Hemphill, John J

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS Spoken Language Systems Pilot Corpus. In HLT. 96–101

work page 1990

[21] [22]

Di Jiang, Kenneth Wai-Ting Leung, Lingxiao Yang, and Wilfred Ng. 2015. Query suggestion with diversification and personalization. Knowl.-Based Syst. 89 (2015), 553–568

work page 2015

[22] [23]

Kato and Katsumi Tanaka

Makoto P. Kato and Katsumi Tanaka. 2016. To Suggest, or Not to Suggest for Queries with Diverse Intents: Optimizing Search Result Presentation. In WSDM. 133–142

work page 2016

[23] [24]

Johannes Kiesel, Arefeh Bahrami, Benno Stein, Avishek Anand, and Matthias Hagen. 2018. Toward Voice Query Clarification. In SIGIR. 1257–1260

work page 2018

[24] [25]

Weize Kong and James Allan. 2013. Extracting query facets from search results. In SIGIR. 93–102

work page 2013

[25] [26]

John Lafferty and Chengxiang Zhai. 2001. Document Language Models, Query Models, and Risk Minimization for Information Retrieval. In SIGIR. 111–119

work page 2001

[26] [27]

Bruce Croft

Victor Lavrenko and W. Bruce Croft. 2001. Relevance-Based Language Models. In SIGIR. 120–127

work page 2001

[27] [28]

Shane Culpepper

Xiaolu Lu, Alistair Moffat, and J. Shane Culpepper. 2016. The effect of pooling and evaluation depth on IR metrics. Inf. Retr. Journal 19, 4 (2016), 416–445

work page 2016

[28] [29]

A Deep Look into Neural Ranking Models for Information Retrieval

Harshith Padigela, Hamed Zamani, and W. Bruce Croft. 2019. Investigating the Successes and Failures of BERT for Passage Re-Ranking. arXiv:1903.06902 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[29] [30]

Joaquín Pérez-Iglesias and Lourdes Araujo. 2010. Standard Deviation as a Query Hardness Estimator. In SPIRE. 207–212

work page 2010

[30] [31]

Gorelov, Jean-Luc Gauvain, Esther Levin, Chin-Hui Lee, and Jay Wilpon

Roberto Pieraccini, Evelyne Tzoukermann, Z. Gorelov, Jean-Luc Gauvain, Esther Levin, Chin-Hui Lee, and Jay Wilpon. 1992. A speech understanding system based on statistical representation of semantics. In ICASSP. 193–196

work page 1992

[31] [32]

Ponte and W

Jay M. Ponte and W. Bruce Croft. 1998. A Language Modeling Approach to Information Retrieval. In SIGIR. 275–281

work page 1998

[32] [33]

Bruce Croft, and Wei Lin

Minghui Qiu, Liu Yang, Feng Ji, Wei Zhou, Jun Huang, Haiqing Chen, W. Bruce Croft, and Wei Lin. 2018. Transfer Learning for Context-Aware Question Match- ing in Information-seeking Conversations in E-commerce. In ACL (2). 208–213

work page 2018

[33] [34]

Bruce Croft, Johanne R

Chen Qu, Liu Yang, W. Bruce Croft, Johanne R. Trippas, Yongfeng Zhang, and Minghui Qiu. 2018. Analyzing and Characterizing User Intent in Information- seeking Conversations. In SIGIR. 989–992

work page 2018

[34] [35]

Filip Radlinski and Nick Craswell. 2017. A Theoretical Framework for Conversa- tional Search. In CHIIR. 117–126

work page 2017

[35] [36]

Sudha Rao and Hal Daumé. 2018. Learning to Ask Good Questions: Ranking Clarification Questions using Neural Expected Value of Perfect Information. In ACL (1). 2736–2745

work page 2018

[36] [37]

Sudha Rao and Hal Daumé III. 2019. Answer-based Adversarial Training for Generating Clarification Questions. arXiv:1904.02281 (2019)

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A Conversational Question Answering Challenge. arXiv:1808.07042 (2018)

Gary Ren, Xiaochuan Ni, Manish Malik, and Qifa Ke. 2018. Conversational Query Understanding Using Sequence to Sequence Modeling. In WWW. 1715–1724

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In TREC. 109–126

Damiano Spina, Johanne R. Trippas, Lawrence Cavedon, and Mark Sanderson. 2017. Extracting audio summaries to support effective spoken document search. JASIST 68, 9 (2017), 2101–2115

Yueming Sun and Yi Zhang. 2018. Conversational Recommender System. In SIGIR. 235–244

Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models. In ACL (2). 231–236

Johanne R. Trippas, Damiano Spina, Lawrence Cavedon, Hideo Joho, and Mark Sanderson. 2018. Informing the Design of Spoken Conversational Search: Perspective Paper. In CHIIR. 32–41

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:1706.03762 (2017)

Alexandra Vtyurina, Denis Savenkov, Eugene Agichtein, and Charles L. A. Clarke. 2017. Exploring Conversational Search With Humans, Assistants, and Wizards. In CHI Extended Abstracts. 2187–2193

Marilyn A. Walker, Rebecca J. Passonneau, and Julie E. Boland. 2001. Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems. In ACL. 515–522

Yansen Wang, Chenyi Liu, Minlie Huang, and Liqiang Nie. 2018. Learning to Ask Questions in Open-domain Conversational Systems with Typed Decoders. In ACL (1). 2193–2203

Jason D. Williams, Antoine Raux, Deepak Ramachandran, and Alan W. Black. 2013. The Dialog State Tracking Challenge. In SIGDIAL. 404–413

Qiang Wu, Christopher J. C. Burges, Krysta Marie Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Inf. Retr. 13, 3 (2010), 254–270

Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System. In SIGIR. 55–64

Rui Yan, Dongyan Zhao, and Weinan E. 2017. Joint Learning of Response Ranking and Next Utterance Suggestion in Human-Computer Conversation System. In SIGIR. 685–694

Liu Yang, Hamed Zamani, Yongfeng Zhang, Jiafeng Guo, and W. Bruce Croft. 2017. Neural Matching Models for Question Retrieval and Next Question Prediction in Conversation. arXiv:1707.05409 (2017)

Chengxiang Zhai and John Lafferty. 2017. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. SIGIR Forum 51, 2 (2017), 268–276

Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W. Bruce Croft. 2018. Towards Conversational Search and Recommendation: System Ask, User Respond. In CIKM. 177–186