arXiv: 2306.08550 · v2 · submitted 2023-06-14 · cs.HC · cs.AI · cs.IR

User Simulation for Evaluating Information Access Systems

Pith reviewed 2026-05-24 08:02 UTC · model grok-4.3

classification: cs.HC · cs.AI · cs.IR
keywords: user simulation · information access systems · evaluation · search engines · recommender systems · conversational assistants · user modeling · interactive systems

The pith

User simulation emerges as a solution for evaluating interactive information access systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews techniques for simulating users to evaluate systems like search engines, recommender systems, and conversational assistants. These systems are central to daily information needs, yet judging their overall effectiveness is hard because of wide differences in how real users behave and what they prefer. User simulation models interactive behaviors to stand in for actual people during testing. The review covers general design frameworks, specific algorithms for each system type, and links to fields such as machine learning and economics. It closes by outlining future directions that reach beyond evaluation of these systems.

Core claim

User simulation emerges as a promising solution to a long-standing challenge: evaluating how effectively an information access system assists users through interactive support. Overall effectiveness is hard to assess in the first place, and substantial variation in user behaviour and preferences makes it harder. The book covers background on evaluation, applications of simulation, major research progress on frameworks and models for simulating interactions, and connections to related disciplines.

What carries the argument

User simulators, including general frameworks for their design and specific models and algorithms that simulate interactions with search engines, recommender systems, and conversational assistants.

If this is right

  • System effectiveness can be measured repeatedly without recruiting new participants for each test.
  • Models from machine learning and dialogue systems can be imported to improve simulator accuracy.
  • Evaluation methods developed here can extend to other interactive intelligent systems.
  • Tailored simulators can be built for each class of information access system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Rapid prototyping of new interfaces becomes feasible when simulators replace early user tests.
  • Simulators that vary preference parameters could reveal which system features hold up across different user groups.
  • Links to economics might yield evaluation metrics that treat user effort as a cost to be minimized.

Load-bearing premise

Simulated user behaviors can sufficiently capture the substantial variation in real user behaviour and preferences to allow reliable assessment of system effectiveness.

What would settle it

An experiment that runs the same set of systems through both simulation-based evaluations and real-user studies and finds that the two methods produce inconsistent rankings of which system performs best.
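
One way to make such a comparison concrete is a rank correlation between the system ordering obtained from simulation and the ordering obtained from a user study. The sketch below computes Kendall's tau-a in plain Python. The function name, the system labels, and all scores are made-up placeholders for illustration; none of them come from the book.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by system name.

    Assumes both dicts cover the same systems. Tied pairs count as
    neither concordant nor discordant.
    """
    systems = sorted(scores_a)
    if set(systems) != set(scores_b):
        raise ValueError("both evaluations must cover the same systems")
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both evaluations order the pair the same way.
        direction = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores (illustrative only).
simulated = {"A": 0.61, "B": 0.55, "C": 0.48, "D": 0.40}
user_study = {"A": 0.70, "B": 0.52, "C": 0.58, "D": 0.35}

tau = kendall_tau(simulated, user_study)
print(f"Kendall's tau between the two rankings: {tau:.3f}")
```

A value near 1 would mean the two methods rank the systems almost identically, while a value near 0 or below would be the kind of inconsistency described above. How low a value counts as "inconsistent" is a judgment call that this sketch does not settle.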

Figures

Figures reproduced from arXiv: 2306.08550 by ChengXiang Zhai, Krisztian Balog.

Figure 5.1: The classic model of information retrieval, adapted from Broder (2002).
Figure 5.2: Integrated information seeking and retrieval (IS&R) research framework, adapted from Ingwersen and Järvelin (2005).
Figure 5.3: Information seeking process model, adapted from Marchionini (1995).
Figure 5.4: Search, as an evolving process, according to the berry-picking model (Bates, 1989) (illustration adapted from White (2016)).
Figure 6.1: Flowchart of a naive searcher model, corresponding to a highly abstracted user (illustration adapted from Maxwell (2019)). [Flowchart nodes: Formulate query, Scan a snippet, Click a link, Read a document, Judge document relevance, Stop session; edges are labelled P=1 or P≤1.]
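
A flowchart like the one in Figure 6.1, where each edge carries a probability, can be read as a Markov chain over user actions. The sketch below simulates one session from such a chain. The action names follow the node labels of Figure 6.1, but the edges and every probability in `TRANSITIONS` are invented for illustration, and `simulate_session` is a hypothetical helper rather than code from the book.

```python
import random

# Hypothetical transition table: action names follow the Figure 6.1 labels,
# but the exact edges and all probabilities are assumptions.
TRANSITIONS = {
    "formulate_query": {"scan_snippet": 1.0},
    "scan_snippet": {
        "click_link": 0.4,
        "scan_snippet": 0.3,
        "formulate_query": 0.2,
        "stop_session": 0.1,
    },
    "click_link": {"read_document": 1.0},
    "read_document": {"judge_relevance": 1.0},
    "judge_relevance": {
        "scan_snippet": 0.6,
        "formulate_query": 0.2,
        "stop_session": 0.2,
    },
}


def simulate_session(rng, max_steps=1000):
    """Sample a sequence of user actions until the session stops.

    The next action depends only on the current one. max_steps is a
    safety cap so the loop always terminates.
    """
    state = "formulate_query"
    trace = [state]
    while state != "stop_session" and len(trace) < max_steps:
        options = TRANSITIONS[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        trace.append(state)
    return trace


trace = simulate_session(random.Random(0))
print(" -> ".join(trace))
```

Passing a seeded `random.Random` keeps runs repeatable, which matters when the same simulated population is reused to compare several systems.
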
Figure 6.2: Automaton expressing the subtasks performed during a search session, according to Baskaya et al. (2013) (illustration adapted from Baskaya et al. (2013)).
Figure 6.3: Searcher model by Baskaya et al. (2013), visualized as a flowchart (illustration adapted from Maxwell (2019)).
Figure 6.4: Flowchart of the Complex Searcher Model (illustration adapted from Maxwell (2019)). The main components are: (A) topic examination, (B) querying, (C) SERP examination, (D) result summary examination, (E) document examination, and (F) deciding to stop.
Figure 6.5: Flowchart of a basic user model for recommender systems that present recommendations as a ranked list.
Figure 6.6: Flowchart of an advanced user model for recommender systems, corresponding to carousel-based interfaces with multiple ranked lists (rows). The main components are: (A) row examination, (B) item examination, (C) feedback, and (D) stopping decisions.
Figure 6.7: Example TREC topic definition (from Robust 2003 track). The terms present in such topic definitions are often used as the basis of query generation.
Figure 6.8: Examples of different result presentation layouts (illustration adapted from Oosterhuis and Rijke (2018)).
Figure 6.9: Excerpt from the updated Complex Searcher Model (Maxwell and Azzopardi, 2018), highlighting various stopping decision points: (1) SERP-level stopping, (2) query-level stopping, and (3) session-level stopping.
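
As a minimal sketch of what a single stopping decision might look like inside a simulator, the function below combines two simple rules while a simulated user walks down a result list: a satisfaction rule (stop once enough relevant results have been seen) and a frustration rule (stop after a run of non-relevant results). The function name and both thresholds are assumptions for illustration; they are not the parameters of the Complex Searcher Model.

```python
def examined_depth(judgments, satisfied_after=3, frustrated_after=4):
    """Return how many results a simulated user examines before stopping.

    judgments: list of booleans, True if the result at that rank is
    judged relevant. Both thresholds are hypothetical parameters.
    """
    relevant_seen = 0
    nonrelevant_run = 0
    for depth, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            relevant_seen += 1
            nonrelevant_run = 0
        else:
            nonrelevant_run += 1
        if relevant_seen >= satisfied_after:
            return depth  # satisfaction rule fires
        if nonrelevant_run >= frustrated_after:
            return depth  # frustration rule fires
    return len(judgments)  # list exhausted without either rule firing


# Relevant results near the top: the satisfaction rule ends the scan early.
early_hits = [True, True, True, False, False, False, False, False]
# Relevant results deep in the list: the frustration rule fires first,
# so this simulated user never reaches them.
deep_hits = [False, False, False, False, True, True, True, True]

print(examined_depth(early_hits), examined_depth(deep_hits))
```

The same kind of rule could in principle be applied at each of the three decision points named in the caption, with different evidence and thresholds at each level.
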
Figure 7.1: Modular user simulator architecture for conversational information access.
Figure 7.2: End-to-end user simulator.
Figure 7.3: CIR6 dialogue workflow model, designed specifically for the task of conversational item recommendation (illustration adapted from Zhang and Balog (2020)).
Figure 7.4: The QRFA model for conversational search; user actions are displayed with a white, system actions with a grey background (illustration adapted from Vakulenko et al. (2019)).
Figure 7.5: Flowchart of the user model in Lipani et al. (2021) for simulating user interactions with a conversational search system.
Figure 8.1: User simulation in the evaluation workflow.
Original abstract

Information access systems, such as search engines, recommender systems, and conversational assistants, have become integral to our daily lives as they help us satisfy our information needs. However, evaluating the effectiveness of these systems presents a long-standing and complex scientific challenge. This challenge is rooted in the difficulty of assessing a system's overall effectiveness in assisting users to complete tasks through interactive support, and further exacerbated by the substantial variation in user behaviour and preferences. To address this challenge, user simulation emerges as a promising solution. This book focuses on providing a thorough understanding of user simulation techniques designed specifically for evaluation purposes. We begin with a background of information access system evaluation and explore the diverse applications of user simulation. Subsequently, we systematically review the major research progress in user simulation, covering both general frameworks for designing user simulators, utilizing user simulation for evaluation, and specific models and algorithms for simulating user interactions with search engines, recommender systems, and conversational assistants. Realizing that user simulation is an interdisciplinary research topic, whenever possible, we attempt to establish connections with related fields, including machine learning, dialogue systems, user modeling, and economics. We end the book with a detailed discussion of important future research directions, many of which extend beyond the evaluation of information access systems and are expected to have broader impact on how to evaluate interactive intelligent systems in general.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This manuscript is a book-length survey on user simulation for evaluating information access systems (search engines, recommender systems, conversational assistants). It argues that user simulation addresses the long-standing evaluation challenge rooted in assessing overall effectiveness amid substantial variation in user behavior and preferences; the book reviews background and applications, general frameworks, specific models and algorithms for different system types, interdisciplinary connections (ML, dialogue systems, user modeling, economics), and future research directions.

Significance. A thorough, systematic synthesis of user simulation literature could serve as a key reference for researchers in information retrieval and HCI, clarifying frameworks and highlighting cross-field links that may inform evaluation practices for interactive systems.

major comments (2)
  1. [Abstract / Introduction] The central claim in the Abstract, that user simulation 'emerges as a promising solution' to the evaluation challenge, rests on the assumption that the reviewed simulators sufficiently capture real-user variation. However, the survey provides no meta-analysis, aggregated fidelity metrics, or quantitative synthesis demonstrating that the collected techniques reproduce observed diversity (e.g., task-completion variance or preference distributions).
  2. [Specific models and algorithms] The sections on specific models for search engines, recommender systems, and conversational assistants report no cross-model comparisons or coverage statistics on behavioral variation, so it is unclear whether the reviewed approaches collectively support the system-level effectiveness conclusions asserted.
minor comments (2)
  1. Notation for simulator components (e.g., user state, action spaces) could be standardized across chapters for easier comparison.
  2. Some figure captions describing simulation pipelines are terse and would benefit from explicit mapping to the reviewed frameworks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. As this is a survey synthesizing existing literature rather than an empirical study, we address the comments by clarifying scope and offering targeted revisions to the framing and discussion sections.

Point-by-point responses
  1. Referee: [Abstract / Introduction] The central claim in the Abstract, that user simulation 'emerges as a promising solution' to the evaluation challenge, rests on the assumption that the reviewed simulators sufficiently capture real-user variation. However, the survey provides no meta-analysis, aggregated fidelity metrics, or quantitative synthesis demonstrating that the collected techniques reproduce observed diversity (e.g., task-completion variance or preference distributions).

    Authors: We agree the survey contains no new meta-analysis or aggregated quantitative fidelity metrics, as its purpose is to organize and review the existing body of work. The claim draws from the qualitative patterns across the cited studies showing utility in specific evaluation scenarios. We will revise the abstract and introduction to qualify the language (e.g., 'has been positioned as a promising approach in the literature') and add an explicit paragraph in the future directions section noting the current absence of field-wide meta-analyses or standardized fidelity benchmarks as an important gap. revision: partial

  2. Referee: [Specific models and algorithms] The sections on specific models for search engines, recommender systems, and conversational assistants report no cross-model comparisons or coverage statistics on behavioral variation, so it is unclear whether the reviewed approaches collectively support the system-level effectiveness conclusions asserted.

    Authors: The relevant sections review published models but introduce no new cross-model empirical comparisons, which would fall outside the scope of a literature survey. We will add a concise discussion subsection (or summary table) that tabulates the behavioral dimensions (e.g., task types, preference distributions) addressed by the reviewed simulators according to the original papers, and explicitly state that standardized cross-model benchmarks remain unavailable. revision: partial

Circularity Check

0 steps flagged

No circularity: literature review with no derivations or fitted quantities

full rationale

The document is a survey book reviewing existing user simulation frameworks, models, and applications for evaluating information access systems. It presents background, systematic reviews of prior work, connections to related fields, and future directions, but contains no original equations, predictions, parameter fits, or derivation chains. The central claim that user simulation is promising is presented as a literature synthesis rather than a result derived from the paper's own inputs. No self-definitional steps, fitted-input predictions, or load-bearing self-citation reductions are present. This matches the default expectation for non-circular survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review book; no new free parameters, axioms, or invented entities are introduced in the provided abstract.

pith-pipeline@v0.9.0 · 5765 in / 866 out tokens · 54771 ms · 2026-05-24T08:02:54.538495+00:00
