pith. machine review for the scientific record.

arxiv: 2604.23406 · v1 · submitted 2026-04-25 · 💻 cs.IR · cs.HC

Recognition: unknown

IIRSim Studio: A Dashboard for User Simulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 07:19 UTC · model grok-4.3

classification: 💻 cs.IR · cs.HC
keywords: user simulation · information retrieval · evaluation · reproducibility · provenance · web-based workbench · simulation pipelines · shared tasks

The pith

IIRSim Studio supplies a web workbench that lets researchers visually build, version, and share user simulation pipelines for information retrieval experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

User simulation supports low-cost and counterfactual evaluation in information retrieval, yet existing frameworks remain code-centric libraries that demand substantial setup and hinder reproducibility. The paper argues that the core limit is not the engines themselves but the missing infrastructure that joins design, execution, and sharing into one verifiable process. IIRSim Studio supplies that infrastructure through a visual composer for pipelines, a Git-backed system for creating and distributing components, a provenance model that records experiment bundles and environment templates, and a workflow for redeploying shared tasks. If these features work as described, both novices and experts could run and reuse simulations with far less manual effort.

Core claim

The paper introduces IIRSim Studio, a web-based workbench built from four parts:

  • a visual environment for composing simulation pipelines on top of existing frameworks;
  • a component lifecycle for authoring, versioning, and sharing custom components through Git-backed storage and runtime injection;
  • a provenance model built on experiment bundles and environment templates that makes replication scope explicit;
  • a shared-task workflow, illustrated by redeploying a Sim4IA micro-task.
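The paper does not publish a bundle schema, so the sketch below is an editorial illustration of what a provenance record that makes replication scope explicit could contain. Every name in it (ComponentRef, environment_template, the example URLs and hashes) is a hypothetical assumption, not IIRSim Studio's actual format.

```python
# A minimal sketch of an experiment bundle, assuming a Git-pinned component
# model. NOT IIRSim Studio's schema; all field names here are hypothetical.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ComponentRef:
    name: str       # role in the pipeline, e.g. "query_generator"
    repo_url: str   # Git repository holding the component source
    commit: str     # exact commit hash, pinning the component version

@dataclass(frozen=True)
class ExperimentBundle:
    bundle_id: str
    environment_template: str   # e.g. a container image tag for the runtime
    components: tuple           # ordered ComponentRef entries
    dataset_uri: str            # pinned location of the input data
    parameters: dict = field(default_factory=dict)

bundle = ExperimentBundle(
    bundle_id="sim4ia-micro-task-rerun-01",
    environment_template="example.org/iirsim-env:1.2.0",
    components=(
        ComponentRef("query_generator", "https://example.org/qgen.git", "a1b2c3d"),
        ComponentRef("stopping_strategy", "https://example.org/stop.git", "d4e5f6a"),
    ),
    dataset_uri="https://example.org/data/topics-v1",
    parameters={"num_simulated_users": 50},
)
```

The point of such a record is that a third party can rerun the experiment from the bundle alone: environment, component versions, and data are all pinned, so the scope of replication is stated rather than implied.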

What carries the argument

The visual pipeline composer together with the Git-backed component lifecycle and the experiment-bundle provenance model.
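Figure 2 suggests the composer's output is a graph of component nodes. A minimal way to picture what such a composer could hand to a backend is a declarative node list that an interpreter walks in order; the sketch below makes that assumption, with toy components standing in for real ones.

```python
# Editorial sketch: a composed pipeline as a declarative node list executed
# in order. Component names and parameters are illustrative, not the Studio's.

def query_generator(state, strategy):
    # Derive the next query from the simulated user's topic.
    state["query"] = f"{state['topic']} ({strategy})"
    state["queries_issued"] = state.get("queries_issued", 0) + 1
    return state

def stopping_strategy(state, rule, max_queries=3):
    # Decide whether the simulated session should end.
    state["stop"] = (rule == "fixed_depth" and
                     state.get("queries_issued", 0) >= max_queries)
    return state

REGISTRY = {"query_generator": query_generator,
            "stopping_strategy": stopping_strategy}

PIPELINE = [  # what a visual composer might serialize
    {"node": "query_generator", "params": {"strategy": "topic_terms"}},
    {"node": "stopping_strategy", "params": {"rule": "fixed_depth"}},
]

def run(pipeline, state):
    for step in pipeline:
        state = REGISTRY[step["node"]](state, **step["params"])
    return state

print(run(PIPELINE, {"topic": "user simulation"}))
# -> {'topic': 'user simulation', 'query': 'user simulation (topic_terms)',
#     'queries_issued': 1, 'stop': False}
```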

If this is right

  • Novices can learn simulation concepts by building pipelines visually instead of writing code.
  • Experts can assemble and run large-scale experiments by reusing versioned components injected at runtime (a sketch of one such injection mechanism follows this list).
  • Replication becomes verifiable because bundles and templates explicitly define the exact environment and data used.
  • Shared tasks can be redeployed as complete workflows, as demonstrated with the Sim4IA micro-task.
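The paper states that custom components live in Git-backed storage and are injected at runtime, but does not describe the mechanism. One conventional way to implement that pattern in Python is to clone a pinned commit and load the module dynamically; in the sketch below the repository URL, file name, and entry-point name are all hypothetical.

```python
# Hypothetical runtime injection of a Git-versioned component: clone a pinned
# commit, then import one callable from it with importlib. This illustrates
# the general pattern, not IIRSim Studio's actual code.
import importlib.util
import subprocess
import tempfile
from pathlib import Path

def load_component(repo_url: str, commit: str, module_file: str, attr: str):
    """Clone a component repo at an exact commit and import one callable."""
    workdir = Path(tempfile.mkdtemp(prefix="component-"))
    subprocess.run(["git", "clone", "--quiet", repo_url, str(workdir)], check=True)
    subprocess.run(["git", "-C", str(workdir), "checkout", "--quiet", commit],
                   check=True)
    spec = importlib.util.spec_from_file_location("injected_component",
                                                  workdir / module_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, attr)

# Usage (hypothetical repository and names):
# qgen = load_component("https://example.org/qgen.git", "a1b2c3d",
#                       "qgen.py", "generate_query")
```

Pinning the commit hash is what turns sharing into something verifiable: two users who load the same reference get byte-identical component code.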

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The containerized deployment option could let research groups run private instances without relying on the hosted service.
  • The provenance model might serve as a template for documenting other IR evaluation pipelines that currently lack explicit replication records.
  • If custom components become widely shared, the community could build a growing library of simulation building blocks that reduces repeated setup work across projects.

Load-bearing premise

The main barrier to adopting user simulation is the lack of infrastructure that connects experiment design, execution, and sharing, rather than shortcomings inside the simulation engines themselves.

What would settle it

A timed user study in which participants create and replicate the same simulation experiment once with IIRSim Studio and once with a conventional code library, measuring setup time, sharing success, and whether independent parties can reproduce the results from the recorded bundles.

Figures

Figures reproduced from arXiv: 2604.23406 by Adam Roegiest, Michael Granitzer, Saber Zerhoudi.

Figure 1. The IIRSim Studio architecture. The Frontend Workbench provides visual pipeline composition, tutorials, and a playground for prototyping. The Orchestration Backend translates pipelines into versioned experiment bundles that are executed in isolated Docker containers, either through the workbench or at scale through the API wrapper. UserSimCRS [1, 8] and UXSim [27] frameworks have integrated conversational …

Figure 2. The visual pipeline composer. Each node represents a simulation component (e.g., query generator, stopping strategy).

Figure 3. Custom component authoring and saving …
Original abstract

User simulation is a valuable methodology for evaluation in Information Retrieval (IR), enabling low-cost experimentation and counterfactual analysis. However, existing simulation frameworks are primarily code-centric libraries that require substantial setup effort, which limits adoption and hinders reproducibility. The bottleneck is not the simulation engines themselves, but the lack of infrastructure connecting experiment design, execution, and sharing into a single verifiable workflow. This paper introduces IIRSim Studio, a web-based workbench that addresses this gap through four contributions: (1) a visual environment for composing simulation pipelines on top of simulation frameworks, serving both novices learning simulation concepts and experts piloting large-scale experiments; (2) a component lifecycle that supports authoring, versioning, and sharing custom simulation components through Git-backed storage and runtime injection; (3) a provenance model based on experiment bundles and environment templates that makes the scope of replication explicit; and (4) a shared-task workflow, demonstrated through the re-deployment of a Sim4IA micro-task. IIRSim Studio is available as a hosted service and as a portable containerized deployment.
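The fourth contribution, the shared-task workflow, ties the other three together: resolve a recorded bundle, start its pinned environment, and execute the pipeline inside it. A hedged sketch, reusing the hypothetical ExperimentBundle from earlier and an invented in-container entry command, could look like this:

```python
# Hypothetical replay of a shared task from a recorded bundle. The Docker
# flags are standard; the image tag, mount path, and "run-pipeline" entry
# command are illustrative assumptions, not IIRSim Studio's interface.
import subprocess

def replay(bundle):
    cmd = [
        "docker", "run", "--rm",
        "-e", f"BUNDLE_ID={bundle.bundle_id}",
        "-v", "/srv/datasets:/data:ro",  # pinned dataset, mounted read-only
        bundle.environment_template,     # the bundle's environment template image
        "run-pipeline", "--bundle-id", bundle.bundle_id,
    ]
    subprocess.run(cmd, check=True)

# replay(bundle)  # would rerun, e.g., the Sim4IA micro-task from its bundle
```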

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IIRSim Studio, a web-based workbench for user simulation in Information Retrieval. It argues that the primary adoption barrier is the absence of infrastructure linking experiment design, execution, and sharing, rather than limitations in existing simulation engines. The paper outlines four contributions: a visual pipeline composition environment for novices and experts, a Git-backed component lifecycle for authoring/versioning/sharing with runtime injection, a provenance model using experiment bundles and environment templates to clarify replication scope, and a shared-task workflow illustrated by re-deploying a Sim4IA micro-task. The system is offered as both a hosted service and a containerized deployment.

Significance. If implemented and validated as described, the workbench could meaningfully improve reproducibility and lower barriers to user simulation in IR evaluation by unifying design, execution, and sharing workflows. The dual hosted/container deployment model is a concrete strength that directly supports accessibility and experiment verification.

major comments (2)
  1. Abstract: the claim that 'the bottleneck is not the simulation engines themselves, but the lack of infrastructure' is presented as given without citations, user studies, or analysis of simulation fidelity/validation issues; this premise is load-bearing for the motivation and scope of all four listed contributions.
  2. Description of the four contributions: the manuscript supplies only high-level feature lists with no implementation details, architectural diagrams, code snippets, or performance metrics, preventing assessment of whether the visual environment, Git-backed lifecycle, or provenance model actually function as claimed.
minor comments (2)
  1. The manuscript would benefit from explicit comparison to prior simulation frameworks (e.g., in a related-work section) to clarify how the visual and provenance features differ from existing code-centric libraries.
  2. Adding at least one figure or screenshot of the dashboard interface would substantially improve clarity of the 'visual environment' contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments correctly identify areas where the manuscript's motivation and technical description can be strengthened. We address each major comment below and commit to revisions that will improve the paper without altering its core contributions.

Point-by-point responses
  1. Referee: Abstract: the claim that 'the bottleneck is not the simulation engines themselves, but the lack of infrastructure' is presented as given without citations, user studies, or analysis of simulation fidelity/validation issues; this premise is load-bearing for the motivation and scope of all four listed contributions.

    Authors: We agree that the abstract's premise would be more robust with explicit support. In the revision we will add citations to prior IR literature on reproducibility challenges, setup overhead in simulation frameworks, and adoption barriers. We will also briefly reference known limitations in simulation fidelity and validation to contextualize why infrastructure is the focus. A new user study or original fidelity analysis lies outside the scope of this tool-description paper; we will explicitly note this limitation while grounding the claim in existing work. revision: partial

  2. Referee: Description of the four contributions: the manuscript supplies only high-level feature lists with no implementation details, architectural diagrams, code snippets, or performance metrics, preventing assessment of whether the visual environment, Git-backed lifecycle, or provenance model actually function as claimed.

    Authors: The current manuscript is intentionally concise and high-level. We accept that this limits evaluation. The revised version will include: (1) an architectural diagram of the visual pipeline composer, runtime injection mechanism, and provenance layer; (2) expanded implementation descriptions covering Git-backed component storage, environment templates, and experiment bundles; and (3) selected code/configuration snippets illustrating custom component authoring and shared-task deployment. As this is a workbench rather than a performance benchmark, traditional runtime metrics are less relevant, but we will report any available data on deployment footprint and responsiveness. revision: yes

Circularity Check

0 steps flagged

No circularity: system description with no derivations or self-referential reductions

Full rationale

The paper is a descriptive introduction of a web-based workbench and its four contributions. It contains no equations, fitted parameters, predictions, or mathematical derivations. The central premise that infrastructure (rather than engine fidelity or validation) is the primary adoption bottleneck is stated as an assumption in the abstract and introduction but is not derived from or reduced to any self-citation, prior result by the authors, or definitional loop within the paper. No load-bearing step reduces by construction to its inputs, satisfying the criteria for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software tool description paper with no mathematical model, fitted parameters, axioms, or invented scientific entities.

pith-pipeline@v0.9.0 · 5481 in / 1110 out tokens · 68050 ms · 2026-05-08T07:19:35.042544+00:00 · methodology


Reference graph

Works this paper leans on

30 references · 20 canonical work pages

[1] Jafar Afzali, Aleksander Mark Drzewiecki, Krisztian Balog, and Shuo Zhang. 2023. UserSimCRS: A User Simulation Toolkit for Evaluating Conversational Recommender Systems. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM '23). ACM, 1160–1163. doi:10.1145/3539597.3573029

[2] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. EvaluatIR: An Online Tool for Evaluating and Comparing IR Systems. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09). ACM, 833. doi:10.1145/1571941.1572139

[3] Leif Azzopardi, Timo Breuer, Björn Engelmann, Christin Kreutz, Sean MacAvaney, David Maxwell, Andrew Parry, Adam Roegiest, Xi Wang, and Saber Zerhoudi. 2024. SimIIR 3: A Framework for the Simulation of Interactive and Conversational Information Retrieval. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval …

[4] Krisztian Balog, Nolwenn Bernard, Saber Zerhoudi, and ChengXiang Zhai. 2025. Theory and Toolkits for User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation. (2025), 4138–4141.

[5] Krisztian Balog and ChengXiang Zhai. 2025. The Indispensable Role of User Simulation in the Pursuit of AGI. arXiv preprint arXiv:2509.19456 (2025).

[6] Feza Baskaya, Heikki Keskustalo, and Kalervo Järvelin. 2013. Modeling behavioral factors in interactive information retrieval. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2297–2302.

[7] Nolwenn Bernard. 2024. Leveraging User Simulation to Develop and Evaluate Conversational Information Access Agents. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (Merida, Mexico) (WSDM '24). Association for Computing Machinery, New York, NY, USA, 1136–1138. doi:10.1145/3616855.3635730

[8] Nolwenn Bernard and Krisztian Balog. 2026. UserSimCRS v2: Simulation-Based Evaluation for Conversational Recommender Systems. arXiv:2512.04588 [cs.IR] https://arxiv.org/abs/2512.04588

[9] Timo Breuer, Nicola Ferro, Maria Maistro, and Philipp Schaer. 2021. repro_eval: A python interface to reproducibility measures of system-oriented IR experiments. In European Conference on Information Retrieval. Springer, 481–486.

[10] Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. 2019. Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019). In Proceedings of the Open-Source IR Replicability Challenge co-located with 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019 …

[11] Nick Craswell, Onno Zoeter, Michael J. Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In Proceedings of the International Conference on Web Search and Web Data Mining, WSDM 2008, Palo Alto, California, USA, February 11-12, 2008, Marc Najork, Andrei Z. Broder, and Soumen Chakrabarti (Eds.). 87–94. doi:10.1145/134153…

[12] Jan de Wit. 2023. Leveraging Large Language Models as Simulated Users for Initial, Low-Cost Evaluations of Designed Conversations. In Chatbot Research and Design: 7th International Workshop, CONVERSATIONS 2023, Oslo, Norway, November 22–23, 2023, Revised Selected Papers (Oslo, Norway). Springer-Verlag, Berlin, Heidelberg, 77–93. doi:10.1007/978-3-031-54975-5_5

[13] Georges Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008, Sung-Hyon Myaeng, Douglas W. Oard, Fabrizio Sebastiani, Tat-Seng C…

[14] Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer, and Norbert Fuhr. 2024. Context-Driven Interactive Query Simulations Based on Generative Large Language Models. arXiv:2312.09631 [cs.IR] https://arxiv.org/abs/2312.09631

[15] Pierre Erbacher, Laure Soulier, and Ludovic Denoyer. 2022. State of the Art of User Simulation approaches for conversational information retrieval. arXiv:2201.03435 [cs.IR] https://arxiv.org/abs/2201.03435

[16] Nicola Ferro. 2017. Reproducibility Challenges in Information Retrieval Evaluation. ACM J. Data Inf. Qual. 8, 2 (2017), 8:1–8:4. doi:10.1145/3020206

[17] Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2019. CENTRE@CLEF2019: Overview of the Replicability and Reproducibility Tasks. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019 (CEUR Workshop Proceedings, Vol. 2380), Linda Cappellato, Nicola Ferro, David E. Losada, and Henning Müller (Eds.). https://ceur-ws.org/Vol-2380/paper_258.pdf

[18] Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. 2023. The information retrieval experiment platform. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2826–2836.

[19] Frank Hopfgartner, Allan Hanbury, Henning Müller, Ivan Eggel, Krisztian Balog, Torben Brodt, Gordon V Cormack, Jimmy Lin, Jayashree Kalpathy-Cramer, Noriko Kando, et al. 2018. Evaluation-as-a-service for the computational sciences: Overview and outlook. Journal of Data and Information Quality (JDIQ) 10, 4 (2018), 1–32.

[20] Johannes Kiesel, Marcel Gohsen, Nailia Mirzakhmedova, Matthias Hagen, and Benno Stein. 2024. Simulating Follow-Up Questions in Conversational Search. In Proceedings of the 46th European Conference on Information Retrieval (ECIR '24). Springer, 382–398. doi:10.1007/978-3-031-56060-6_25

[21] David Maxwell and Leif Azzopardi. 2016. Simulating Interactive Information Retrieval: SimIIR: A Framework for the Simulation of Interaction. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016, Raffaele Perego, Fabrizio Sebastiani, Javed A. Aslam, Ian … doi:10.1145/2911451.2911469

[22] David Maxwell and Leif Azzopardi. 2018. Information scent, searching and stopping: Modelling SERP level stopping behaviour. In European Conference on Information Retrieval. Springer, 210–222.

[23] David Maxwell, Leif Azzopardi, Kalervo Järvelin, and Heikki Keskustalo. 2015. Searching and Stopping: An Analysis of Stopping Rules and Strategies. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015, James Bailey, Alistair Moffat, Charu C. Aggarwal …

[24] Teemu Pääkkönen, Jaana Kekäläinen, Heikki Keskustalo, Leif Azzopardi, David Maxwell, and Kalervo Järvelin. 2017. Validating simulated interaction for retrieval evaluation. Inf. Retr. J. 20, 4 (2017), 338–362. doi:10.1007/s10791-017-9301-2

[25] Philipp Schaer, Christin Katharina Kreutz, Krisztian Balog, Timo Breuer, and Andreas Konstantin Kruff. 2025. Second SIGIR Workshop on Simulations for Information Access (Sim4IA 2025). In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy, July 13-18, 2025, Nicola Ferro …

[26] Sim4IA. 2025. SIGIR 2025 Micro Shared Task Repository. https://github.com/sim4ia/sigir2025-shared-task

[27] Saber Zerhoudi and Michael Granitzer. 2025. UXSim: Towards a Hybrid User Search Simulation. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25), November 10–14, 2025, Seoul, Republic of Korea. ACM. doi:10.1145/3746252.3761640

[28] Saber Zerhoudi, Sebastian Günther, Kim Plassmeier, Timo Borst, Christin Seifert, Matthias Hagen, and Michael Granitzer. 2022. The SimIIR 2.0 Framework: User Types, Markov Model-Based Interaction Simulation, and Advanced Query Generation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, Octob…

[29] Saber Zerhoudi, Adam Roegiest, and Johanne R. Trippas. 2026. Simulation of Interactive Information Retrieval: A Guided Tour. In Proceedings of the ACM Conference on Information Interaction and Retrieval (CHIIR '26). 1–3.

[30] Yinan Zhang, Xueqing Liu, and ChengXiang Zhai. 2017. Information Retrieval Evaluation as Search Simulation: A General Formal Framework for IR Evaluation. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2017, Amsterdam, The Netherlands, October 1-4, 2017, Jaap Kamps, Evangelos Kanoulas, Maarten de Rijke, Hu… doi:10.1145/3121050.3121070