Effects of Collaboration on the Performance of Interactive Theme Discovery Systems
Pith reviewed 2026-05-23 22:09 UTC · model grok-4.3
The pith
Synchronous and asynchronous collaboration lead to different outcomes in consistency, cohesiveness, and correctness when using interactive NLP-assisted theme discovery tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a framework to evaluate the way collaboration settings may produce different research outcomes across a variety of interactive systems. Specifically, we study the impact of synchronous versus asynchronous collaboration using three different NLP-assisted qualitative research tools and present a comprehensive analysis of the differences in the consistency, cohesiveness, and correctness of their outcomes.
What carries the argument
An evaluation framework that compares synchronous and asynchronous collaboration across multiple interactive NLP-assisted tools by tracking consistency, cohesiveness, and correctness of theme discovery outcomes.
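To make the three outcome dimensions concrete, the sketch below shows one way they could be computed. The paper's actual operationalizations are not given in this summary, so every definition here is an assumption: consistency as the Rand index between two teams' theme assignments, cohesiveness as the mean pairwise cosine similarity of the items in a theme, and correctness as agreement with reference labels. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def rand_index(labels_a, labels_b):
    """Consistency (assumed): share of item pairs that two teams treat alike,
    i.e. both place the pair in one theme or both split it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def cohesiveness(vectors):
    """Cohesiveness (assumed): mean pairwise cosine similarity of the
    embedding vectors assigned to a single theme."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def correctness(assigned, reference):
    """Correctness (assumed): fraction of items whose assigned theme
    matches a reference label."""
    return sum(a == r for a, r in zip(assigned, reference)) / len(reference)
```

Each function expects at least two items (one for `correctness`) and raises `ZeroDivisionError` otherwise. A real study would also need to decide how to align theme names across teams before scoring.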
If this is right
- Collaboration mode can be treated as an experimental variable that measurably shifts the quality profile of themes produced by interactive systems.
- The proposed framework supplies a repeatable protocol for comparing additional tools or additional collaboration variables.
- Researchers can use the three metrics to diagnose whether a given collaboration setting improves or reduces outcome reliability.
- Tool interfaces may need to expose or log collaboration timing so that teams can monitor its influence on final theme sets.
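Treating collaboration mode as an experimental variable can be sketched as a small crossed design. The record layout, the mode labels "sync" and "async", and the tool names passed in are all hypothetical; the summary does not describe how the authors structured their data.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean


@dataclass(frozen=True)
class SessionResult:
    """One scored session (hypothetical record layout)."""
    tool: str
    mode: str  # "sync" or "async" (assumed labels)
    consistency: float
    cohesiveness: float
    correctness: float


def conditions(tools, modes=("sync", "async")):
    """Cross every tool with every collaboration mode."""
    return list(product(tools, modes))


def mean_by_mode(results, metric):
    """Average one metric per collaboration mode, pooling over tools."""
    by_mode = {}
    for result in results:
        by_mode.setdefault(result.mode, []).append(getattr(result, metric))
    return {mode: mean(values) for mode, values in by_mode.items()}
```

Pooling over tools hides tool-specific effects, so a fuller analysis would also group by `tool` before comparing modes.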
Where Pith is reading between the lines
- The framework could be extended to non-NLP qualitative tools to test whether the same collaboration-mode effects appear outside the NLP-assisted setting.
- If synchronous and asynchronous modes produce reliably different error patterns, future tool designs might include mode-specific prompts or review steps.
- Teams could run small pilot studies with the framework before committing to a collaboration schedule for a large qualitative project.
Load-bearing premise
The three chosen tools together with the three chosen metrics of consistency, cohesiveness, and correctness are representative enough to support general claims about collaboration effects in interactive theme discovery systems.
What would settle it
A replication study that applies the same framework to a fourth independent NLP-assisted tool and finds no measurable differences in consistency, cohesiveness, or correctness between synchronous and asynchronous conditions would undercut the generality of the reported effects.
Original abstract
NLP-assisted solutions to support qualitative data analysis have gained considerable traction. However, no unified evaluation framework exists which can account for the many different settings in which qualitative researchers may employ them. In this paper, we propose a framework to evaluate the way collaboration settings may produce different research outcomes across a variety of interactive systems. Specifically, we study the impact of synchronous vs. asynchronous collaboration using three different NLP-assisted qualitative research tools and present a comprehensive analysis of the differences in the consistency, cohesiveness, and correctness of their outcomes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a framework for evaluating how collaboration settings (synchronous vs. asynchronous) affect research outcomes across interactive NLP-assisted qualitative research tools for theme discovery. It applies the framework to three such tools and reports a comprehensive analysis of differences in consistency, cohesiveness, and correctness of the resulting themes.
Significance. If the empirical findings hold after addressing generalizability concerns, the work could provide a useful starting point for standardized evaluation of collaboration effects in interactive qualitative analysis systems, filling a noted gap in unified frameworks. The explicit focus on measurable outcome dimensions (consistency, cohesiveness, correctness) is a positive step toward falsifiable claims in this domain.
major comments (2)
- [Abstract] Abstract, paragraph 3: the central claim that the framework reveals collaboration effects 'across a variety of interactive systems' rests on only three specific NLP-assisted tools; no sampling justification, tool-class taxonomy, or sensitivity checks versus non-NLP baselines are described, so it remains possible that reported differences are idiosyncratic to the chosen implementations rather than attributable to synchronous/asynchronous settings.
- [Methods / Experimental Setup] The manuscript provides no visible participant counts, statistical tests, data exclusion rules, or inter-rater reliability measures for the consistency/cohesiveness/correctness metrics; without these, it is impossible to determine whether the reported differences between collaboration conditions are supported by the measurements or could be explained by small sample variance or task-specific confounds.
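The kind of check this comment asks for can be sketched with a two-sided permutation test on per-session metric scores. This is an illustration only, not the authors' analysis: the function name, the choice of test, and any scores passed to it are assumptions.

```python
import random
from statistics import mean


def permutation_test(sync_scores, async_scores, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns the observed difference and an estimated p-value: the share of
    random relabelings whose absolute mean difference is at least as large
    as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(sync_scores) - mean(async_scores)
    pooled = list(sync_scores) + list(async_scores)
    n_sync = len(sync_scores)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_sync]) - mean(pooled[n_sync:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_resamples + 1)
```

A permutation test makes few distributional assumptions, which suits small samples, but it cannot compensate for confounds such as task differences between conditions.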
minor comments (1)
- [Abstract] The abstract would benefit from naming the three tools and briefly indicating how the metrics are operationalized, to allow readers to assess scope immediately.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on generalizability and methodological transparency. We address each major comment below, indicating planned revisions where appropriate.
- Referee: [Abstract] Abstract, paragraph 3: the central claim that the framework reveals collaboration effects 'across a variety of interactive systems' rests on only three specific NLP-assisted tools; no sampling justification, tool-class taxonomy, or sensitivity checks versus non-NLP baselines are described, so it remains possible that reported differences are idiosyncratic to the chosen implementations rather than attributable to synchronous/asynchronous settings.
  Authors: We agree that the selection of three tools requires explicit justification to support the claim of applicability 'across a variety of interactive systems.' The tools were chosen to span distinct interaction styles (e.g., varying degrees of NLP automation and user steering), but we did not include a formal taxonomy or non-NLP baselines because the study scope centers on collaboration settings within existing NLP-assisted tools rather than a comprehensive system-class comparison. In revision we will add a Methods subsection detailing the selection criteria and diversity rationale, temper the abstract claim to reference the three studied systems, and add a limitations paragraph acknowledging the absence of non-NLP baselines and the need for future sensitivity checks. We do not believe a full taxonomy is required for the current contribution.
  revision: partial
- Referee: [Methods / Experimental Setup] The manuscript provides no visible participant counts, statistical tests, data exclusion rules, or inter-rater reliability measures for the consistency/cohesiveness/correctness metrics; without these, it is impossible to determine whether the reported differences between collaboration conditions are supported by the measurements or could be explained by small sample variance or task-specific confounds.
  Authors: We apologize that these details were not sufficiently prominent. The study collected data from a defined number of participants per condition, applied statistical tests to compare metrics across conditions, used predefined exclusion criteria for incomplete sessions, and assessed inter-rater reliability for the correctness metric. In the revised manuscript we will insert a dedicated 'Participants, Procedure, and Analysis' subsection that explicitly reports participant counts, the statistical tests employed (including p-values and effect sizes), exclusion rules, and reliability coefficients. This will make the evidential basis for the reported differences fully transparent.
  revision: yes
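As an illustration of the reliability coefficients mentioned in this response, the sketch below computes Cohen's kappa for two raters. Which coefficient the authors actually report is not stated here, so the choice of kappa and the function name are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Kappa rescales observed agreement by the agreement expected if each
    rater assigned labels independently at their own base rates.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)
```

Kappa can drop sharply when one label dominates, so a report would ideally state the label distribution alongside the coefficient.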
Circularity Check
Empirical user study with no derivation chain or self-referential inputs
The paper describes an empirical user study comparing synchronous vs. asynchronous collaboration across three specific NLP-assisted tools, measuring consistency, cohesiveness, and correctness. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The central claims rest on experimental observations rather than any quantity defined inside the paper by construction. This matches the default expectation of no significant circularity for non-mathematical empirical work.