Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora on Reddit

Alamjeet Singh; Ivan Beschastnikh; Leo Itsuki Foord-Kelcey; Patrick Yung Kang Lee; Paul Hendrik Bucci

arxiv: 2402.06124 · v3 · submitted 2024-02-09 · 💻 cs.HC

Pith reviewed 2026-05-24 04:17 UTC · model grok-4.3

keywords: thematic curation · qualitative inquiry · large text corpora · Reddit analysis · serendipitous discovery · search saturation · collaborative curation

The pith

Teleoscope is a web interface that lets qualitative researchers iteratively curate large text corpora like Reddit posts thematically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Teleoscope as a tool to scaffold iterative, interactive, and reflexive refinement of large text corpora through a process called thematic curation. It reports that deployments of the tool supported serendipitous discovery of new keywords, produced greater feelings of confidence that searches had reached saturation, and helped teams discuss alternative curation pathways. This approach is positioned as a way to keep researchers close to the data instead of relying on statistical subsampling. If the reported benefits hold, qualitative analysis of big corpora can remain methodologically coherent with interpretivist goals. The work focuses on making such curation feasible without losing the ability to explore data reflexively.

Core claim

Teleoscope is a web-based interface designed to scaffold iterative, interactive, and reflexive refinement of a large corpus, in a process called thematic curation. Across three deployments, Teleoscope supports serendipitous discovery of new keywords, results in greater feelings of confidence in search saturation, and aids collaborative discussion of alternative curation pathways. Teleoscope empowers researchers to stay close to the data in order to make qualitative workflows methodologically coherent with large text corpora.

What carries the argument

Teleoscope, a web-based interface that supports thematic curation through iterative refinement of large text corpora.
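
The iterative refinement described here can be illustrated with a toy loop. This is a minimal sketch, not Teleoscope's implementation: the bag-of-words cosine similarity, the function names `keyword_search` and `rank_by_selection`, and the four-document corpus are all assumptions made for illustration, since this summary does not say how the interface ranks documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_search(corpus, keyword):
    """Step 1: return indices of documents containing the keyword."""
    kw = keyword.lower()
    return [i for i, doc in enumerate(corpus) if kw in tokenize(doc)]


def rank_by_selection(corpus, selected):
    """Step 2: rank unselected documents by similarity to the selected ones.

    Hypothetical stand-in for whatever ranking a curation tool might use.
    """
    centroid = Counter()
    for i in selected:
        centroid.update(tokenize(corpus[i]))
    scored = [
        (cosine(centroid, Counter(tokenize(doc))), i)
        for i, doc in enumerate(corpus)
        if i not in selected
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > 0]


# Illustrative corpus (invented for this sketch, not taken from the paper).
corpus = [
    "my roommate never does the dishes",
    "roommate refuses to split chores and rent",
    "landlord raised the rent again",
    "best pasta recipe for dinner",
]

hits = keyword_search(corpus, "roommate")      # documents 0 and 1
ranked = rank_by_selection(corpus, {1})        # researcher keeps document 1
print(hits, ranked)
```

In this toy run, selecting the second document surfaces the landlord post, which never mentions "roommate" but shares the word "rent". That is a rough analogue of the serendipitous keyword discovery the paper reports, and the researcher could then search for "rent" and repeat the loop.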

If this is right

Researchers can analyze large corpora without reducing them to statistical subsamples.
Serendipitous keyword discovery becomes part of the curation process.
Teams gain ways to discuss and compare different curation pathways.
Qualitative workflows remain close to the original data throughout refinement.
Search saturation can be assessed with greater reported confidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The interface principles could apply to curation tasks on other large text sources beyond Reddit.
Thematic curation tools might reduce dependence on high-level statistical summaries for initial navigation.
Future systems could combine this interactive approach with automated suggestions while preserving reflexivity.

Load-bearing premise

Feedback from the three deployments reliably demonstrates the claimed benefits of the interface, even though no details on study design, participant selection, controls, or measures are provided.

What would settle it

A controlled comparison of Teleoscope users versus standard search methods that finds no increase in new keywords discovered or in reported confidence that search saturation has been reached.
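
One way such a comparison might be analyzed is with a one-sided permutation test on a per-participant outcome, for example the number of new keywords each participant discovers. The sketch below is an assumption about a possible analysis, not something the paper describes; the function name `permutation_test` and its parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_test(treatment, control, n_resamples=10_000, seed=0):
    """One-sided permutation test on the difference in group means.

    Returns (observed_difference, p_value), where the p-value estimates how
    often a random relabelling of participants yields a difference at least
    as large as the observed one (treatment minus control).
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_large = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (at_least_as_large + 1) / (n_resamples + 1)
```

Calling `permutation_test(teleoscope_counts, baseline_counts)` with real per-participant counts would return the observed difference and an estimated p-value. A large p-value would be the kind of null result that, per the statement above, would count against the claimed benefit.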

Figures

Figures reproduced from arXiv: 2402.06124 by Alamjeet Singh, Ivan Beschastnikh, Leo Itsuki Foord-Kelcey, Patrick Yung Kang Lee, Paul Hendrik Bucci.

**Figure 1.** A screenshot of the core Teleoscope workflow: starting from a keyword search, you choose documents to iteratively …

**Figure 2.** An image of the Teleoscope workspace. (1) Users start by performing a keyword search to explore documents; (2) …

**Figure 3.** Large corpora in the thousands to millions of documents are difficult to make sense of, but LLMs are making it …

**Figure 4.** During nucleation, ideas about the corpus are just starting to unfold and develop. Quick interaction is key to keeping …

**Figure 5.** Ideas start off ambiguous. As we wonder about our corpus, hunches, notions, and predictions emerge that we can test …

**Figure 6.** When developing a theme, some documents may be more or less illustrative of that theme and therefore more or …

**Figure 7.** As documents are explored, both a conceptual relevance and calculated metric relevance are revealed. Both our …

**Figure 8.** Here is an example of a theme that has been faceted to the point of saturation. Many different facets are presented on …

**Figure 9.** We discovered that we could use vectorized annotations to creatively summarize and search. We got a sense for …

**Figure 10.** After reading through many posts, we wanted to figure out how to differentiate our research goals from common …

**Figure 11.** Large printouts of the Teleoscope interface were …

**Figure 12.** An example of a workflow from Case Study 3, our long-term deployment. Pictures are actual screenshots of research artifacts from our participant research team’s data-gathering phase. Participants worked both individually and collaboratively on the Teleoscope interface, and collaboratively on Google Docs, Zoom, and in person. Due to the existing qualitative research culture in their research group, keyword …

Original abstract

Large text corpora, such as Reddit posts, have become an increasingly prevalent site of qualitative inquiry. However, most large text corpora are intractable for qualitative researchers. Instead, teams rely on statistical subsampling to reduce corpora to a manageable size for qualitative analysis. While previous work for navigating large corpora involves visualizing the dataset at the corpus-level using high-level statistical summaries, few systems offer the ability to curate data using an interpretivist approach. To address this, we developed Teleoscope, a web-based interface designed to scaffold iterative, interactive, and reflexive refinement of a large corpus, in a process we call thematic curation. Across three deployments, we learned that Teleoscope supports serendipitous discovery of new keywords, results in greater feelings of confidence in search saturation, and aids collaborative discussion of alternative curation pathways. Teleoscope empowers researchers to stay "close to the data" in order to make qualitative workflows methodologically coherent with large text corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Teleoscope gives a workable interface for iterative thematic curation on Reddit data, but the reported benefits rest on three deployments with no visible details on design or measures.

The paper's core contribution is Teleoscope, a web interface that supports iterative, reflexive curation of large Reddit corpora by letting users refine themes while staying close to individual posts. It contrasts this with prior corpus-level visualization tools that rely on statistical summaries and positions the new process as thematic curation suited to interpretivist qualitative work. That framing and the system description are the main new elements here.

Referee Report

1 major / 0 minor

Summary. The paper presents Teleoscope, a web-based interface for thematic curation of large text corpora (e.g., Reddit posts). It contrasts this interpretivist, iterative refinement process with statistical subsampling and corpus-level visualizations. The central claim is that across three deployments, Teleoscope enables serendipitous keyword discovery, increases researcher confidence in search saturation, and supports collaborative discussion of curation pathways, thereby allowing qualitative workflows to remain close to the data.

Significance. If the deployment findings are supported by rigorous evidence, the work would contribute a concrete system and process for interpretivist curation of intractable corpora, addressing a recognized gap between statistical tools and qualitative analysis needs in HCI. The paper explicitly demonstrates an interface that scaffolds reflexive refinement rather than high-level summaries.

major comments (1)

[Deployments / Evaluation] The central claims rest on outcomes from three deployments, yet the manuscript provides no description of study design, participant selection, task instructions, comparison conditions, observation coding, or quantitative/qualitative measures used to assess serendipitous discovery, confidence, or collaborative discussion. This is load-bearing because the reported benefits cannot be distinguished from expectation effects or selection bias without these details.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for identifying this important gap in the presentation of our deployment findings. We address the concern directly below and will revise the manuscript accordingly.

Referee: The central claims rest on outcomes from three deployments, yet the manuscript provides no description of study design, participant selection, task instructions, comparison conditions, observation coding, or quantitative/qualitative measures used to assess serendipitous discovery, confidence, or collaborative discussion. This is load-bearing because the reported benefits cannot be distinguished from expectation effects or selection bias without these details.

Authors: We agree that the manuscript currently lacks a dedicated subsection describing the deployment protocols. The three deployments were real-world applications of Teleoscope within ongoing qualitative research projects (one solo, two collaborative) rather than a formal controlled study. In the revision we will add a 'Deployment Contexts' section that details: (1) how each deployment was initiated and the corpus characteristics, (2) the researchers involved and their selection, (3) the iterative workflow observed in each case, and (4) the specific observations and artifacts (e.g., saved schemas, discussion notes) that grounded the reported outcomes on keyword discovery, saturation confidence, and collaborative discussion. We will also explicitly qualify that these are illustrative case observations rather than experimentally controlled measures, thereby clarifying the evidential basis and potential biases for readers.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims from deployments have no mathematical derivations or self-referential reductions.

The paper presents a system description and reports outcomes from three deployments as the basis for its claims about serendipitous discovery, confidence in saturation, and collaborative discussion. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central claims are framed as observations from user experiences rather than derivations that reduce to inputs by construction. Self-citations, if present elsewhere, are not load-bearing for any derivation chain. This is a standard non-circular empirical HCI paper whose validity rests on study details (not supplied in the abstract) rather than definitional or self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the assumption that interpretivist curation is preferable to statistical subsampling for qualitative coherence and that deployment feedback accurately captures interface benefits.

axioms (1)

Domain assumption: Qualitative researchers benefit from staying close to the data in large corpora rather than relying solely on statistical summaries.
Stated directly in the abstract as the motivation for the interpretivist approach.

invented entities (1)

Teleoscope · no independent evidence
purpose: Web-based interface to scaffold thematic curation
New system introduced by the authors with no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5711 in / 1166 out tokens · 25843 ms · 2026-05-24T04:17:06.144232+00:00 · methodology


[1] [1]

[n. d.]. https://openai.com/blog/chatgpt

work page

[2] [2]

Eric Alexander, Joe Kohlmann, Robin Valenza, Michael Witmore, and Michael Gleicher. 2014. Serendip: Topic model-driven visual exploration of text corpora. In 2014 IEEE Conference on Visual Analytics Science and Technology (V AST). 173–

work page 2014

[3] [3]

https://doi.org/10.1109/VAST.2014.7042493

work page doi:10.1109/vast.2014.7042493 2014

[4] [4]

Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Ben- netot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. In- formation fusion 58 (2020), 82–115

work page 2020

[5] [5]

Deepak Suresh Asudani, Naresh Kumar Nagwani, and Pradeep Singh. 2023. Im- pact of word embedding models on text analytics in deep learning environment: a review. Artificial Intelligence Review 56 (2023), 1–81

work page 2023

[6] [6]

Rajiv Badi, Soonil Bae, J Michael Moore, Konstantinos Meintanis, Anna Zacchi, Haowei Hsieh, Frank Shipman, and Catherine C Marshall. 2006. Recognizing user interest and document value from reading and organizing activities in document triage. In Proceedings of the 11th international conference on Intelligent user interfaces. 218–225

work page 2006

[7] [7]

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media , Vol. 14. 830–839

work page 2020

[8] [8]

Charles Berret and Tamara Munzner. 2022. Iceberg Sensemaking: A Process Model for Critical Data Analysis and Visualization. arxiv.org (4 2022)

work page 2022

[9] [9]

Berret and T

C. Berret and T. Munzner. 2022. Iceberg Sensemaking: A Process Model for Critical Data Analysis and Visualization. arXiv preprint arXiv:2204.00000 (2022)

work page arXiv 2022

[10] [10]

Christian Bors, Theresia Gschwandtner, and Silvia Miksch. 2019. Capturing and visualizing provenance from data wrangling. IEEE computer graphics and applications 39, 6 (2019), 61–75. Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora

work page 2019

[11] [11]

Virginia Braun and Victoria Clarke. 2012. Thematic analysis. American Psycho- logical Association

work page 2012

[12] [12]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

work page 2020

[13] [13]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2023. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2309.07597 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

N.C. Chen, M. Drouhard, R. Kocielnik, J. Suh, and C.R. Aragon. 2018. Using machine learning to support qualitative coding in social science: Shifting the focus to ambiguity. ACM Transactions on Interactive Intelligent Systems (TiiS) 8, 2 (2018), 1–20

work page 2018

[15] [15]

Nan-Chen Chen, Margaret Drouhard, Rafal Kocielnik, Jina Suh, and Cecilia R Aragon. 2018. Using Machine Learning to Support Qualitative Coding in Social Science. ACM Transactions on Interactive Intelligent Systems 8 (6 2018), 1–20. Issue 2. https://doi.org/10.1145/3185515

[16] [16]

Jaegul Choo, Changhyun Lee, Chandan K Reddy, and Haesun Park. 2013. Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE transactions on visualization and computer graphics 19, 12 (2013), 1992–2001

[17] [17]

Zach Cutler, Kiran Gadhave, and Alexander Lex. 2020. Trrack: A Library for Provenance-Tracking in Web-Based Visualizations. In IEEE Visualization Conference (VIS). 116–120. https://doi.org/10.1109/VIS47514.2020.00030

[18] [19]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]

[19] [20]

Mennatallah El-Assady, Rebecca Kehlbeck, Christopher Collins, Daniel Keim, and Oliver Deussen. 2019. Semantic concept spaces: Guided topic model refinement using word-embedding projections. IEEE transactions on visualization and computer graphics 26, 1 (2019), 1001–1011

[20] [21]

Laura L Ellingson. 2009. Engaging crystallization in qualitative research: An introduction. Sage

[21] [22]

Anna Fariha and Alexandra Meliou. 2019. Example-driven query intent discovery: Abductive reasoning using semantic similarity. arXiv preprint arXiv:1906.10322 (2019)

[22] [23]

Samah Gad, Waqas Javed, Sohaib Ghani, Niklas Elmqvist, Tom Ewing, Keith N Hampton, and Naren Ramakrishnan. 2015. ThemeDelta: Dynamic segmentations over temporal topic models. IEEE transactions on visualization and computer graphics 21, 5 (2015), 672–685

[23] [24]

Jie Gao, Yuchen Guo, Gionnieve Lim, Tianqin Zhang, Zheng Zhang, Toby Jia-Jun Li, and Simon Tangi Perrault. 2024. CollabCoder: A Lower-barrier, Rigorous Workflow for Inductive Collaborative Qualitative Analysis with Large Language Models. arXiv:2304.07366 [cs.HC]

[24] [25]

Greg Guest, Emily Namey, and Mario Chen. 2020. A simple method to assess and report thematic saturation in qualitative research. PloS one 15, 5 (2020), e0232076

[25] [26]

Marti A Hearst and Duane Degler. 2013. Sewing the seams of sensemaking: A practical interface for tagging and organizing saved search results. In Proceedings of the symposium on human-computer interaction and information retrieval. 1–10

[26] [27]

Monique M Hennink, Bonnie N Kaiser, and Vincent C Marconi. 2017. Code saturation versus meaning saturation: how many interviews are enough? Qualitative health research 27, 4 (2017), 591–608

[27] [28]

Matt-Heun Hong, Lauren A Marsh, Jessica L Feuston, Janet Ruppert, Jed R Brubaker, and Danielle Albers Szafir. 2022. Scholastic: Graphical Human-AI Collaboration for Inductive and Interpretive Text Analysis. The 35th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3526113.3545681

[28] [29]

Hannah Kim, Dongjin Choi, Barry Drake, Alex Endert, and Haesun Park. 2019. TopicSifter: Interactive search space reduction through targeted topic modeling. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST). IEEE, Vancouver, Canada, 35–45

[29] [30]

Hannah Kim, Barry Drake, Alex Endert, and Haesun Park. 2020. Architext: Interactive hierarchical topic modeling. IEEE transactions on visualization and computer graphics 27, 9 (2020), 3644–3655

[30] [31]

Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. 2024. MEGAnno+: A Human-LLM Collaborative Annotation System. arXiv:2402.18050 [cs.CL]

[31] [32]

Kori A LaDonna, Anthony R Artino Jr, and Dorene F Balmer. 2021. Beyond the guise of saturation: rigor and qualitative interview data. 607–611

[32] [33]

Ching-Hung Lee, Chien-Liang Liu, Amy JC Trappey, John PT Mo, and Kevin C Desouza. 2021. Understanding digital transformation in advanced manufacturing and engineering: A bibliometric analysis, topic modeling and research trend discovery. Advanced Engineering Informatics 50 (2021), 101428

[33] [34]

Yuan Li, Anita Crescenzi, Austin R Ward, and Rob Capra. 2023. Thinking inside the box: An evaluation of a novel search-assisting tool for supporting (meta) cognition during exploratory search. Journal of the Association for Information Science and Technology (2023)

[34] [35]

Matteo Lissandrini, Davide Mottin, Themis Palpanas, Yannis Velegrakis, and HV Jagadish. 2019. Data Exploration Using Example-Based Methods. Springer

[35] [36]

Michael Xieyang Liu, Tongshuang Wu, Tianying Chen, Franklin Mingzhe Li, Aniket Kittur, and Brad A Myers. 2023. Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from Large Language Models. arXiv preprint arXiv:2310.02161 (2023)

[36] [37]

Kirsti Malterud, Volkert Dirk Siersma, and Ann Dorrit Guassora. 2016. Sample size in qualitative interview studies: guided by information power. Qualitative health research 26, 13 (2016), 1753–1760

[37] [38]

Sara Mannheimer. 2021. Data curation implications of qualitative data reuse and big social research. Journal of eScience Librarianship 10, 4 (2021)

[38] [39]

Denis Mayr Lima Martins. 2019. Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities. Information Systems 83 (2019), 89–100

[39] [40]

Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2, 11 (2017), 205

[40] [41]

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. 2018. UMAP: Uniform Manifold Approximation and Projection. The Journal of Open Source Software 3, 29 (2018), 861

[41] [42]

Christofer Meinecke, David Joseph Wrisley, and Stefan Jänicke. 2021. Explaining semi-supervised text alignment through visualization. IEEE Transactions on Visualization and Computer Graphics 28, 12 (2021), 4797–4809

[42] [43]

Albine Moser and Irene Korstjens. 2017. Series: Practical guidance to qualitative research. Part 1: Introduction. European Journal of General Practice 23 (10 2017), 271–273. Issue 1. https://doi.org/10.1080/13814788.2017.1375093

[43] [44]

Tamara Munzner. 2014. Visualization analysis and design. CRC press

[44] [45]

Emily Namey, Greg Guest, Lucy Thairu, and Laura Johnson. 2008. Data reduction techniques for large qualitative data sets. In Handbook for Team-Based Qualitative Research. 137–162

[45] [46]

Jakob Nielsen. [n. d.]. 10 usability heuristics for user interface design. https://www.nngroup.com/articles/ten-usability-heuristics/

[46] [47]

Jakob Nielsen. 1992. Finding Usability Problems through Heuristic Evaluation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 373–380. https://doi.org/10.1145/142750.142834

[47] [48]

Jakob Nielsen and Rolf Molich. 1990. Heuristic Evaluation of User Interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 249–256. https://doi.org/10.1145/97243.97281

[48] [49]

Sergey I Nikolenko, Sergei Koltcov, and Olessia Koltsova. 2017. Topic modelling for qualitative studies. Journal of Information Science 43, 1 (2017), 88–102

[49] [50]

Lorelli S Nowell, Jill M Norris, Deborah E White, and Nancy J Moules. 2017. Thematic analysis: Striving to meet the trustworthiness criteria. International journal of qualitative methods 16, 1 (2017), 1609406917733847

[50] [51]

OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

[51] [52]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

[52] [53]

Reddit.com. 2024. Am I the Asshole? https://www.reddit.com/r/AmItheAsshole/

[53] [54]

Radim Rehurek and Petr Sojka. 2011. Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic 3, 2 (2011)

[54] [55]

Tim Rietz and Alexander Maedche. 2021. Cody: An AI-Based System to Semi-Automate Coding for Qualitative Research. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3411764.3445591

[55] [56]

Matthias Rüdiger, David Antons, Amol M Joshi, and Torsten-Oliver Salge. 2022. Topic modeling revisited: New evidence on algorithm performance and quality metrics. Plos one 17, 4 (2022), e0266325

[56] [57]

Favourate Y Sebele-Mpofu. 2020. Saturation controversy in qualitative research: Complexities and underlying assumptions. A literature review. Cogent Social Sciences 6, 1 (2020), 1838706

[57] [58]

Claudio T Silva, Juliana Freire, and Steven P Callahan. 2007. Provenance for visualizations: Reproducibility and beyond. Computing in Science & Engineering 9, 5 (2007), 82–89

[58] [59]

Fabian Sperrle, Mennatallah El-Assady, Grace Guo, Rita Borgo, D Horng Chau, Alex Endert, and Daniel Keim. 2021. A Survey of Human-Centered Evaluations in Human-Centered Machine Learning. In Computer Graphics Forum, Vol. 40.3. Wiley Online Library, 543–568

[59] [60]

Teleoscope.ca. 2024. Teleoscope. https://teleoscope.ca

[60] [61]

Teleoscope.ca. 2024. Teleoscope GitHub. https://github.com/Teleoscope/Teleoscope

[61] [62]

Silvia Terragni, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. OCTIS: Comparing and Optimizing Topic models is Simple!. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Li...

[62] [63]

G.A. Tobin and C.M. Begley. 2004. Methodological rigour within a qualitative framework. Journal of Advanced Nursing 48, 4 (2004), 388–396

[63] [64]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[64] [65]

T Wolf. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

[65] [66]

Kai Xu, Alvitta Ottley, Conny Walchshofer, Marc Streit, Remco Chang, and John Wenskovitch. 2020. Survey on the analysis of user interactions and visualization provenance. In Computer Graphics Forum, Vol. 39. Wiley Online Library, 757–783

[66] [67]

Jun Yuan, Changjian Chen, Weikai Yang, Mengchen Liu, Jiazhi Xia, and Shixia Liu. 2021. A survey of visual analytics techniques for machine learning. Computational Visual Media 7 (2021), 3–36
