Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora on Reddit
Pith reviewed 2026-05-24 04:17 UTC · model grok-4.3
The pith
Teleoscope is a web interface that lets qualitative researchers iteratively and thematically curate large text corpora such as Reddit posts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Teleoscope is a web-based interface designed to scaffold iterative, interactive, and reflexive refinement of a large corpus, in a process called thematic curation. Across three deployments, Teleoscope supports serendipitous discovery of new keywords, results in greater feelings of confidence in search saturation, and aids collaborative discussion of alternative curation pathways. Teleoscope empowers researchers to stay close to the data in order to make qualitative workflows methodologically coherent with large text corpora.
What carries the argument
Teleoscope, a web-based interface that supports thematic curation through iterative refinement of large text corpora.
If this is right
- Researchers can analyze large corpora without reducing them to statistical subsamples.
- Serendipitous keyword discovery becomes part of the curation process.
- Teams gain ways to discuss and compare different curation pathways.
- Qualitative workflows remain close to the original data throughout refinement.
- Search saturation can be assessed with greater reported confidence.
Where Pith is reading between the lines
- The interface principles could apply to curation tasks on other large text sources beyond Reddit.
- Thematic curation tools might reduce dependence on high-level statistical summaries for initial navigation.
- Future systems could combine this interactive approach with automated suggestions while preserving reflexivity.
Load-bearing premise
Feedback from the three deployments reliably demonstrates the claimed benefits of the interface, even though no details on study design, participant selection, controls, or measures are provided.
What would settle it
A controlled comparison of Teleoscope users versus standard search methods that finds no increase in new keywords discovered or in reported confidence that search saturation has been reached.
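One way such a comparison could be analyzed is with a permutation test on a per-participant outcome, for example the number of new keywords each participant discovers. The sketch below is illustrative only: the function name `permutation_test`, the outcome measure, and the placeholder numbers are assumptions made for this example and do not come from the paper or its deployments.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """One-sided permutation test: is mean(group_a) greater than mean(group_b)?

    Returns the observed mean difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_resamples):
        # Randomly reassign participants to the two conditions.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value


# Placeholder counts, invented purely to exercise the function.
# They are NOT results from the paper.
teleoscope_new_keywords = [5, 7, 6, 8, 4, 9]
baseline_new_keywords = [4, 6, 5, 7, 5, 6]

observed, p_value = permutation_test(teleoscope_new_keywords, baseline_new_keywords)
print(f"mean difference = {observed:.2f}, one-sided p = {p_value:.3f}")
```

A large p-value on real data would be consistent with the "no increase" outcome described above. The same function could be applied to self-reported saturation confidence scores, assuming those are collected on a numeric scale.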
Original abstract
Large text corpora, such as Reddit posts, have become an increasingly prevalent site of qualitative inquiry. However, most large text corpora are intractable for qualitative researchers. Instead, teams rely on statistical subsampling to reduce corpora to a manageable size for qualitative analysis. While previous work for navigating large corpora involves visualizing the dataset at the corpus-level using high-level statistical summaries, few systems offer the ability to curate data using an interpretivist approach. To address this, we developed Teleoscope, a web-based interface designed to scaffold iterative, interactive, and reflexive refinement of a large corpus, in a process we call thematic curation. Across three deployments, we learned that Teleoscope supports serendipitous discovery of new keywords, results in greater feelings of confidence in search saturation, and aids collaborative discussion of alternative curation pathways. Teleoscope empowers researchers to stay "close to the data" in order to make qualitative workflows methodologically coherent with large text corpora.
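The subsampling baseline that the abstract contrasts with curation can be sketched in a few lines. This is a minimal illustration, not Teleoscope's implementation: the helper names `subsample` and `keyword_filter` and the placeholder corpus are assumptions made for this example. Uniform random subsampling shrinks the corpus without regard to content, whereas a keyword filter keeps every matching document and can be re-run as the keyword list changes.

```python
import random


def subsample(posts, k, seed=0):
    """Return k posts drawn uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(posts, k)


def keyword_filter(posts, keywords):
    """Return every post containing at least one keyword (case-insensitive)."""
    lowered = [kw.lower() for kw in keywords]
    return [post for post in posts if any(kw in post.lower() for kw in lowered)]


# Placeholder corpus, invented for illustration only.
posts = [f"placeholder post {i} about topic {i % 10}" for i in range(1000)]

random_sample = subsample(posts, k=50)            # content-blind reduction
curated = keyword_filter(posts, ["topic 3"])      # content-driven reduction

print(len(random_sample), len(curated))
```

The fixed seed makes the random sample reproducible, which matters when a team needs to discuss the same subset.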
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Teleoscope, a web-based interface for thematic curation of large text corpora (e.g., Reddit posts). It contrasts this interpretivist, iterative refinement process with statistical subsampling and corpus-level visualizations. The central claim is that across three deployments, Teleoscope enables serendipitous keyword discovery, increases researcher confidence in search saturation, and supports collaborative discussion of curation pathways, thereby allowing qualitative workflows to remain close to the data.
Significance. If the deployment findings are supported by rigorous evidence, the work would contribute a concrete system and process for interpretivist curation of intractable corpora, addressing a recognized gap between statistical tools and qualitative analysis needs in HCI. The paper explicitly demonstrates an interface that scaffolds reflexive refinement rather than presenting high-level summaries.
major comments (1)
- [Deployments / Evaluation] The central claims rest on outcomes from three deployments, yet the manuscript provides no description of study design, participant selection, task instructions, comparison conditions, observation coding, or quantitative/qualitative measures used to assess serendipitous discovery, confidence, or collaborative discussion. This is load-bearing because the reported benefits cannot be distinguished from expectation effects or selection bias without these details.
Simulated Author's Rebuttal
We thank the referee for identifying this important gap in the presentation of our deployment findings. We address the concern directly below and will revise the manuscript accordingly.
- Referee: The central claims rest on outcomes from three deployments, yet the manuscript provides no description of study design, participant selection, task instructions, comparison conditions, observation coding, or quantitative/qualitative measures used to assess serendipitous discovery, confidence, or collaborative discussion. This is load-bearing because the reported benefits cannot be distinguished from expectation effects or selection bias without these details.
  Authors: We agree that the manuscript currently lacks a dedicated subsection describing the deployment protocols. The three deployments were real-world applications of Teleoscope within ongoing qualitative research projects (one solo, two collaborative) rather than a formal controlled study. In the revision we will add a 'Deployment Contexts' section that details: (1) how each deployment was initiated and the corpus characteristics, (2) the researchers involved and their selection, (3) the iterative workflow observed in each case, and (4) the specific observations and artifacts (e.g., saved schemas, discussion notes) that grounded the reported outcomes on keyword discovery, saturation confidence, and collaborative discussion. We will also explicitly qualify that these are illustrative case observations rather than experimentally controlled measures, thereby clarifying the evidential basis and potential biases for readers.
  revision: yes
Circularity Check
No circularity: empirical claims from deployments have no mathematical derivations or self-referential reductions
The paper presents a system description and reports outcomes from three deployments as the basis for its claims about serendipitous discovery, confidence in saturation, and collaborative discussion. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central claims are framed as observations from user experiences rather than derivations that reduce to inputs by construction. Self-citations, if present elsewhere, are not load-bearing for any derivation chain. This is a standard non-circular empirical HCI paper whose validity rests on study details (not supplied in the abstract) rather than definitional or self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Qualitative researchers benefit from staying close to the data in large corpora rather than relying solely on statistical summaries.
invented entities (1)
- Teleoscope: no independent evidence