
arxiv: 2507.05385 · v5 · submitted 2025-07-07 · 💻 cs.CL

EduCoder: An Open-Source Annotation System for Education Transcript Data

Pith reviewed 2026-05-19 05:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords educational dialogue, annotation tool, codebook development, utterance-level coding, collaborative annotation, open-source system, pedagogical features, transcript analysis

The pith

EduCoder provides a platform for researchers to collaboratively build codebooks and calibrate annotations when coding educational dialogue transcripts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EduCoder to handle the specific demands of annotating teacher-student and peer interactions in education transcripts. It focuses on enabling teams to define complex codebooks drawn from actual data, while supporting both categorical labels and open-ended responses plus lesson context. Side-by-side views of different annotators' work are included to support calibration and improve consistency. A reader would care because accurate, reliable coding is essential for studying teaching practices, yet general annotation tools often lack the needed education-specific features for codebook creation and comparison.

Core claim

EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators' responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source.

What carries the argument

EduCoder, the annotation system that combines collaborative codebook definition from observed transcripts, mixed categorical and open-ended coding, contextual lesson materials, and side-by-side annotator comparison views.
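
To make the mix of annotation types concrete, the following is a minimal Python sketch of how a codebook holding both categorical and open-ended codes might be modeled. The class names (`CodeDefinition`, `Codebook`), the `validate` method, and the example codes are hypothetical illustrations chosen for this sketch; they are not taken from EduCoder's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class CodeDefinition:
    """One entry in a codebook (hypothetical schema, not EduCoder's)."""
    name: str
    description: str
    kind: str  # either "categorical" or "open_ended"
    options: list[str] = field(default_factory=list)  # used only by categorical codes


@dataclass
class Codebook:
    codes: dict[str, CodeDefinition] = field(default_factory=dict)

    def add(self, code: CodeDefinition) -> None:
        """Register a code after checking that its definition is usable."""
        if code.kind not in ("categorical", "open_ended"):
            raise ValueError(f"unknown code kind: {code.kind}")
        if code.kind == "categorical" and not code.options:
            raise ValueError("categorical codes need at least one option")
        self.codes[code.name] = code

    def validate(self, code_name: str, value: str) -> bool:
        """Return True if `value` is an acceptable annotation for `code_name`."""
        code = self.codes[code_name]
        if code.kind == "categorical":
            return value in code.options
        # Open-ended codes accept any non-empty free text.
        return bool(value.strip())


# Illustrative codes only; the labels are assumptions made for this example.
codebook = Codebook()
codebook.add(CodeDefinition(
    name="question_type",
    description="Type of teacher question",
    kind="categorical",
    options=["funneling", "focusing", "none"],
))
codebook.add(CodeDefinition(
    name="rationale",
    description="Annotator's free-text justification",
    kind="open_ended",
))

print(codebook.validate("question_type", "focusing"))
print(codebook.validate("question_type", "probing"))
print(codebook.validate("rationale", "Teacher narrows toward one answer"))
```

Keeping the code kind on the definition rather than on each annotation means a team could add or revise codes as the codebook evolves without touching the validation logic.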

If this is right

  • Teams can iteratively refine codebooks for pedagogical features directly from the transcripts being studied.
  • Annotators gain flexibility to apply both fixed categories and free-text descriptions to individual utterances.
  • Direct comparison of responses from multiple coders supports calibration and raises overall data reliability.
  • Open-source release allows education researchers to adopt, modify, and extend the system for their own projects.
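
The side-by-side comparison mentioned above can be pictured as aligning each annotator's labels by utterance and flagging the rows where they differ. The sketch below is one simple way to do that in Python; the function name `side_by_side`, the input layout, and the toy data are assumptions for illustration and do not describe how EduCoder implements its comparison view.

```python
def side_by_side(annotations: dict[str, dict[int, str]]) -> list[dict]:
    """Align labels from several annotators by utterance index.

    `annotations` maps annotator name -> {utterance index -> label}.
    Returns one row per utterance holding each annotator's label
    (None if that annotator skipped the utterance) and a flag that is
    True when the annotators' entries are not all identical.
    """
    annotators = sorted(annotations)
    utterance_ids = sorted({i for labels in annotations.values() for i in labels})
    rows = []
    for i in utterance_ids:
        labels = {a: annotations[a].get(i) for a in annotators}
        rows.append({
            "utterance": i,
            "labels": labels,
            "disagreement": len(set(labels.values())) > 1,
        })
    return rows


# Invented toy data: two annotators labelling three utterances.
example = {
    "annotator_a": {0: "focusing", 1: "funneling", 2: "none"},
    "annotator_b": {0: "focusing", 1: "focusing", 2: "none"},
}
rows = side_by_side(example)
for row in rows:
    marker = "<-- discuss" if row["disagreement"] else ""
    print(row["utterance"], row["labels"], marker)
```

In this sketch a missing label counts as a disagreement, which is a design choice: it surfaces utterances one coder skipped so the team can decide whether the omission was intentional.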

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption across studies could produce more comparable labeled datasets on classroom interactions.
  • The collaborative and comparison features might lower variability in qualitative education research more broadly.
  • The design could serve as a model for building specialized annotation systems in related fields like counseling or legal transcription.

Load-bearing premise

Existing general-purpose annotation tools do not sufficiently support the creation of complex pedagogical codebooks, mixed annotation types, contextualization, and annotator calibration for education dialogue data.

What would settle it

A direct comparison study that measures inter-annotator agreement scores, time per transcript, and reported ease of use when the same education dialogue data is coded with EduCoder versus a general-purpose tool.
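
One common inter-annotator agreement score that such a study could report is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The Python sketch below computes it for two annotators; the function name and the toy labels are assumptions for illustration, and the page does not state which agreement metric EduCoder itself uses.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined (0/0), so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy data: four utterances coded by two annotators.
coder_a = ["focusing", "focusing", "none", "none"]
coder_b = ["focusing", "none", "none", "none"]
kappa = cohens_kappa(coder_a, coder_b)
print(kappa)  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```

Running the same computation on labels produced with each tool would give one of the agreement scores the proposed comparison calls for; time per transcript and ease of use would need separate measurement.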

Figures

Figures reproduced from arXiv: 2507.05385 by Dorottya Demszky, Guanzhong Pan, Helen Higgins, Hyunji Nam, James Malamut, Liliana Deonizio, Lucía Langlois, Mei Tan, Saad Ashraf, Vishal Kumar.

Figure 1: Overview of EduCoder. EduCoder is designed to facilitate the collaborative annotation of educational dialogue data. It is a comprehensive, end-to-end platform for A. importing and preprocessing conversation transcripts, B. collaborative utterance-level annotation with customizable codebooks, and C. real-time inter-rater reliability (IRR) monitoring and cross-annotator comparison. This system aims to enhanc…
Figure 2: Screenshots of example tasks supported by EduCoder. Example 2a shows the codebook definition interface
Original abstract

We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts -- with diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson's purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators' responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces EduCoder, an open-source annotation system specialized for utterance-level coding of educational dialogue transcripts. It identifies four challenges with general-purpose tools (complex codebook definition, mixed categorical/open-ended types, contextual materials, and annotator calibration) and describes how the system supports collaborative codebook definition based on observed data, both annotation types, contextual support, and side-by-side annotator comparison for calibration.

Significance. If the described features operate as presented, EduCoder offers a domain-specific platform that could streamline annotation workflows for education researchers analyzing teacher-student and peer interactions. The open-source release and demo video constitute clear strengths that support reproducibility and community adoption.

minor comments (2)
  1. The motivation section would benefit from explicit citations or a brief comparison table covering two or three existing general-purpose tools (e.g., Doccano, Label Studio) to ground the claim that current options fall short on the listed challenges.
  2. Consider adding a short limitations or future-work paragraph discussing scalability for large corpora or integration with existing qualitative-analysis pipelines.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of EduCoder's domain-specific features for educational dialogue annotation and the value placed on its open-source release.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a descriptive system paper introducing an open-source annotation tool. It states challenges in education dialogue coding and describes how EduCoder's features (collaborative codebook definition, mixed annotation types, contextual support, annotator comparison) address them. There are no derivations, equations, fitted parameters, predictions, or load-bearing self-citations that reduce to internal definitions. The contribution rests on the feature description and the open-source release, both of which are externally verifiable and do not rely on any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software-tool paper. It introduces no free parameters, mathematical axioms, or postulated entities; the only assumptions are standard software-engineering premises such as the existence of a web browser and user willingness to adopt the interface.

pith-pipeline@v0.9.0 · 5714 in / 1060 out tokens · 98553 ms · 2026-05-19T05:32:38.977057+00:00 · methodology


