pith. machine review for the scientific record.

arxiv: 2603.08406 · v2 · submitted 2026-03-09 · 💻 cs.HC · cs.CL

Recognition: 1 theorem link · Lean Theorem

Sandpiper: Orchestrated AI-Annotation for Educational Discourse at Scale


Pith reviewed 2026-05-15 15:09 UTC · model grok-4.3

classification 💻 cs.HC cs.CL
keywords AI-assisted qualitative analysis · educational discourse · LLM orchestration · mixed-initiative systems · schema constraints · research efficiency · data privacy · inter-rater reliability

The pith

Sandpiper pairs researcher dashboards with agentic LLMs under schema constraints to scale qualitative coding of educational conversations while enforcing methodological rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Sandpiper as a mixed-initiative platform that links interactive dashboards to LLM agents for processing large volumes of digital educational discourse data. Traditional qualitative analysis creates a labor bottleneck that limits research scale; the system automates de-identification, annotation, and validation steps while routing outputs through schema constraints that tie directly to human-defined codebooks. An evaluations engine continuously compares AI labels against human ones to support iterative refinement. The authors argue this combination removes the trade-off between volume and rigor and propose a user study to measure gains in efficiency, inter-rater reliability, and researcher trust.
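
No code ships with the paper, but the de-identification step is easy to picture. A minimal sketch, assuming a regex-plus-roster pass (the patterns, roster, and placeholder tokens are illustrative assumptions, not Sandpiper's actual workflow):

```python
import re

# Hypothetical de-identification pass. The paper describes "context-aware,
# automated de-identification workflows" but publishes no code; this only
# sketches the general shape: pattern rules plus a roster-aware lookup.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ROSTER = {"Alice Chen", "Bob Rivera"}  # hypothetical course roster

def deidentify(utterance: str) -> str:
    utterance = EMAIL.sub("[EMAIL]", utterance)
    for name in ROSTER:  # a real system would use NER, not exact matching
        utterance = utterance.replace(name, "[STUDENT]")
    return utterance

print(deidentify("Alice Chen asked: email me at alice@school.edu"))
# -> "[STUDENT] asked: email me at [EMAIL]"
```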

Core claim

Sandpiper enables scalable analysis of educational discourse by tightly coupling interactive researcher dashboards with agentic LLM engines under schema-constrained orchestration that eliminates hallucinations and enforces strict adherence to qualitative codebooks, supported by context-aware automated de-identification, secure university-housed infrastructure, and an integrated evaluations engine for continuous benchmarking against human labels.

What carries the argument

Schema-constrained orchestration, which routes LLM generations through explicit rules that force outputs to match a researcher-supplied qualitative codebook.
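
The paper does not specify how the routing works; one common realization is to parse each generation as structured JSON and reject any label outside the codebook vocabulary. A minimal sketch, with a hypothetical four-code codebook:

```python
import json

# Hypothetical codebook; Sandpiper's actual schema format is not published.
CODEBOOK = {"elaboration", "question", "off-task", "feedback"}

def parse_and_validate(raw_llm_output: str) -> dict:
    """Accept a generation only if it is valid JSON whose 'code' field is
    one of the codebook labels; anything else counts as a hallucinated
    category and is rejected rather than written to the dataset."""
    record = json.loads(raw_llm_output)      # structural constraint
    if record.get("code") not in CODEBOOK:   # vocabulary constraint
        raise ValueError(f"hallucinated code: {record.get('code')!r}")
    return record

print(parse_and_validate('{"code": "question", "span": "Why does this work?"}'))
```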

If this is right

  • Qualitative researchers can process conversation volumes that previously required months of manual coding.
  • Continuous benchmarking against human labels supports ongoing model refinement without restarting from scratch.
  • Secure, university-hosted infrastructure and automated de-identification lower privacy barriers to using real student data.
  • The proposed user study will quantify changes in research efficiency and inter-rater reliability.
  • Iterative validation loops create a feedback path that increases researcher trust in the AI outputs over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same orchestration pattern could be tested on non-education domains that rely on qualitative coding of long transcripts, such as clinical interviews or customer support logs.
  • If the schema constraints prove robust, training programs for education researchers might shift toward teaching codebook design rather than exhaustive manual annotation.
  • Real-time integration with live learning platforms would let researchers monitor discourse patterns during a course rather than only after it ends.

Load-bearing premise

Schema-constrained orchestration will reliably eliminate LLM hallucinations and enforce strict adherence to qualitative codebooks when applied to real educational discourse data.

What would settle it

A side-by-side comparison on a held-out corpus of classroom transcripts; the claim fails if Sandpiper's automated codes contain hallucinated categories or fall below 80 percent agreement with independent human coders.
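
Agreement in such a comparison is usually reported as raw percent agreement or chance-corrected Cohen's kappa, the metric the simulated rebuttal below also names. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def cohens_kappa(human: list[str], ai: list[str]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from the label marginals."""
    n = len(human)
    p_o = sum(h == a for h, a in zip(human, ai)) / n
    h_counts, a_counts = Counter(human), Counter(ai)
    p_e = sum(h_counts[c] * a_counts[c] for c in h_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 3/4 raw agreement, ~0.667 after chance correction.
human = ["question", "feedback", "question", "off-task"]
ai = ["question", "feedback", "elaboration", "off-task"]
print(round(cohens_kappa(human, ai), 3))  # -> 0.667
```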

Figures

Figures reproduced from arXiv:2603.08406 by Bakhtawar Ahtisham, Daryl Hedley, Doug Pietrzak, Ian Burden, Jorge Dias, Josh Marland, Justin Reich, Kenneth Koedinger, Kirk Vanacore, Rachel Slama, René Kizilcec, Zhuqian Zhou.

Figure 1. The Sandpiper Pipeline. A central mixed-initiative loop tightly integrates the researcher dashboard and the schema-constrained orchestrator. The backend API and datastore are hosted within Cornell secure servers, while external model access is mediated through Cornell's AI gateway.
Figure 2. The Prompt Editor interface, illustrating version control for instructions and explicitly defined …
Figure 3. System mechanisms for ensuring inference output perfectly aligns with the target qualitative …
Figure 4. The Session Explorer interface detailing the Chat View and interconnected labeling panel.
Figure 5. The Evaluations Dashboard aggregating metrics for multiple Experimental Run-Sets.
read the original abstract

Digital educational environments are expanding toward complex AI and human discourse, providing researchers with an abundance of data that offers deep insights into learning and instructional processes. However, traditional qualitative analysis remains a labor-intensive bottleneck, severely limiting the scale at which this research can be conducted. We present Sandpiper, a mixed-initiative system designed to serve as a bridge between high-volume conversational data and human qualitative expertise. By tightly coupling interactive researcher dashboards with agentic Large Language Model (LLM) engines, the platform enables scalable analysis without sacrificing methodological rigor. Sandpiper addresses critical barriers to AI adoption in education by implementing context-aware, automated de-identification workflows supported by secure, university-housed infrastructure to ensure data privacy. Furthermore, the system employs schema-constrained orchestration to eliminate LLM hallucinations and enforces strict adherence to qualitative codebooks. An integrated evaluations engine allows for the continuous benchmarking of AI performance against human labels, fostering an iterative approach to model refinement and validation. We propose a user study to evaluate the system's efficacy in improving research efficiency, inter-rater reliability, and researcher trust in AI-assisted qualitative workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Sandpiper, a mixed-initiative system that couples interactive researcher dashboards with agentic LLM engines for scalable qualitative annotation of educational discourse. It claims to address labor bottlenecks via schema-constrained orchestration that eliminates hallucinations and enforces codebook adherence, includes privacy-preserving de-identification on university infrastructure, and proposes a future user study to measure gains in efficiency, inter-rater reliability, and researcher trust.

Significance. If the orchestration mechanism and validation components function as described, the platform could meaningfully expand the feasible scale of rigorous qualitative analysis in education research by reducing manual coding effort while retaining human oversight and iterative benchmarking against human labels.

major comments (2)
  1. [System Design and Schema-Constrained Orchestration] The central claim that schema-constrained orchestration eliminates LLM hallucinations and enforces strict qualitative codebook adherence (stated in the abstract and system description) is load-bearing for the contribution, yet the paper supplies no technical specification of the constraint method (prompt engineering, output parsing, constrained decoding, or validation layer) and reports no empirical adherence or hallucination-rate measurements on actual educational discourse data.
  2. [Evaluations Engine] The manuscript asserts that the integrated evaluations engine enables continuous benchmarking of AI performance against human labels, but provides no details on the engine's implementation, the specific metrics used, or how discrepancies are fed back into model refinement.
minor comments (1)
  1. [Future Work] The proposed user study is mentioned only at a high level; adding even a brief outline of planned measures, sample size, and comparison conditions would strengthen the manuscript without requiring new data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below and will revise the paper to provide the requested technical clarifications and implementation details.

read point-by-point responses
  1. Referee: [System Design and Schema-Constrained Orchestration] The central claim that schema-constrained orchestration eliminates LLM hallucinations and enforces strict qualitative codebook adherence (stated in the abstract and system description) is load-bearing for the contribution, yet the paper supplies no technical specification of the constraint method (prompt engineering, output parsing, constrained decoding, or validation layer) and reports no empirical adherence or hallucination-rate measurements on actual educational discourse data.

    Authors: We agree that the manuscript requires a more explicit technical specification of the schema-constrained orchestration to support this central claim. In the revised version we will add a dedicated subsection describing the implementation: the system uses a combination of structured prompt engineering with explicit codebook examples, enforced JSON schema output parsing, and a post-generation validation layer that rejects or corrects outputs violating codebook rules. This layered approach is intended to constrain the generation space and reduce hallucinations. Regarding empirical measurements, the original submission focuses on system design and proposes a future user study; we do not currently include hallucination-rate data on educational discourse. We will incorporate preliminary internal validation results on sample discourse transcripts in the revision to provide initial evidence of adherence rates (a sketch of this layered validation follows these responses). revision: yes

  2. Referee: [Evaluations Engine] The manuscript asserts that the integrated evaluations engine enables continuous benchmarking of AI performance against human labels, but provides no details on the engine's implementation, the specific metrics used, or how discrepancies are fed back into model refinement.

    Authors: We acknowledge that the current description of the evaluations engine is insufficiently detailed. In the revised manuscript we will expand this section to specify the engine's modular implementation, which computes standard metrics including Cohen's kappa for inter-rater agreement, precision/recall for code adherence, and a hallucination flag based on output validation failures. Discrepancies between AI-generated and human labels are automatically logged in the dashboard and trigger a feedback loop: they are used either to refine prompts in subsequent runs or to queue additional human review for model improvement. This will clarify the continuous benchmarking and refinement process. revision: yes
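
The layered reject-or-correct design described in response 1 can be sketched concretely. This is one editorial reading of that description, not the authors' code; `call_llm`, the retry budget, and the codebook are hypothetical stand-ins:

```python
import json

CODEBOOK = {"elaboration", "question", "off-task", "feedback"}  # hypothetical
MAX_RETRIES = 3

def call_llm(prompt: str) -> str:
    # Stand-in for the real model call routed through the university AI gateway.
    return '{"code": "feedback", "span": "Nice proof!"}'

def annotate_with_validation(prompt: str) -> dict:
    """Reject-and-retry loop: re-prompt whenever an output fails the JSON
    or codebook check, up to a fixed retry budget."""
    last_error = None
    for _ in range(MAX_RETRIES):
        raw = call_llm(prompt)
        try:
            record = json.loads(raw)                # structural check
            if record.get("code") in CODEBOOK:      # vocabulary check
                return record
            last_error = ValueError(f"unknown code {record.get('code')!r}")
        except json.JSONDecodeError as err:
            last_error = err
        prompt += f"\nPrevious output was invalid ({last_error}); emit valid JSON."
    raise RuntimeError(f"no valid annotation after {MAX_RETRIES} tries: {last_error}")

print(annotate_with_validation("Label the next utterance."))
```

In a real deployment, each rejection would presumably be logged so the evaluations engine can feed failures back into prompt refinement or human review.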

Circularity Check

0 steps flagged

No circularity: system proposal with no derivations or self-referential predictions

full rationale

The manuscript is a descriptive system proposal for the Sandpiper platform. It asserts capabilities such as schema-constrained orchestration eliminating hallucinations but supplies no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. No derivation chain exists that could reduce to its own inputs by construction. The work remains self-contained as an engineering description without mathematical or predictive circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on unproven assumptions about LLM controllability and the practical effectiveness of the proposed orchestration; no free parameters or external benchmarks are supplied.

axioms (1)
  • [ad hoc to paper] Schema-constrained orchestration can eliminate LLM hallucinations and enforce strict codebook adherence in qualitative coding of educational discourse.
    Invoked when describing how the system maintains methodological rigor.
invented entities (1)
  • Sandpiper platform [no independent evidence]
    purpose: To serve as a bridge between high-volume conversational data and human qualitative expertise
    The system is introduced in this paper without prior independent validation or external evidence.

pith-pipeline@v0.9.0 · 5531 in / 1204 out tokens · 52951 ms · 2026-05-15T15:09:48.616135+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Million Tutoring Moves (MTM): An Open Multimodal Dataset for the Science of Tutoring

    cs.CY · 2026-04 · accept · novelty 6.0

    MTM v1 releases 4,654 open math tutoring transcripts as the first step toward a large-scale multimodal repository for studying and improving tutoring.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1] A. Barany, X. Liu, J. Zhang, M. Pankiewicz, and R. S. Baker. ChatGPT for education research: Exploring the potential of large language models for qualitative codebook development. In A. M. Olney, I.-A. Chounta, Z. Liu, O. C. Santos, and I. I. Bittencourt, editors, Artificial Intelligence in Education (AIED 2024), pages 134–149. Springer Nature Switzerland, 2024.

  2. [2] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

  3. [3] J. Geathers, Y. Hicke, C. E. Chan, N. Rajashekar, S. Young, J. Sewell, S. Cornes, R. F. Kizilcec, and D. Shung. Benchmarking generative AI for scoring medical student interviews in objective structured clinical examinations (OSCEs). In Artificial Intelligence in Education (AIED 2025), volume 15879 of Lecture Notes in Computer Science, pages 231–245. Springer, 2025.

  4. [4] Z. He, S. Naphade, and T.-H. K. Huang. Prompting in the dark: Assessing human performance in prompt engineering for data labeling when gold labels are absent. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–33. Association for Computing Machinery, 2025.

  5. [5] Eric Horvitz. Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 159–166, 1999.

  6. [6] Yukta Karkera, Barsa Tandukar, Sowmya Chandra, and Aqueasha Martin-Hammond. Building community capacity: Exploring voice assistants to support older adults in an independent living community. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23, New York, NY, USA, 2023. Association for Computing Machinery.

  7. [7] Udo Kuckartz. Qualitative Text Analysis: A Guide to Methods, Practice and Using Software. SAGE Publications, 2014.

  8. [8] X. Liu et al. Qualitative coding with GPT-4: Where it works better. Journal of Learning Analytics, 12(1):169–185, 2025.

  9. [9] X. Liu, J. Zhang, A. Barany, M. Pankiewicz, and R. S. Baker. Assessing the potential and limits of large language models in qualitative coding. In Y. J. Kim and Z. Swiecki, editors, Advances in Quantitative Ethnography, pages 89–103. Springer Nature Switzerland, 2024.

  10. [10] Y. Long, H. Luo, and Y. Zhang. Evaluating large language models in analysing classroom dialogue. npj Science of Learning, 9(1):60, 2024.

  11. [11] O. Nahum, N. Calderon, O. Keller, I. Szpektor, and R. Reichart. Are LLMs better than reported? Detecting label errors and mitigating their effect on model performance. arXiv preprint, 2024.

  12. [12] Laura K. Nelson. Computational grounded theory: A methodological framework. Sociological Methods & Research, 49(1):3–42, 2020.

  13. [13] Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling software, 2020.

  14. [14] Ziang Xiao, Xingdi Yuan, Q. Vera Liao, Rania Abdelghani, and Pierre-Yves Oudeyer. Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI '23 Companion, pages 75–78, New York, NY, USA, 2023. Association for Computing Machinery.