Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Pith reviewed 2026-05-15 21:12 UTC · model grok-4.3
The pith
Domain-aware LLM prompting detects PII in math tutoring dialogues while preserving instructional numbers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generic PII detection systems over-redact numeric expressions in math tutoring dialogues due to ambiguity with structured identifiers, but domain-aware prompting strategies for LLMs, including math-aware and segment-aware variants, substantially improve detection accuracy on the new MathEd-PII benchmark dataset while reducing numeric false positives and thereby preserving educational utility.
What carries the argument
Density-based segmentation to locate math-dense regions, paired with math-aware and segment-aware prompting of LLMs to distinguish instructional numbers from PII.
If this is right
- Generic PII detectors are inadequate for domain-specific educational dialogues because they cannot resolve numeric ambiguity.
- Domain-aware methods enable larger-scale sharing of de-identified math tutoring data without destroying core instructional content.
- Segment-aware prompting delivers the highest accuracy by incorporating local density information during detection.
- Human-in-the-loop annotation offers a practical route to building reliable domain-specific PII benchmarks.
Where Pith is reading between the lines
- The same numeric ambiguity problem likely appears in other quantitative tutoring domains such as physics or chemistry dialogues.
- Direct integration of mathematical expression parsers into detection pipelines could reduce false positives beyond what prompting alone achieves.
- These techniques could support standardized privacy practices for releasing learning analytics datasets from schools and platforms.
- Testing the prompting strategies on real-time live tutoring sessions rather than transcripts would reveal additional practical constraints.
Load-bearing premise
The human-in-the-loop LLM annotation process yields reliable ground-truth PII labels that generalize to unseen math tutoring dialogues without systematic bias in numeric patterns.
What would settle it
Re-annotating a fresh held-out collection of math tutoring transcripts with independent human reviewers and measuring whether segment-aware prompting still achieves F1 above 0.8 with low numeric false positives.
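The settling test hinges on recomputing F1 against independently re-annotated labels. A minimal span-level scorer might look like the sketch below; exact-match on (start, end, type) spans is an assumption here, since the review does not state the paper's matching criterion (token-level or partial-overlap scoring would give different numbers).

```python
# Minimal span-level PII scorer over exact-match (start, end, type) spans.
# Hypothetical data; the paper's own matching criterion may differ.

def span_f1(gold, pred):
    """Precision, recall, F1 over sets of (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 5, "PERSON"), (20, 30, "DATE_OF_BIRTH")]
pred = [(0, 5, "PERSON"), (40, 44, "DATE_OF_BIRTH")]  # one hit, one false positive
p, r, f = span_f1(gold, pred)
print(round(f, 3))  # 0.5
```

With fresh annotations as `gold` and segment-aware detections as `pred`, the proposed bar is simply `f > 0.8` on the held-out set.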
Original abstract
Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
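The "numeric ambiguity" failure mode the abstract describes is easy to reproduce with a toy detector. The sketch below is illustrative only: the regex, the cue vocabulary, and both helper functions are invented here, not taken from the paper or from Presidio. A generic pattern that flags digit runs as candidate identifiers also fires on instructional arithmetic, while a crude math-context check suppresses those hits.

```python
import re

# Toy illustration of numeric ambiguity (not the paper's pipeline):
# a generic detector that treats digit runs as candidate identifiers
# will also flag instructional numbers in math tutoring turns.

ID_LIKE = re.compile(r"\b\d{3,}\b")  # naive "structured identifier" pattern
MATH_CUES = {"solve", "multiply", "sum", "equals", "plus", "divide", "="}

def generic_hits(text):
    """Generic detector: every 3+ digit run is a candidate identifier."""
    return ID_LIKE.findall(text)

def math_aware_hits(text):
    """Suppress numeric hits when the message carries math-context cues,
    loosely mirroring the paper's idea of conditioning on math context."""
    tokens = set(text.lower().split())
    if tokens & MATH_CUES:
        return []
    return generic_hits(text)

instructional = "Now multiply 365 by 24 to get the total hours."
pii_like = "My student ID is 904112."

print(generic_hits(instructional))     # ['365'] -- false positive
print(math_aware_hits(instructional))  # []
print(math_aware_hits(pii_like))       # ['904112']
```

The paper's actual methods are LLM prompting strategies, not regex rules; this stand-in only shows why domain context changes the false-positive profile on numbers.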
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MathEd-PII, the first benchmark dataset for PII detection in mathematics tutoring dialogues, constructed via human-in-the-loop LLM annotation. It uses density-based segmentation to demonstrate that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a failure mode for generic detectors. The work then evaluates four strategies—Presidio baseline plus three LLM prompting variants (basic, math-aware, segment-aware)—and reports that domain-aware prompting yields substantial gains (math-aware F1 0.802, segment-aware F1 0.821) over the baseline (F1 0.379) while reducing numeric false positives, arguing that utility-preserving de-identification requires domain context.
Significance. If the empirical results hold under proper validation, the paper supplies a needed domain-specific benchmark and concrete evidence that generic PII tools over-redact instructional content in math dialogues. This could directly support safer large-scale sharing of tutoring transcripts for learning-science research while preserving analytic utility.
major comments (2)
- [Abstract] The headline F1 gains (0.802 and 0.821 vs. 0.379) are presented without any mention of dataset size, inter-annotator agreement, statistical significance tests, or error analysis on numeric false positives; these omissions make it impossible to assess whether the reported outperformance is robust or merely an artifact of the annotation process.
- [Dataset construction] The human-in-the-loop LLM annotation used to create MathEd-PII ground truth introduces a circularity risk because the same class of models is later employed for detection; without a purely human baseline, IAA metrics, or targeted error analysis on ambiguous numeric expressions, the claimed reduction in numeric false positives cannot be confidently attributed to the prompting strategies rather than annotation bias.
minor comments (1)
- [Abstract] The phrase 'density-based segmentation' is used without a one-sentence definition or citation, which may hinder readers who are not already familiar with the technique.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of robustness and potential biases in our evaluation. We address each major comment below and have made targeted revisions to the manuscript to improve transparency and strengthen the claims.
Point-by-point responses
-
Referee: [Abstract] The headline F1 gains (0.802 and 0.821 vs. 0.379) are presented without any mention of dataset size, inter-annotator agreement, statistical significance tests, or error analysis on numeric false positives; these omissions make it impossible to assess whether the reported outperformance is robust or merely an artifact of the annotation process.
Authors: We agree that the abstract should provide sufficient context for readers to evaluate result robustness. The full manuscript already reports dataset size, inter-annotator agreement, significance testing, and numeric error analysis in the Dataset Construction and Results sections. We have revised the abstract to explicitly include dataset size, IAA metrics, and a brief reference to the error analysis on numeric false positives, while respecting the required length constraints. revision: yes
-
Referee: [Dataset construction] The human-in-the-loop LLM annotation used to create MathEd-PII ground truth introduces a circularity risk because the same class of models is later employed for detection; without a purely human baseline, IAA metrics, or targeted error analysis on ambiguous numeric expressions, the claimed reduction in numeric false positives cannot be confidently attributed to the prompting strategies rather than annotation bias.
Authors: We acknowledge the circularity concern inherent to LLM-assisted annotation. The process was strictly human-in-the-loop, with human annotators reviewing, correcting, and finalizing all labels; IAA metrics are reported in the Dataset Construction section to quantify annotator reliability. We have expanded the targeted error analysis on ambiguous numeric expressions in the revised Results section to better attribute performance gains to the prompting strategies. A purely human baseline at this scale was not feasible due to annotation cost, but the independent density-based segmentation analysis (showing false-positive clustering in math-dense regions) provides supporting evidence independent of the detection models. revision: partial
- A complete purely human-annotated baseline for the full MathEd-PII dataset is not available and would require substantial additional resources beyond the current study.
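The rebuttal leans on IAA metrics to establish annotator reliability. For two reviewers making parallel per-token PII / non-PII judgments, Cohen's kappa is the standard statistic; the sketch below is a stdlib implementation with invented labels (the paper's actual IAA measure and label scheme are not given in this review).

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(a) == len(b) and a, "need equal-length, non-empty sequences"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical per-token labels from two reviewers of the same transcript.
r1 = ["PII", "O", "O", "PII", "O", "O", "O", "PII"]
r2 = ["PII", "O", "O", "O",   "O", "O", "O", "PII"]
print(round(cohens_kappa(r1, r2), 3))  # 0.714
```

A purely human baseline would report exactly this kind of number over independently produced label sequences, which is what the referee's circularity concern asks for.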
Circularity Check
No circularity: empirical benchmark with direct F1 measurements
Full rationale
The paper introduces the MathEd-PII benchmark via human-in-the-loop LLM annotation and reports direct empirical F1 scores (baseline 0.379, math-aware 0.802, segment-aware 0.821) for prompting strategies on numeric ambiguity detection. No equations, parameter fits, derivations, or self-citations appear in the provided text that reduce any claimed result to its own inputs by construction. The performance numbers are straightforward held-out measurements rather than quantities defined or predicted from the authors' prior work, satisfying the self-contained empirical standard with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human-in-the-loop LLM annotation produces sufficiently accurate PII labels for benchmarking.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION The Educational Data Mining (EDM) community has long relied on large-scale open source data sets from digital learning platforms [14, 18]. Many of these data sources can be easily de-identified because they are composed of action logs that, for the most part, do not typically contain personally identifiable information (PII). As more and ...
-
[2]
RELATED WORK 2.1 Policy Context De-identification is a prerequisite for sharing educational interaction data at scale, but what counts as “sufficient” de-identification depends on the legal regime and the assumed adversary model. [Footnote 2: The benchmark dataset will be available following review.] In the United States, the Children’s On- line Privacy Protection ...
-
[3]
OVERVIEW OF THE CURRENT RESEARCH To answer the research questions outlined above, we follow a three-phase research workflow moving from dataset construction to analytical validation and finally to evaluation of domain-aware de-identification strategies. Phase 1: Dataset Preparation involves constructing MathEd-PII, a benchmark dataset for PII detectio...
-
[4]
PHASE 1: DATASET PREPARATION Reference datasets for PII detection in math education are currently lacking. To enable rigorous evaluation, we constructed a benchmark dataset, MathEd-PII, from a PII-redacted large corpus. 4.1 Source Corpus Our source corpus comprises 1,000 math tutoring sessions (115,620 messages; 769,628 tokens) from a U.S.-based tutor- ...
-
[5]
PHASE 2: MATH SEGMENTATION AND NUMERIC AMBIGUITY [Footnote 5: https://anonymized for blind review] Table 2: PII Statistics Comparison between Source Corpus and MathEd-PII (ordered by the most common PII to the least in MathEd-PII). Category (Source Corpus / MathEd-PII): Transcripts 1,000 / 1,000; Messages 115,620 / 115,620; PII Labels (Total) 5,263 / 1,995; PERSON 1,915 / 1,424; URL 24...
-
[6]
PHASE 3: PII DETECTION AND EVALUATION 6.1 PII Detection Methods 6.1.1 Baseline: Microsoft Presidio As a baseline, we deployed Microsoft Presidio (v2.2), an industry-standard open-source software development kit (SDK) for PII detection. We utilized its default analyzer which orchestrates a set of predefined recognizers. For high-structure entities (e.g...
-
[7]
DISCUSSION AND CONCLUSION This study investigated the challenge of utility-preserving de-identification in the context of math tutoring transcripts, focusing on the phenomenon of numeric ambiguity. By introducing MathEd-PII, the first benchmark dataset for this domain, we provided a rigorous foundation for evaluating PII detection methods that balance p...
-
[8]
A. Caines, H. Yannakoudakis, H. Allen, P. Pérez-Paredes, B. Byrne, and P. Buttery. The teacher-student chatroom corpus version 2: more lessons, new annotation, automatic detection of sequence shifts. In D. Alfter, E. Volodina, T. François, P. Desmet, F. Cornillie, A. Jönsson, and E. Rennes, editors, Proceedings of the 11th Workshop on NLP for Compute...
2022
-
[9]
D. S. Carrell, B. Malin, J. Aberdeen, S. Bayer, and C. Clark. Hiding in plain sight: Use of realistic surrogates to reduce exposure of protected health information in clinical text. Journal of the American Medical Informatics Association, pages 342–348, 2013
-
[10]
G. Deacon and G. Chojnacki. Impacts of upchieve on-demand tutoring on students’ math knowledge and perceptions. middle years math grantee report series. Mathematica, 2023
-
[11]
Federal Trade Commission. Children’s online privacy protection act (coppa) guidance. https://www.ftc.gov/business-guidance/resources/complying-coppa-frequently-asked-questions,
-
[12]
Accessed: 2026-02-10
- [13]
-
[14]
S. L. Garfinkel. De-identification of personal information. NIST Interagency Report 8053, National Institute of Standards and Technology, Oct. 2015
-
[15]
S. L. Garfinkel. De-identifying government data sets: Techniques and governance. NIST Special Publication 800-188, National Institute of Standards and Technology, Sept. 2023
- [16]
-
[17]
L. Holmes, S. Crossley, N. Hayes, D. Kuehl, A. Trumbore, and G. Gutu-Robu. De-identification of student writing in technologically mediated educational settings. In Polyphonic Construction of Smart Learning Ecosystems: Improving Inclusive Digital Education, pages 177–189, Singapore, 2023. Springer Nature Singapore
- [18]
-
[19]
L. Holmes, S. Crossley, J. Wang, and W. Zhang. The cleaned repository of annotated personally identifiable information. In P. Benjamin and D. E. Carrie, editors, Proceedings of the 17th International Conference on Educational Data Mining, pages 790–796, Atlanta, Georgia, USA, July 2024
-
[20]
M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd. spacy: Industrial-strength natural language processing in python. https://doi.org/10.5281/zenodo.1212303, 2020
-
[21]
Z. Huang, W. Xu, and K. Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015
-
[22]
K. R. Koedinger, R. S. Baker, K. Cunningham, A. Skogsholm, B. Leber, and J. Stamper. A data repository for the edm community: The pslc datashop. Handbook of educational data mining, 43:43–56, 2010
-
[23]
Learning Commons Initiative. Learning components. https://docs.learningcommons.org/knowledge-graph/entity-and-relationship-reference/learning-components, 2024. Accessed: 2026-02-05
-
[24]
Microsoft. Presidio: Data protection and de-identification sdk. https://github.com/microsoft/presidio, 2020. GitHub repository. Accessed 2026-01-20
-
[25]
Microsoft. presidio-research: Research utilities for Presidio (including synthetic text generation). https://github.com/microsoft/presidio-research,
- [26]
-
[27]
M. C. Mihaescu and P. S. Popescu. Review on publicly available datasets for educational data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(3):e1403, 2021
- [28]
- [29]
-
[30]
I. Neamatullah, M. M. Douglass, L.-W. H. Lehman, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, and G. D. Clifford. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8(32), 2008
- [31]
-
[32]
M. Savkin, T. Ionov, and V. Konovalov. SPY: Enhancing privacy with synthetic PII detection dataset. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 236–246. Association for Computational Linguistics, 2025
- [33]
-
[34]
K. Singhal, J. Zambrano, L. Pankiewicz, and R. Baker. Educational data de-identification with large language models. In Proceedings of the 17th International Conference on Educational Data Mining (EDM), pages 559–565, 2024
- [35]
-
[36]
A. Stubbs and Ö. Uzuner. Automated systems for the de-identification of longitudinal clinical narratives: Overview of the 2014 i2b2/uthealth shared task track 1. Journal of Biomedical Informatics, 58:S11–S19, 2015
- [37]
-
[38]
U.S. Department of Education, Privacy Technical Assistance Center (PTAC). Data de-identification: An overview of basic terms. https://studentprivacy.ed.gov/sites/default/files/resource_document/file/data_deidentification_terms_0.pdf, 2012. Updated May 2013. Accessed 2026-01-20
-
[39]
U.S. Department of Health and Human Services, Office for Civil Rights. Guidance regarding methods for de-identification of protected health information in accordance with the HIPAA privacy rule. https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf, 2012. Accessed 2026-01-20
-
[40]
J. Zambrano, K. Singhal, L. Pankiewicz, R. Baker, L. Porter, and L. Liu. De-identifying student personally identifying information in discussion forum posts with large language models. Information and Learning Sciences, 126(5/6):401–424, 2025
-
[41]
M. Zent, D. Smith, and S. Woodhead. PIIvot: A lightweight NLP anonymization framework for question-anchored tutoring dialogues. arXiv, 2025
-
[42]
APPENDIX 1: THE PROMPT USED FOR PII QUALITY EVALUATION AND SURROGATE GENERATION Role: You are a Senior PII (Personally Identifiable Information) Analyst and Data Sanitization Expert specializing in math tutoring transcripts. Objective: Analyze transcripts to identify unredacted PII, validate existing redactions, and generate high-quality, context-awa...
-
[43]
Detection: Scan each message for PII from the taxonomy. Some PII has been redacted. Some has not. For every PII instance, identify which PII type in the taxonomy it belongs to
-
[44]
Evaluation: For every PII instance (pre-redacted or newly found), use at least a window of 3 messages above and 3 messages below to determine if the tag is valid. Label as ”PII”, ”Not PII”, or ”Uncertain”
-
[45]
Redaction: If unredacted PII is found, provide the message with the PII replaced by the tag (e.g., <PERSON>)
-
[46]
Surrogation: 4.1. If ”PII” or ”Uncertain”: Generate a specific, realistic surrogate that fits the PII type (e.g., replace <SCHOOL> with ”Northview High”, not ”the school”). Keep the entity name consistent in a transcript. Meanwhile, do not reuse the same names or places across the transcript. If the original PII is known, the generated surrogate should be...
-
[47]
pii type: The identified category from the 17 types of each message containing PII
-
[48]
ai redacted content: The message with <PII TYPE> (only for newly discovered PII; otherwise leave blank)
-
[49]
pii evaluation: ”PII”, ”Not PII”, or ”Uncertain”
-
[50]
surrogate: The specific replacement value for the tag
-
[51]
APPENDIX 2: MATH VOCABULARY The following vocabulary list, categorized by mathematical domain and grade level, was used to calculate the Math Density (Dmath) of messages in Phase 2. Operations: operation, add, addition, adding, sum, total, plus, subtract, subtraction, subtracting, minus, difference, multiply, multiplication, multiplying, times, product, d...
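Appendix 2 defines Math Density (Dmath) through a vocabulary list, but the excerpt above does not reproduce the formula. The sketch below assumes the simplest plausible definition, the fraction of tokens that appear in the math vocabulary; the paper's actual Dmath may weight terms or use a sliding window, and the vocabulary here is only a small subset of the appendix list.

```python
import re

# Assumed definition: D_math = (# math-vocabulary tokens) / (# tokens).
# MATH_VOCAB is a small subset of the paper's Appendix 2 list; the
# paper's actual D_math formula may differ (weighting, windowing).
MATH_VOCAB = {
    "operation", "add", "addition", "sum", "total", "plus",
    "subtract", "minus", "difference", "multiply", "times",
    "product", "divide",
}

def math_density(message):
    """Fraction of alphabetic tokens drawn from the math vocabulary."""
    tokens = re.findall(r"[a-z]+", message.lower())
    if not tokens:
        return 0.0
    return sum(t in MATH_VOCAB for t in tokens) / len(tokens)

print(math_density("add the product to the total"))  # 0.5
print(math_density("my phone number is 555-0134"))   # 0.0
```

Messages with high density would fall into the math-dense regions where, per Phase 2, false redactions cluster.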
-
[52]
APPENDIX 3: SEGMENTATION OPTIMIZATION This appendix documents the procedure and results of the threshold optimization used for math segmentation. We conducted a grid search to examine the sensitivity of math-segmentation outcomes to two parameters: the Anchor Threshold (T_anchor) and the Similarity Threshold (T_sim). The goal of this analysis was to assess...
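The grid-search procedure Appendix 3 describes can be sketched as follows. The excerpt does not give the segmentation routine, the objective, or the grid values, so `segment_and_score` is a placeholder objective and both grids are invented for illustration.

```python
import itertools

# Skeleton of an Appendix 3-style grid search over the Anchor Threshold
# (T_anchor) and Similarity Threshold (T_sim). `segment_and_score` is a
# stand-in for the paper's segmentation + quality metric, and the grid
# values are illustrative, not the paper's.

def segment_and_score(t_anchor, t_sim):
    # Placeholder objective peaking at (0.6, 0.4); replace with a real
    # segmentation-quality measure over held-out transcripts.
    return 1.0 - abs(t_anchor - 0.6) - abs(t_sim - 0.4)

anchor_grid = [0.4, 0.5, 0.6, 0.7]
sim_grid = [0.2, 0.3, 0.4, 0.5]

best = max(
    itertools.product(anchor_grid, sim_grid),
    key=lambda ts: segment_and_score(*ts),
)
print(best)  # (0.6, 0.4)
```

Reporting the score surface over the full grid, rather than just the argmax, is what lets the appendix speak to sensitivity.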
-
[53]
APPENDIX 4: THE BASIC PROMPT USED FOR PII DETECTION You are a specialist in PII (Personally Identifiable Information) detection. Your task is to identify ALL PII in the provided message content. PII Types to detect: AGE; COURSE: must be a subject or its acronym with a multi-digit number, e.g., algebra 300, geometry 101, CS 503; only a subject name without ...
-
[54]
APPENDIX 5: THE MATH-AWARE PROMPT USED FOR PII DETECTION You are a specialist in PII (Personally Identifiable Information) detection. Your task is to identify ALL PII in the provided message content that comes from math tutoring sessions. Pay attention that general math content should not be annotated as PII, e.g., math subjects, concepts, symbols, eq...
-
[55]
APPENDIX 6: THE SEGMENT-AWARE PROMPT USED FOR PII DETECTION You are a specialist in PII (Personally Identifiable Information) detection. Your task is to identify ALL PII in the provided message content. If the message is likely to be about mathematics, its “math label” field will have the value “MATH”. Otherwise, the “math label” will be “NON-MATH”. N...
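The segment-aware prompt conditions the detector on a per-message "math label". Wiring that label into the prompt might look like the sketch below; the message schema, the 0.3 threshold, and the template wording are all assumptions (the template paraphrases Appendix 6, and the density score would come from the Phase 2 segmentation).

```python
# Sketch: attaching the "math_label" field that a segment-aware prompt
# reads. The threshold (0.3) and message schema are assumptions; the
# density score stands in for the paper's Phase 2 segmentation output.

def label_message(message, density, threshold=0.3):
    """Tag a message MATH / NON-MATH from its (precomputed) math density."""
    return {
        "content": message,
        "math_label": "MATH" if density >= threshold else "NON-MATH",
    }

# Paraphrase of the Appendix 6 idea, not the paper's exact prompt text.
SEGMENT_AWARE_TEMPLATE = (
    "You are a specialist in PII detection. Identify ALL PII in the "
    "message below. Its \"math_label\" field is {math_label}; numbers in "
    "MATH messages are usually instructional content, not PII.\n"
    "Message: {content}"
)

msg = label_message("Now add 365 and 24.", density=0.4)
prompt = SEGMENT_AWARE_TEMPLATE.format(**msg)
print(msg["math_label"])  # MATH
```

Passing the label explicitly is what distinguishes this variant from the math-aware prompt, which relies on the model to infer math context on its own.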