Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Pith reviewed 2026-05-15 21:12 UTC · model grok-4.3
The pith
Domain-aware LLM prompting detects PII in math tutoring dialogues while preserving instructional numbers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generic PII detection systems over-redact numeric expressions in math tutoring dialogues because those expressions resemble structured identifiers. Domain-aware LLM prompting, in math-aware and segment-aware variants, substantially improves detection accuracy on the new MathEd-PII benchmark dataset while reducing numeric false positives, thereby preserving educational utility.
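To make the numeric ambiguity problem concrete, here is a toy sketch in Python. The two regular expressions are hypothetical stand-ins for generic pattern-based recognizers; they are not Presidio's actual rules and are not taken from the paper. The point is only that a date-like pattern cannot tell a fraction from a date.

```python
import re

# Hypothetical generic patterns for illustration only.
# These are NOT Presidio's recognizers and not the paper's baseline.
GENERIC_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
}

def generic_detect(text):
    """Return (label, matched_text) pairs for every pattern hit, in pattern order."""
    hits = []
    for label, pattern in GENERIC_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

# Invented tutoring turn: three fractions (instructional) and one phone number (PII).
tutor_turn = "So 3/4 plus 1/2 is 5/4. Call my mom at 555-0134 if we run over."
hits = generic_detect(tutor_turn)
for label, value in hits:
    print(label, value)
```

In this invented turn the three fractions are flagged as dates alongside the one genuine identifier, so redacting every hit would erase the arithmetic the turn is about.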
What carries the argument
Density-based segmentation to locate math-dense regions, paired with math-aware and segment-aware prompting of LLMs to distinguish instructional numbers from PII.
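The review does not spell out how the density-based segmentation is computed, so the following is only one plausible sketch under stated assumptions: a token counts as "math" if it contains a digit or consists solely of operator characters, density is the share of math tokens per utterance, and utterances are labeled with a centered moving average. The names `math_density` and `segment_by_density`, the 0.3 threshold, and the window of 3 are all hypothetical.

```python
import re

# Assumption: a "math token" contains a digit or is made up only of operator characters.
MATH_TOKEN = re.compile(r"\d|^[+\-*/=^<>()]+$")

def math_density(utterance):
    """Fraction of whitespace-separated tokens that look mathematical."""
    tokens = utterance.split()
    if not tokens:
        return 0.0
    math_tokens = sum(1 for token in tokens if MATH_TOKEN.search(token))
    return math_tokens / len(tokens)

def segment_by_density(utterances, threshold=0.3, window=3):
    """Label each utterance using a centered moving average of math density.

    threshold and window are illustrative values, not the paper's settings.
    """
    densities = [math_density(u) for u in utterances]
    half = window // 2
    labels = []
    for i in range(len(densities)):
        lo, hi = max(0, i - half), min(len(densities), i + half + 1)
        smoothed = sum(densities[lo:hi]) / (hi - lo)
        labels.append("math-dense" if smoothed >= threshold else "sparse")
    return labels

# Invented dialogue for illustration.
dialogue = [
    "Hi, how was your weekend?",
    "It was good, thanks for asking.",
    "Okay so 12 / 4 = 3",
    "Then 3 + 5 = 8 right?",
    "Great, see you next week!",
    "Bye!",
]
labels = segment_by_density(dialogue)
```

A segment-aware prompt could then pass the label of the surrounding region to the LLM along with the text, which is one way to read "incorporating local density information"; the actual prompt design is not given here.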
If this is right
- Generic PII detectors are inadequate for domain-specific educational dialogues because they cannot resolve numeric ambiguity.
- Domain-aware methods enable larger-scale sharing of de-identified math tutoring data without destroying core instructional content.
- Segment-aware prompting delivers the highest accuracy by incorporating local density information during detection.
- Human-in-the-loop annotation offers a practical route to building reliable domain-specific PII benchmarks.
Where Pith is reading between the lines
- The same numeric ambiguity problem likely appears in other quantitative tutoring domains such as physics or chemistry dialogues.
- Direct integration of mathematical expression parsers into detection pipelines could reduce false positives beyond what prompting alone achieves.
- These techniques could support standardized privacy practices for releasing learning analytics datasets from schools and platforms.
- Testing the prompting strategies on real-time live tutoring sessions rather than transcripts would reveal additional practical constraints.
Load-bearing premise
The human-in-the-loop LLM annotation process yields reliable ground-truth PII labels that generalize to unseen math tutoring dialogues without systematic bias in numeric patterns.
What would settle it
Re-annotating a fresh held-out collection of math tutoring transcripts with independent human reviewers and measuring whether segment-aware prompting still achieves F1 above 0.8 with low numeric false positives.
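The check described above could be scored roughly as follows. This is a minimal sketch that assumes exact-match scoring over `(start, end, label)` character spans and counts a false positive as "numeric" when the flagged text contains a digit; the paper's own matching rule and its definition of a numeric false positive may differ. The function names are hypothetical.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def numeric_false_positives(text, gold, predicted):
    """Predicted spans absent from gold whose covered text contains a digit."""
    gold = set(gold)
    return [
        span for span in predicted
        if span not in gold and any(ch.isdigit() for ch in text[span[0]:span[1]])
    ]

# Invented example: the name is real PII, the fraction is instructional content.
text = "Maria, what is 3/4 of 12?"
gold = [(0, 5, "NAME")]
predicted = [(0, 5, "NAME"), (15, 18, "DATE")]

precision, recall, f1 = span_prf(gold, predicted)
numeric_fps = numeric_false_positives(text, gold, predicted)
```

Reporting the numeric false positive count next to F1 matters because a detector can score well overall while still redacting the instructional numbers that give the data its value.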
Original abstract
Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MathEd-PII, the first benchmark dataset for PII detection in mathematics tutoring dialogues, constructed via human-in-the-loop LLM annotation. It uses density-based segmentation to demonstrate that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a failure mode for generic detectors. The work then evaluates four strategies—Presidio baseline plus three LLM prompting variants (basic, math-aware, segment-aware)—and reports that domain-aware prompting yields substantial gains (math-aware F1 0.802, segment-aware F1 0.821) over the baseline (F1 0.379) while reducing numeric false positives, arguing that utility-preserving de-identification requires domain context.
Significance. If the empirical results hold under proper validation, the paper supplies a needed domain-specific benchmark and concrete evidence that generic PII tools over-redact instructional content in math dialogues. This could directly support safer large-scale sharing of tutoring transcripts for learning-science research while preserving analytic utility.
major comments (2)
- [Abstract] The headline F1 gains (0.802 and 0.821 vs. 0.379) are presented without any mention of dataset size, inter-annotator agreement, statistical significance tests, or error analysis on numeric false positives; these omissions make it impossible to assess whether the reported outperformance is robust or merely an artifact of the annotation process.
- [Dataset construction] The human-in-the-loop LLM annotation used to create the MathEd-PII ground truth introduces a circularity risk because the same class of models is later employed for detection; without a purely human baseline, IAA metrics, or targeted error analysis on ambiguous numeric expressions, the claimed reduction in numeric false positives cannot be confidently attributed to the prompting strategies rather than annotation bias.
minor comments (1)
- [Abstract] The phrase 'density-based segmentation' is used without a one-sentence definition or citation, which may hinder readers who are not already familiar with the technique.
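For readers unfamiliar with the inter-annotator agreement (IAA) the referee asks for, Cohen's kappa is one common choice for two annotators. The sketch below computes it over token-level labels; which IAA statistic the paper actually reports is not stated in this review, so treat the metric choice, the `cohen_kappa` name, and the example labels as assumptions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented token-level labels from two hypothetical annotators.
annotator_a = ["PII", "O", "O", "O", "PII", "O"]
annotator_b = ["PII", "O", "PII", "O", "PII", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Here the annotators agree on five of six tokens, yet kappa is noticeably lower than raw agreement because half of that agreement is expected by chance.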
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of robustness and potential biases in our evaluation. We address each major comment below and have made targeted revisions to the manuscript to improve transparency and strengthen the claims.
- Referee: [Abstract] The headline F1 gains (0.802 and 0.821 vs. 0.379) are presented without any mention of dataset size, inter-annotator agreement, statistical significance tests, or error analysis on numeric false positives; these omissions make it impossible to assess whether the reported outperformance is robust or merely an artifact of the annotation process.
  Authors: We agree that the abstract should provide sufficient context for readers to evaluate result robustness. The full manuscript already reports dataset size, inter-annotator agreement, significance testing, and numeric error analysis in the Dataset Construction and Results sections. We have revised the abstract to explicitly include dataset size, IAA metrics, and a brief reference to the error analysis on numeric false positives, while staying within the required length constraints.
  Revision: yes
- Referee: [Dataset construction] The human-in-the-loop LLM annotation used to create the MathEd-PII ground truth introduces a circularity risk because the same class of models is later employed for detection; without a purely human baseline, IAA metrics, or targeted error analysis on ambiguous numeric expressions, the claimed reduction in numeric false positives cannot be confidently attributed to the prompting strategies rather than annotation bias.
  Authors: We acknowledge the circularity concern inherent to LLM-assisted annotation. The process was strictly human-in-the-loop, with human annotators reviewing, correcting, and finalizing all labels; IAA metrics are reported in the Dataset Construction section to quantify annotator reliability. We have expanded the targeted error analysis on ambiguous numeric expressions in the revised Results section to better attribute performance gains to the prompting strategies. A purely human baseline at this scale was not feasible due to annotation cost, but the independent density-based segmentation analysis (showing false-positive clustering in math-dense regions) provides supporting evidence independent of the detection models.
  Revision: partial
- A complete purely human-annotated baseline for the full MathEd-PII dataset is not available and would require substantial additional resources beyond the current study.
Circularity Check
No circularity: empirical benchmark with direct F1 measurements
The paper introduces the MathEd-PII benchmark via human-in-the-loop LLM annotation and reports direct empirical F1 scores (baseline 0.379, math-aware 0.802, segment-aware 0.821) for prompting strategies on numeric ambiguity detection. No equations, parameter fits, derivations, or self-citations appear in the provided text that reduce any claimed result to its own inputs by construction. The performance numbers are straightforward held-out measurements rather than quantities defined or predicted from the authors' prior work, satisfying the self-contained empirical standard with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-in-the-loop LLM annotation produces sufficiently accurate PII labels for benchmarking