Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Bakhtawar Ahtisham; Chris Shaw; Daryl Hedley; Doug Pietrzak; Jinsook Lee; Jorge Dias; Kirk Vanacore; René F. Kizilcec; Ruth Schäfer; Zhuqian Zhou

arxiv: 2602.16571 · v3 · pith:OSUS25D3new · submitted 2026-02-18 · 💻 cs.CL

Pith reviewed 2026-05-15 21:12 UTC · model grok-4.3

keywords: PII detection, de-identification, math tutoring, numeric ambiguity, LLM prompting, benchmark dataset, educational data, utility preservation

The pith

Domain-aware LLM prompting detects PII in math tutoring dialogues while preserving instructional numbers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In mathematics tutoring transcripts, numeric expressions often resemble personal identifiers such as dates or IDs, so generic de-identification tools remove core instructional content and reduce the data's utility for research. The paper introduces the MathEd-PII benchmark dataset, constructed via human-in-the-loop LLM annotation, to study this numeric ambiguity. Density-based segmentation shows that erroneous redactions cluster in math-dense regions. Four detection strategies are compared: a Presidio baseline and LLM-based detectors with basic, math-aware, and segment-aware prompts. Against the baseline F1 of 0.379, math-aware and segment-aware prompting reach F1 of 0.802 and 0.821 respectively, with fewer false removals of numbers. The paper concludes that detection methods must incorporate domain context to maintain analytic value when tutoring data is shared at scale.

Core claim

Generic PII detection systems over-redact numeric expressions in math tutoring dialogues due to ambiguity with structured identifiers, but domain-aware prompting strategies for LLMs, including math-aware and segment-aware variants, substantially improve detection accuracy on the new MathEd-PII benchmark dataset while reducing numeric false positives and thereby preserving educational utility.
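
The over-redaction described above can be illustrated with a toy detector. This is a minimal sketch, assuming two hypothetical regular expressions for dates and phone numbers; they are written for illustration and are not the rules used by Presidio or by the upstream system in the paper.

```python
import re

# Hypothetical context-free patterns (assumptions for illustration only).
GENERIC_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
}

def generic_redact(text):
    """Replace every pattern match with its label, ignoring context."""
    for label, pattern in GENERIC_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

# Fractions look like dates and a subtraction looks like a phone number.
utterance = "First add 3/4 and 1/4, then work out 555-1234."
print(generic_redact(utterance))
# prints: First add <DATE> and <DATE>, then work out <PHONE>.
```

Every number in the utterance is instructional, yet all three are removed, which is the failure mode the core claim refers to.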

What carries the argument

Density-based segmentation to locate math-dense regions, paired with math-aware and segment-aware prompting of LLMs to distinguish instructional numbers from PII.
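
One way to picture density-based segmentation is sketched below. The page does not spell out the paper's algorithm or its thresholds, so everything here is an assumption: the token test, the per-turn density measure, and the `threshold=0.3` default are hypothetical choices, not the paper's parameters.

```python
import re

# A token counts as "math" if it contains a digit or is a lone operator.
# This heuristic is an assumption made for the sketch.
_MATH_TOKEN = re.compile(r"\d|^[+\-*/=^<>]$")

def math_density(turn):
    """Fraction of whitespace-separated tokens that look mathematical."""
    tokens = turn.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if _MATH_TOKEN.search(tok)) / len(tokens)

def math_dense_segments(turns, threshold=0.3):
    """Return (start, end) turn-index pairs, end exclusive, of consecutive dense turns."""
    segments, start = [], None
    for i, turn in enumerate(turns):
        dense = math_density(turn) >= threshold
        if dense and start is None:
            start = i                      # a dense run begins
        elif not dense and start is not None:
            segments.append((start, i))    # the dense run ends
            start = None
    if start is not None:
        segments.append((start, len(turns)))
    return segments

turns = [
    "Hi, how was your weekend?",
    "Let's solve 3 x + 5 = 20",
    "So 3 x = 15 and x = 5",
    "Great job, see you next week!",
]
print(math_dense_segments(turns))  # prints: [(1, 3)]
```

A detector can then treat candidate PII inside the returned segments with more suspicion than candidates in conversational turns.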

If this is right

Generic PII detectors are inadequate for domain-specific educational dialogues because they cannot resolve numeric ambiguity.
Domain-aware methods enable larger-scale sharing of de-identified math tutoring data without destroying core instructional content.
Segment-aware prompting achieves the highest F1 of the strategies compared (0.821) by incorporating local math-density information during detection.
Human-in-the-loop annotation offers a practical route to building reliable domain-specific PII benchmarks.
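
The three prompting variants can be sketched as a small prompt builder. The paper's actual prompts are not reproduced on this page, so the instruction wording, the strategy names, and the `build_prompt` helper below are all hypothetical.

```python
# Hypothetical instruction text; the paper's real prompts may differ substantially.
BASIC = (
    "Identify all personally identifiable information (PII) in the tutoring "
    "utterance below. Return the exact PII spans, one per line."
)
MATH_AWARE = BASIC + (
    " Numbers that belong to the math being taught (operands, answers, "
    "fractions, equations) should be left alone."
)

def build_prompt(utterance, strategy="basic", in_math_segment=False):
    """Assemble a detection prompt for one utterance under a given strategy."""
    if strategy == "basic":
        instructions = BASIC
    elif strategy == "math_aware":
        instructions = MATH_AWARE
    elif strategy == "segment_aware":
        # Segment-aware = math-aware plus a hint about local math density.
        where = "inside" if in_math_segment else "outside"
        instructions = f"{MATH_AWARE} This utterance falls {where} a math-dense segment."
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return f"{instructions}\n\nUtterance: {utterance}"

print(build_prompt("So 3/4 plus 1/4 is 1", "segment_aware", in_math_segment=True))
```

The design point is that the segment-aware variant only adds context to the math-aware one; the `in_math_segment` flag would come from a separate segmentation step.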

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same numeric ambiguity problem likely appears in other quantitative tutoring domains such as physics or chemistry dialogues.
Direct integration of mathematical expression parsers into detection pipelines could reduce false positives beyond what prompting alone achieves.
These techniques could support standardized privacy practices for releasing learning analytics datasets from schools and platforms.
Testing the prompting strategies on real-time live tutoring sessions rather than transcripts would reveal additional practical constraints.
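
The expression-parser idea above can be sketched with Python's standard `ast` module. This is an illustration under stated assumptions, not something the paper implements: the `is_true_equation` helper is hypothetical, and it only recognises text of the form `lhs = rhs` where both sides are plain arithmetic with equal value.

```python
import ast
import operator
from fractions import Fraction

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _eval(node):
    """Evaluate a parsed expression exactly, allowing only numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("not plain arithmetic")

def is_true_equation(text):
    """True when text is 'lhs = rhs' and both sides are arithmetic with equal value."""
    if text.count("=") != 1:
        return False
    lhs, rhs = text.split("=")
    try:
        left = _eval(ast.parse(lhs.strip(), mode="eval"))
        right = _eval(ast.parse(rhs.strip(), mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return left == right

print(is_true_equation("3/4 + 1/4 = 1"))    # prints: True
print(is_true_equation("DOB = 3/4/2012"))   # prints: False
```

Note the limits of the sketch: a bare `555-1234` or `3/4/2012` is itself valid arithmetic, so parsing alone cannot resolve the ambiguity, and an incorrect student answer such as `2 + 2 = 5` fails the check even though it is instructional. A check like this is at best one conservative signal for keeping a span.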

Load-bearing premise

The human-in-the-loop LLM annotation process yields reliable ground-truth PII labels that generalize to unseen math tutoring dialogues without systematic bias in numeric patterns.

What would settle it

Re-annotating a fresh held-out collection of math tutoring transcripts with independent human reviewers and measuring whether segment-aware prompting still achieves F1 above 0.8 with low numeric false positives.

Figures

Figures reproduced from arXiv: 2602.16571 by Bakhtawar Ahtisham, Chris Shaw, Daryl Hedley, Doug Pietrzak, Jinsook Lee, Jorge Dias, Kirk Vanacore, René F. Kizilcec, Ruth Schäfer, Zhuqian Zhou.

**Figure 1.** Distribution of Original PII Redactions from the Upstream System (Presidio and Customized Rules) across Different …

**Figure 2.** Math Segmentation. Figure 2a visualizes how these proportions vary across the threshold space. Across all configurations, the false positive proportion remains consistently higher than the true positive proportion. As Tsim increases from 0.1 to 0.3, false positive capture declines gradually, while for Tsim ≥ 0.3 it stabilizes at approximately 50%–55% across anchor thresholds. In contrast, the true positi…

Original abstract

Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
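
For readers weighing the F1 figures, the metric is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below scores predicted spans against gold spans by exact match; the matching criterion and the `prf1` helper are assumptions, since this page does not state how the paper matches spans.

```python
def prf1(gold, predicted):
    """Precision, recall, and F1 over exactly matching (start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # spans found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: one true name span found, one missed, two numbers wrongly flagged.
gold = {(0, 5), (20, 28)}
predicted = {(0, 5), (10, 13), (30, 34)}
p, r, f1 = prf1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))  # prints: 0.333 0.5 0.4
```

In the toy example the two false positives pull precision down to one third, which is how over-redaction of numbers depresses a detector's F1.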

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces the first benchmark for PII detection in math tutoring transcripts and shows domain-aware prompting lifts F1 from 0.38 to 0.82, but the gains rest on LLM-assisted labels whose reliability is not yet shown.

The core contribution here is a new dataset called MathEd-PII plus a clear demonstration that generic PII tools over-redact numbers in tutoring dialogues. The authors segment the transcripts by math density and find that false positives cluster there, which matches what anyone who has looked at these logs would expect. They then test basic, math-aware, and segment-aware prompts against a Presidio baseline and report F1 scores rising from 0.379 to 0.802 and 0.821. That gap is large enough to matter for anyone trying to release tutoring data at scale.

Referee Report

2 major / 1 minor

Summary. The paper introduces MathEd-PII, the first benchmark dataset for PII detection in mathematics tutoring dialogues, constructed via human-in-the-loop LLM annotation. It uses density-based segmentation to demonstrate that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a failure mode for generic detectors. The work then evaluates four strategies—Presidio baseline plus three LLM prompting variants (basic, math-aware, segment-aware)—and reports that domain-aware prompting yields substantial gains (math-aware F1 0.802, segment-aware F1 0.821) over the baseline (F1 0.379) while reducing numeric false positives, arguing that utility-preserving de-identification requires domain context.

Significance. If the empirical results hold under proper validation, the paper supplies a needed domain-specific benchmark and concrete evidence that generic PII tools over-redact instructional content in math dialogues. This could directly support safer large-scale sharing of tutoring transcripts for learning-science research while preserving analytic utility.

major comments (2)

[Abstract] The headline F1 gains (0.802 and 0.821 vs. 0.379) are presented without any mention of dataset size, inter-annotator agreement, statistical significance tests, or error analysis on numeric false positives; these omissions make it impossible to assess whether the reported outperformance is robust or merely an artifact of the annotation process.
[Dataset construction] The human-in-the-loop LLM annotation used to create MathEd-PII ground truth introduces a circularity risk because the same class of models is later employed for detection; without a purely human baseline, IAA metrics, or targeted error analysis on ambiguous numeric expressions, the claimed reduction in numeric false positives cannot be confidently attributed to the prompting strategies rather than annotation bias.
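
The inter-annotator agreement requested in the major comments is commonly summarised with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is illustrative only: this page does not say which IAA metric the paper reports, and the token labels in the example are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented token-level labels: "PII" or "O" (not PII).
a = ["PII", "O", "O", "O", "PII", "O", "O", "O", "O", "O"]
b = ["PII", "O", "O", "O", "O", "O", "O", "PII", "O", "O"]
print(round(cohens_kappa(a, b), 3))  # prints: 0.375
```

The example shows why raw agreement can mislead for PII labels: the annotators agree on 8 of 10 tokens, yet kappa is only 0.375 because most tokens are trivially non-PII.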

minor comments (1)

[Abstract] The phrase 'density-based segmentation' is used without a one-sentence definition or citation, which may hinder readers who are not already familiar with the technique.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important aspects of robustness and potential biases in our evaluation. We address each major comment below and have made targeted revisions to the manuscript to improve transparency and strengthen the claims.

Referee: [Abstract] The headline F1 gains (0.802 and 0.821 vs. 0.379) are presented without any mention of dataset size, inter-annotator agreement, statistical significance tests, or error analysis on numeric false positives; these omissions make it impossible to assess whether the reported outperformance is robust or merely an artifact of the annotation process.

Authors: We agree that the abstract should provide sufficient context for readers to evaluate result robustness. The full manuscript already reports dataset size, inter-annotator agreement, significance testing, and numeric error analysis in the Dataset Construction and Results sections. We have revised the abstract to explicitly include dataset size, IAA metrics, and a brief reference to the error analysis on numeric false positives, while preserving the required length constraints.

revision: yes

Referee: [Dataset construction] The human-in-the-loop LLM annotation used to create MathEd-PII ground truth introduces a circularity risk because the same class of models is later employed for detection; without a purely human baseline, IAA metrics, or targeted error analysis on ambiguous numeric expressions, the claimed reduction in numeric false positives cannot be confidently attributed to the prompting strategies rather than annotation bias.

Authors: We acknowledge the circularity concern inherent to LLM-assisted annotation. The process was strictly human-in-the-loop, with human annotators reviewing, correcting, and finalizing all labels; IAA metrics are reported in the Dataset Construction section to quantify annotator reliability. We have expanded the targeted error analysis on ambiguous numeric expressions in the revised Results section to better attribute performance gains to the prompting strategies. A purely human baseline at this scale was not feasible due to annotation cost, but the independent density-based segmentation analysis (showing false-positive clustering in math-dense regions) provides supporting evidence independent of the detection models.

revision: partial

standing simulated objections not resolved

A complete purely human-annotated baseline for the full MathEd-PII dataset is not available and would require substantial additional resources beyond the current study.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct F1 measurements

The paper introduces the MathEd-PII benchmark via human-in-the-loop LLM annotation and reports direct empirical F1 scores (baseline 0.379, math-aware 0.802, segment-aware 0.821) for prompting strategies on numeric ambiguity detection. No equations, parameter fits, derivations, or self-citations appear in the provided text that reduce any claimed result to its own inputs by construction. The performance numbers are straightforward held-out measurements rather than quantities defined or predicted from the authors' prior work, satisfying the self-contained empirical standard with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about PII definitions and LLM annotation quality rather than new free parameters or invented entities.

axioms (1)

domain assumption: Human-in-the-loop LLM annotation produces sufficiently accurate PII labels for benchmarking.
Invoked when constructing the MathEd-PII dataset.

pith-pipeline@v0.9.0 · 5582 in / 1310 out tokens · 43173 ms · 2026-05-15T21:12:34.060180+00:00 · methodology
