Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Bakhtawar Ahtisham; Haocheng Zhang; Kirk Vanacore; René F. Kizilcec; Zhuqian Zhou

arxiv: 2606.18372 · v1 · pith:P4NW52EPnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

Pith reviewed 2026-06-27 00:51 UTC · model grok-4.3

keywords: de-identification, educational dialogue, local AI, privacy triage, PII detection, context-aware review, named entity recognition, cascade framework

The pith

A fully local two-stage cascade reaches 0.958 macro F1 for educational dialogue de-identification by using context to decide redact or keep.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that educational dialogue de-identification can be solved accurately without sending data outside a local machine by changing the problem from open entity recognition to a constrained triage task. A recall-first proposer uses lightweight encoders and rules to generate candidate spans, then a reviewer applies surrounding dialogue and speaker role to make binary Redact/Keep calls on each one. On math tutoring transcripts the strongest local setup scores 0.958 macro F1, above 0.767 for a same-family LLM baseline and 0.706 for a commercial API, while running on one laptop. The same setup loses only 0.03 F1 on a set built around curricular-personal name overlaps, where smaller reviewers lose 0.19 to 0.25. These outcomes point to problem formulation mattering more than model size for this use case.
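The two-stage flow described above can be sketched end to end. This is a toy illustration under loud assumptions, not the authors' implementation: `NAME_LEXICON` stands in for the encoder proposers, the cue-word check stands in for the context-aware reviewer, and the `[REDACTED]` placeholder is a hypothetical deployment policy.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for the encoder proposers: a tiny lexicon of name-like tokens.
NAME_LEXICON = re.compile(r"\b(?:Riemann|Morgan|Taylor)\b")
# Hypothetical curricular cue words used by the toy reviewer.
CURRICULAR_CUES = ("theorem", "sum", "hypothesis", "integral")


@dataclass(frozen=True)
class Candidate:
    turn: int   # index of the dialogue turn containing the span
    start: int  # character offsets within that turn
    end: int
    text: str


def propose(turns):
    """Stage 1 (recall-first): flag every lexicon hit, regardless of context."""
    return [
        Candidate(i, m.start(), m.end(), m.group())
        for i, (_, utterance) in enumerate(turns)
        for m in NAME_LEXICON.finditer(utterance)
    ]


def review(cand, turns):
    """Stage 2 (toy): Keep if a curricular cue appears near the span, else Redact."""
    _, utterance = turns[cand.turn]
    window = utterance[max(0, cand.start - 30): cand.end + 30].lower()
    return "Keep" if any(cue in window for cue in CURRICULAR_CUES) else "Redact"


def deidentify(turns, placeholder="[REDACTED]"):
    """Apply the final privacy action to every span the reviewer marks Redact."""
    to_redact = {}
    for cand in propose(turns):
        if review(cand, turns) == "Redact":
            to_redact.setdefault(cand.turn, []).append(cand)
    result = []
    for i, (speaker, utterance) in enumerate(turns):
        # Replace right to left so earlier offsets stay valid.
        for c in sorted(to_redact.get(i, []), key=lambda c: c.start, reverse=True):
            utterance = utterance[:c.start] + placeholder + utterance[c.end:]
        result.append((speaker, utterance))
    return result
```

With the turns `("student", "Hi, I'm Riemann.")` and `("tutor", "Great. Today we cover the Riemann sum.")`, the sketch redacts the first "Riemann" and keeps the second, which is the ambiguity the paper targets.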

Core claim

The central claim is that reframing de-identification as privacy triage with a recall-first union proposer followed by a context-aware reviewer allows a fully local system to reach 0.958 macro F1 on educational transcripts, exceeding both same-family LLM-only baselines and commercial APIs while preserving data governance.
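For readers unfamiliar with the headline metric, here is a minimal sketch of macro F1. It assumes the average is taken over the two decision labels `Redact` and `Keep`; the source does not say which classes the paper averages over, so the label set is an assumption.

```python
def f1_for(label, gold, pred):
    """F1 for one label, computed from parallel lists of gold and predicted labels."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    # With no true positives, precision and recall are both zero (or undefined).
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def macro_f1(gold, pred, labels=("Redact", "Keep")):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    return sum(f1_for(label, gold, pred) for label in labels) / len(labels)
```

Because the mean is unweighted, a system that over-redacts curricular terms is penalized on the `Keep` label even when `Redact` recall is high.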

What carries the argument

The context-aware reviewer, which makes binary Redact/Keep decisions for each candidate span using surrounding dialogue and speaker role.
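One way to picture what the reviewer sees for each candidate is sketched below. The field names, the `<<...>>` span markers, and the two-turn window are all hypothetical; the paper's verbatim prompt is truncated in this page's source, so none of this should be read as its actual format.

```python
def build_review_input(turns, turn_idx, start, end, window=2):
    """Package one candidate span with nearby turns and the speaker's role.

    turns: list of (speaker_role, utterance) pairs.
    start, end: character offsets of the candidate inside turns[turn_idx].
    window: number of turns of context on each side (an assumed value).
    """
    speaker, utterance = turns[turn_idx]
    # Mark the candidate so the reviewer knows which mention is under review.
    marked = utterance[:start] + "<<" + utterance[start:end] + ">>" + utterance[end:]
    lo = max(0, turn_idx - window)
    hi = min(len(turns), turn_idx + window + 1)
    lines = [
        f"{role}: {marked if i == turn_idx else text}"
        for i, (role, text) in enumerate(turns[lo:hi], start=lo)
    ]
    return {
        "candidate": utterance[start:end],
        "speaker_role": speaker,
        "context": "\n".join(lines),
        "allowed_decisions": ("Redact", "Keep"),
    }
```

Restricting the output to two allowed decisions is what turns open-ended entity recognition into the constrained triage task the paper describes.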

If this is right

Educational data can be de-identified at high accuracy without transmitting transcripts to third parties.
Over-generation of candidates followed by context review reduces erroneous redaction of curricular terms that resemble names.
Performance on ambiguous name cases stays stable when the reviewer has access to dialogue turns and speaker identity.
Single-laptop deployment becomes feasible for processing large collections of tutoring transcripts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The staged approach may transfer to other settings where domain terms overlap with personal names, such as medical notes or legal transcripts.
Lowering the need for large models through explicit proposer-reviewer separation could cut energy use for privacy-sensitive text processing.
Adding speaker-turn features or simple role labels might further stabilize reviewer decisions on edge cases the paper does not test.

Load-bearing premise

Surrounding dialogue context plus speaker role alone is sufficient for the reviewer to make reliable Redact/Keep decisions on ambiguous names without external knowledge or larger models.

What would settle it

A test set of dialogues in which names cannot be classified as personal or curricular from transcript context and speaker role alone, on which the cascade macro F1 falls below the reported 0.958 level.

Figures

Figures reproduced from arXiv: 2606.18372 by Bakhtawar Ahtisham, Haocheng Zhang, Kirk Vanacore, René F. Kizilcec, Zhuqian Zhou.

**Figure 1.** End-to-end cascade pipeline. A recall-first union encoder proposer (DeBERTa-v3-base + ModernBERT-base; stage 1A) plus direct rules (stage 1B) emit candidate spans; a cascade-aligned reviewer (stage 2A) decides Redact/Keep for each candidate using surrounding context and speaker role; a deployment-time policy applies the final privacy action. [...] its surrounding context, speaker role, and turn metadata, while r…
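The stage 1 union in Figure 1 can be illustrated with a short span-merging sketch. The regular expressions are simplified assumptions for the direct rules, the encoder outputs are passed in as plain `(start, end)` lists rather than produced by real models, and merging overlapping spans is one plausible way to take a union that the source does not confirm.

```python
import re

# Simplified, hypothetical direct rules for structured identifiers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def rule_spans(text):
    """Deterministic matches for emails and US-style phone numbers."""
    return [(m.start(), m.end()) for pat in (EMAIL, PHONE) for m in pat.finditer(text)]


def union_spans(*span_lists):
    """Merge overlapping or touching (start, end) spans from every source."""
    merged = []
    for start, end in sorted(span for spans in span_lists for span in spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of emitting a duplicate candidate.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Taking the union favors recall: a span proposed by any single source survives to the reviewer, which is then responsible for precision.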

**Figure 2.** Canonical-to-challenge F1 shift per configuration.

**Figure 3.** Per-dialogue recall for the fully-LLM Gemma E4B baseline declines as dialogue length grows: a linear fit yields slope −0.013 per 1K characters (p < 10⁻⁵, R² = 0.12), and bucketed mean recall drops from 0.54 (<2K chars) to 0.11 (≥20K chars). We hypothesize that this recall collapse reflects an attention-budget limitation: the full-dialogue formulation must locate every PII span across thousands …

Original abstract

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The cascade splits proposal from review to keep local de-id accurate on educational names, but abstract-level evidence leaves the ambiguity resolution claim shaky.

The main takeaway is a recall-first local cascade: lightweight encoders plus rules over-generate name candidates, then a context-and-role reviewer makes binary redact/keep calls. On math tutoring transcripts this hits 0.958 macro F1, beating a same-family LLM baseline at 0.767 and a commercial API at 0.706, all on one laptop.

The formulation is the useful part. Treating de-identification as constrained triage rather than open NER or direct prompting lets the system catch PII without over-redacting curricular terms like mathematician names. The targeted challenge set shows the best reviewer drops only 0.03 F1 while smaller ones drop 0.19-0.25, which suggests the context step adds something.

The soft spot is exactly the one the stress-test flags. The abstract gives no dataset sizes, no training details, and no description of how the curricular-personal ambiguity cases were built. If the reviewer really decides on a local dialogue window plus speaker role alone, that signal may not be enough when the same surface form is genuinely ambiguous in real transcripts. The 0.03 degradation number is only as strong as the challenge set, and without seeing its construction or the full methods it is hard to trust the formulation-over-scale conclusion.

This is for ed-tech or on-device privacy work. The head-to-head numbers and the local constraint are concrete enough that it deserves referee time, even if the paper will need to add data, code, and clearer challenge-set documentation before publication.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a fully local cascade for de-identifying educational dialogues by reframing the task as privacy triage. A recall-first union proposer generates candidate spans using lightweight encoders and deterministic rules, and a context-aware reviewer makes binary Redact/Keep decisions based on surrounding dialogue and speaker role. On math tutoring transcripts from two platforms, the strongest configuration achieves 0.958 macro F1, outperforming a same-family LLM baseline (0.767) and a commercial API (0.706). On a targeted challenge set for curricular-personal name ambiguity, it degrades by only 0.03 F1 compared to 0.19-0.25 for smaller reviewers.

Significance. If the results are reproducible, this work demonstrates that a local, multi-stage approach can achieve high accuracy in handling ambiguous PII in educational data without relying on external APIs, addressing both privacy governance and performance. The inclusion of a targeted challenge set provides a concrete test of the method's robustness to name ambiguity. Strengths include the fully local execution on a single laptop and the empirical comparison showing the value of the cascade formulation over scale alone.

major comments (3)

[Abstract] The reported performance metrics (0.958 macro F1, 0.03 degradation) are presented without details on dataset sizes, exact model architectures, training procedures, or statistical significance tests, which are necessary to evaluate the reliability of the central empirical claims.
[Methods (reviewer component)] The claim that the reviewer can reliably decide Redact/Keep using only surrounding dialogue and speaker role is load-bearing for the advantage over LLM baselines, but the manuscript does not provide analysis or examples of cases where this local context is insufficient for resolving curricular vs. personal name ambiguity.
[Evaluation] The construction and size of the 'targeted challenge set of curricular-personal name ambiguity' are not described, undermining the interpretation of the 0.03 F1 degradation result as evidence of robustness.

minor comments (1)

[Abstract] The term 'smaller reviewers' is used without definition or reference to specific configurations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below and indicate the revisions we will incorporate.

Referee: [Abstract] The reported performance metrics (0.958 macro F1, 0.03 degradation) are presented without details on dataset sizes, exact model architectures, training procedures, or statistical significance tests, which are necessary to evaluate the reliability of the central empirical claims.

Authors: The abstract is intentionally concise to highlight the core contribution. Full details on the two-platform dataset sizes, exact encoder and reviewer architectures, training procedures, and evaluation protocol appear in the Methods and Evaluation sections. We agree that the abstract would benefit from additional specificity and will revise it to include dataset sizes, model family details, and a statement on statistical significance (computed via bootstrap resampling over the test transcripts). revision: yes
Referee: [Methods (reviewer component)] The claim that the reviewer can reliably decide Redact/Keep using only surrounding dialogue and speaker role is load-bearing for the advantage over LLM baselines, but the manuscript does not provide analysis or examples of cases where this local context is insufficient for resolving curricular vs. personal name ambiguity.

Authors: We acknowledge that the manuscript would be strengthened by explicit discussion of the reviewer's limitations. While the quantitative results on the challenge set already demonstrate robustness relative to smaller reviewers, we will add a qualitative analysis subsection containing representative examples of curricular-personal name ambiguity, cases where speaker role and dialogue context suffice for correct decisions, and any observed failure modes where additional context would be required. revision: yes
Referee: [Evaluation] The construction and size of the 'targeted challenge set of curricular-personal name ambiguity' are not described, undermining the interpretation of the 0.03 F1 degradation result as evidence of robustness.

Authors: The challenge set is referenced in the Evaluation section, but we agree that its construction, size, and selection criteria require fuller description to support the robustness claim. We will expand the relevant paragraph to detail how instances were identified (names appearing in both curricular and personal contexts), the total number of spans, annotation process, and how the set was held out from training. revision: yes
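The first response above mentions bootstrap resampling over the test transcripts. A minimal percentile-bootstrap sketch follows, assuming each transcript is a list of `(gold, predicted)` decision pairs and that whole transcripts are the resampling unit. The number of resamples, the seed, and the interval level are illustrative defaults, and the rebuttal itself is simulated, so this is not a description of the authors' procedure.

```python
import random


def macro_f1(pairs, labels=("Redact", "Keep")):
    """Macro F1 over (gold, predicted) pairs for the assumed two-label task."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(labels)


def bootstrap_ci(transcripts, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for macro F1, resampling whole transcripts."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw transcripts with replacement, then pool their decisions.
        sample = rng.choices(transcripts, k=len(transcripts))
        stats.append(macro_f1([pair for t in sample for pair in t]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Resampling transcripts rather than individual spans respects the fact that decisions within one dialogue are correlated.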

Circularity Check

0 steps flagged

No circularity: empirical head-to-head evaluation on external transcripts

The paper reports macro F1 scores from direct evaluation of a proposed local cascade (proposer + reviewer) against same-family LLM baselines and a commercial API on math tutoring transcripts. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the derivation chain. The central claims rest on measured performance differences (0.958 vs. 0.767/0.706) and degradation on a challenge set, which are falsifiable against the held-out data rather than constructed from the method itself.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; the ledger is therefore limited to the high-level assumptions visible in the abstract text.

free parameters (1)

reviewer decision threshold
Binary Redact/Keep decision likely depends on an implicit or tuned threshold not specified in the abstract.
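If the reviewer does emit a score rather than a hard label, the free parameter above could be tuned on held-out data roughly as follows. The score semantics (higher means more likely Redact), the grid, and the choice of F1 on the Redact label as the objective are all assumptions; the abstract does not confirm that any threshold is tuned.

```python
def redact_f1(scores, gold, threshold):
    """F1 on the Redact label when spans scoring >= threshold are redacted.

    scores: reviewer scores in [0, 1]; gold: True where the span is real PII.
    """
    pred = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, gold, grid=None):
    """Return the lowest grid threshold that maximizes Redact F1."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: redact_f1(scores, gold, t))
```

Returning the lowest maximizing threshold leans toward redaction when several thresholds tie, which matches a privacy-first deployment preference.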

axioms (1)

domain assumption: Dialogue context and speaker role suffice to disambiguate curricular versus personal names
This premise underpins the reviewer stage and is invoked when the abstract claims robustness on the name-ambiguity challenge set.

pith-pipeline@v0.9.1-grok · 5808 in / 1426 out tokens · 45955 ms · 2026-06-27T00:51:38.522567+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 7 canonical work pages

[1] S. Singhal, A. F. Zambrano, M. Pankiewicz, X. Liu, C. Porter, R. S. Baker, De-identifying student personally identifying information with GPT-4, in: Proceedings of the 17th International Conference on Educational Data Mining, International Educational Data Mining Society, Atlanta, Georgia, USA, 2024, pp. 559–565. URL: https://educationaldatamining.org/e... doi:10.5281/zenodo.12729884
[2] Z. Ji, Y. Shen, K. R. Koedinger, J. Lin, Enhancing the de-identification of personally identifiable information in educational data, Journal of Educational Data Mining (2025). URL: https://jedm.educationaldatamining.org/index.php/JEDM/article/view/936. doi:10.5281/zenodo.17114271
[3] Z. Zhou, K. Vanacore, B. Ahtisham, J. Lee, D. Pietrzak, D. Hedley, J. Dias, C. Shaw, R. Schäfer, R. F. Kizilcec, Utility-preserving de-identification for math tutoring: Investigating numeric ambiguity in the MathEd-PII benchmark dataset, arXiv preprint arXiv:2602.16571 (2026). URL: https://arxiv.org/abs/2602.16571
[4] A. Stubbs, C. Kotfila, Ö. Uzuner, Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track 1, Journal of Biomedical Informatics 58 (2015) S11–S19. URL: https://pubmed.ncbi.nlm.nih.gov/26225918/. doi:10.1016/j.jbi.2015.06.007
[5] A. Kovačević, B. Bašaragin, N. Milošević, G. Nenadić, De-identification of clinical free text using natural language processing: A systematic review of current approaches, Artificial Intelligence in Medicine 151 (2024) 102845. URL: https://doi.org/10.1016/j.artmed.2024.102845. doi:10.1016/j.artmed.2024.102845
[6] M. Zent, D. Smith, S. Woodhead, PIIvot: A lightweight NLP anonymization framework for question-anchored tutoring dialogues, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Suzhou, China, 2025, pp. 27479–27488. URL: https://aclanthology.org/2025.emnlp-main.1397/. doi:10...
[7] L. Holmes, J. Wang, S. Crossley, W. Zhang, The cleaned repository of annotated personally identifiable information, in: Proceedings of the 17th International Conference on Educational Data Mining, 2024. URL: https://educationaldatamining.org/edm2024/proceedings/2024.EDM-posters.88/index.html
[8] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength natural language processing in Python, Software, 2020. URL: https://spacy.io
[9] Microsoft, Presidio: Data protection and de-identification SDK, Software documentation, 2024. URL: https://microsoft.github.io/presidio/
[10] P. Awasthy, T. Moon, N. Jian, R. Florian, Cascaded models for better fine-grained named entity recognition, arXiv preprint arXiv:2009.07317 (2020). URL: https://arxiv.org/abs/2009.07317
[11] P. Huang, X. Zhao, M. Hu, Z. Tan, W. Xiao, T2-NER: A two-stage span-based framework for unified named entity recognition with templates, Transactions of the Association for Computational Linguistics 11 (2023) 1265–1282. URL: https://aclanthology.org/2023.tacl-1.72. doi:10.1162/tacl_a_00602
[12] Y. Li, Y. Yu, T. Qian, Type-aware decomposed framework for few-shot named entity recognition, arXiv preprint arXiv:2302.06397 (2023). URL: https://arxiv.org/abs/2302.06397
[13] G. Jiang, Z. Luo, Y. Shi, D. Wang, J. Liang, D. Yang, ToNER: Type-oriented named entity recognition with generative language model, arXiv preprint arXiv:2404.09145 (2024). URL: https://arxiv.org/abs/2404.09145
[14] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=sE7-XhLxHA
[15] B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, N. Cooper, G. Adams, J. Howard, I. Poli, Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, arXiv preprint arXiv:2412.13663 (2024). URL: https://arxiv....
[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988. URL: https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html. doi:10.1109/ICCV.2017.324
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019). URL: https://arxiv.org/abs/1907.11692
[18] Gemma Team, Google DeepMind, Gemma 4: Open models for on-device and local deployment, Model release; model cards for gemma-4-E4B-it and gemma-4-31B-it, 2026. URL: https://blog.google/technology/developers/gemma-4/
[19, 20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations. URL: https://openreview.net/forum?id=nZeVKeeFYf9
[21] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, Advances in Neural Information Processing Systems 36 (2023) 10088–10115. URL: https://arxiv.org/abs/2305.14314
[22] Google, Gemini 3.1 pro preview model card, Google AI for Developers, 2026. URL: https://ai.google.dev/gemini-api/docs/models.
A. Methodology Notes. Candidate spans are divided into two pools with different inference paths. (1) Direct-rule pool (18 rows on Platform A, 0 on Platform B): structured identifiers matched by deterministic rules (e.g., email, phone...
[23] A word-problem note instructing Keep when a candidate name appears in a mathematical scenario
[24] Pedagogical direct-address counter-examples (“Hi, I’m Morgan”, “Thanks, Taylor!”) using names chosen specifically to not overlap with the test-set failure cases, to avoid overfitting the prompt to specific evaluation spans. Prompt text (verbatim). You are a PII verification analyst for educational de-identification. You will receive one candidate span pro...
