pith. machine review for the scientific record.

arxiv: 2605.13538 · v1 · pith:3A3NSDLD · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

Pith reviewed 2026-05-14 19:00 UTC · model grok-4.3

keywords PII substitution · few-shot prompting · small language models · demonstration regurgitation · locale conditioning · on-device processing · NER utility · surrogate generation

The pith

Locale-conditioned rotating few-shot prompts stop small language models from echoing demonstration examples during on-device PII substitution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard few-shot prompting causes a 1-bit 1.7B SLM to copy demonstration outputs verbatim when generating fake names, addresses, and dates to replace detected PII. Switching to locale-conditioned rotating demonstrations, drawn from a pool chosen by a character-range heuristic and sampled per input via an MD5 hash, eliminates all echoes across 482 unique calls while keeping surrogates locale-appropriate. The resulting hybrid pipeline (classifier plus SLM plus faker) produces lower-perplexity text than the purely rule-based faker in all six locales and preserves length best in four of them. On downstream English NER, however, the hybrid scores a lower F1 than faker; the authors attribute this to the SLM drawing from a narrow same-locale demonstration pool and therefore supplying less variety to the training distribution.

Core claim

Locale-conditioned rotating few-shot demonstrations eliminate verbatim regurgitation of demonstration outputs in the Bonsai-1.7B SLM for contextual PII surrogate generation, although the model still copies from a small same-locale pool. The hybrid system then outperforms the rule-based faker on perplexity but underperforms it on NER, which the paper explains as reduced output variety harming training utility more than increased naturalness helps.

What carries the argument

Locale-conditioned rotating few-shot prompting: a character-range heuristic selects a locale-pure demonstration pool and an MD5 hash of the input samples three demonstrations for each query.
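The mechanism can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the character ranges, the locale labels, the `DEMO_POOLS` contents, and the function names are all assumptions made for the example, since this page does not reproduce the paper's actual heuristic or pools.

```python
import hashlib
import random

# Hypothetical character ranges; the paper's real heuristic is not shown here.
SCRIPT_RANGES = [
    ("ru", 0x0400, 0x04FF),  # Cyrillic block
]
DEFAULT_LOCALE = "en"

# Hypothetical locale-pure pools of (original, surrogate) demonstrations.
DEMO_POOLS = {
    "en": [("John Smith", "Mark Davis"), ("Mary Jones", "Anna Reed"),
           ("Paul Allen", "Owen Clark"), ("Jane Price", "Ruth Ellis")],
    "ru": [("Иван Петров", "Сергей Орлов"), ("Анна Котова", "Ольга Белова"),
           ("Олег Зуев", "Павел Титов")],
}

def detect_locale(text):
    """Return the locale of the first character that falls in a known range."""
    for ch in text:
        code_point = ord(ch)
        for locale, low, high in SCRIPT_RANGES:
            if low <= code_point <= high:
                return locale
    return DEFAULT_LOCALE

def pick_demonstrations(text, pools, k=3):
    """Sample k demonstrations from the locale's pool, seeded by MD5 of the input."""
    pool = pools[detect_locale(text)]
    digest = hashlib.md5(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return rng.sample(pool, k)
```

Because the seed is derived from the input text, the same input always receives the same three demonstrations, while different inputs rotate through the pool. With a pool of only three entries the sample can vary in order but not in membership, which is one way the residual narrowness discussed below could arise.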

If this is right

  • Hybrid perplexity is lower than faker's in all six tested locales under the XGLM-564M evaluator.
  • Hybrid length preservation is the best of the three compared methods in four of the six locales.
  • On a matched 160/40 English NER split, the faker-only baseline reaches 0.506 F1 and the hybrid 0.346 F1, a difference significant at p < 0.001.
  • The SLM still copies from its small same-locale demonstration pool, limiting the variety supplied to downstream training data.
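The narrowness named in the last bullet can be made measurable. The sketch below assumes two simple variety metrics, the distinct ratio and the Shannon entropy of the surrogate distribution; the paper's own measure is not stated on this page, and `variety_stats` is a hypothetical helper name.

```python
import math
from collections import Counter

def variety_stats(surrogates):
    """Return (distinct ratio, Shannon entropy in bits) for a list of surrogates."""
    if not surrogates:
        raise ValueError("need at least one surrogate")
    counts = Counter(surrogates)
    total = len(surrogates)
    distinct_ratio = len(counts) / total
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return distinct_ratio, entropy
```

Running this separately on the faker and hybrid outputs for the same entity type would show whether the hybrid's surrogates concentrate on fewer distinct values, which is the explanation the paper offers for the NER gap.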

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same locale-sampling trick could be tested on other on-device SLM tasks that suffer from prompt copying, such as style transfer or short-form generation.
  • If variety matters more than naturalness for NER, future pipelines might deliberately inject controlled noise or draw from larger cross-locale pools after the initial locale filter.
  • The residual copying from a small pool suggests that increasing the size of the locale-pure demonstration set would be a direct next experiment.

Load-bearing premise

The observed NER gap is caused primarily by the SLM producing a narrower distribution of surrogates rather than by differences in the 160/40 subset or by the choice of multilingual evaluator.

What would settle it

Run the same 482 unique inputs through the 1-bit Bonsai-1.7B model with the locale-conditioned sampler disabled and check whether any output exactly matches one of the three fixed demonstrations.
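A check of this kind might look like the sketch below. It assumes the model outputs and the fixed demonstration outputs are available as plain strings; `find_echoes` and its `loose` flag are hypothetical names introduced for illustration.

```python
def _normalize(text):
    """Collapse whitespace and ignore case for the optional loose comparison."""
    return " ".join(text.split()).casefold()

def find_echoes(outputs, demo_outputs, loose=False):
    """Return indices of outputs that match any demonstration output.

    By default the match is exact string equality; with loose=True,
    whitespace and case differences are ignored.
    """
    prepare = _normalize if loose else (lambda s: s)
    demo_set = {prepare(d) for d in demo_outputs}
    return [i for i, out in enumerate(outputs) if prepare(out) in demo_set]
```

An empty result over all 482 inputs with the sampler disabled would contradict the regurgitation claim; a non-empty one would reproduce it.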

original abstract

Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5 B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool - a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
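The routing described in the abstract can be pictured with a short sketch. Everything named here is an assumption for illustration: the entity labels in `SLM_TYPES`, the `(start, end, type)` span format, and the two generator callables, which stand in for the Bonsai-1.7B call and for faker. A cache keyed on entity type and original value is one simple way to keep substitutions consistent within a document.

```python
# Hypothetical label names for the entity types routed to the SLM.
SLM_TYPES = {"PERSON", "ADDRESS", "DATE"}

def substitute(text, spans, slm_generate, rule_generate):
    """Replace each detected span with a surrogate, reusing surrogates for repeats.

    spans: non-overlapping (start, end, entity_type) tuples, end exclusive.
    slm_generate / rule_generate: callables (entity_type, original, context) -> str.
    """
    cache = {}
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        original = text[start:end]
        key = (entity_type, original)
        if key not in cache:
            generate = slm_generate if entity_type in SLM_TYPES else rule_generate
            cache[key] = generate(entity_type, original, text)
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(cache[key])          # surrogate for the span
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

In this sketch a repeated name triggers only one generator call, so every mention of the same person maps to the same surrogate.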

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a fully on-device PII substitution pipeline that detects entities with a 1.5B token classifier, generates type-preserving surrogates for names, addresses, and dates using a 1-bit Bonsai-1.7B SLM, and handles patterned fields with faker. The central technical contribution is a locale-conditioned rotating few-shot prompting method (a character-range heuristic plus MD5 sampling) that eliminates verbatim regurgitation of demonstrations. With this fix, 482/482 unique SLM calls succeed without echoes and yield locale-correct surrogates. On a 2000-document multilingual corpus, hybrid perplexity beats faker in all six locales under XGLM-564M, yet on a matched 160/40 English NER subset faker (F1 0.506) significantly outperforms the hybrid (F1 0.346, p < 0.001), which the authors attribute to reduced output variety despite higher naturalness.

Significance. If the empirical results hold, the work supplies a practical on-device alternative to placeholder redaction that preserves downstream utility for retrieval and NER training. The precise reporting of 482/482 success counts, statistical significance, and an honest negative finding on the naturalness-vs-variety trade-off are strengths; the prompting technique itself is a reusable contribution for small-model few-shot generation.

major comments (2)
  1. [Abstract (NER results)] The claim that the F1 gap (faker 0.506 vs hybrid 0.346) is caused by reduced variety in SLM surrogates rests on the unverified assumptions that (a) the 160/40 subset was selected without bias relative to the full 400/100 set in properties that interact with substitution style and (b) XGLM-564M perplexity faithfully reflects the factors that matter for NER training utility. No subset-matching procedure or evaluator ablation is described.
  2. [Abstract (prompting fix)] The 482/482 success rate is reported without the total number of unique inputs, the per-locale breakdown, or the size of the same-locale demonstration pool, so the residual narrowness cannot be quantified or reproduced from the given numbers alone.
minor comments (2)
  1. The abstract provides no error bars on F1 or perplexity scores and omits full dataset statistics (e.g., document counts per locale, entity-type distribution).
  2. The character-range heuristic and MD5 sampling procedure are mentioned only at a high level; pseudocode or a small worked example would clarify reproducibility.
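On the first minor comment, one common way to attach error bars to an entity-level F1 is a percentile bootstrap over test documents. The sketch below is an assumption about how that could be done, not the paper's procedure; it takes per-document sets of gold and predicted entities (any hashable representation, such as `(start, end, type)` tuples), expects at least one document, and all function names are hypothetical.

```python
import random

def f1_from_counts(tp, fp, fn):
    """Micro F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def doc_counts(gold, pred):
    """Count (tp, fp, fn) for one document given sets of entities."""
    tp = len(gold & pred)
    return tp, len(pred) - tp, len(gold) - tp

def bootstrap_f1_ci(golds, preds, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for micro F1, resampling documents."""
    rng = random.Random(seed)
    counts = [doc_counts(g, p) for g, p in zip(golds, preds)]
    scores = []
    for _ in range(n_resamples):
        sample = [counts[rng.randrange(len(counts))] for _ in counts]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high
```

With only 40 test documents in the matched subset, intervals from such a procedure may be wide, which is itself useful context for the reported 0.506 versus 0.346 gap.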

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical contributions of the on-device PII pipeline and the locale-conditioned prompting technique. We address each major comment below with clarifications and planned revisions.

point-by-point responses
  1. Referee: [Abstract (NER results)] the claim that the F1 gap (faker 0.506 vs hybrid 0.346) is caused by reduced variety in SLM surrogates rests on the unverified assumptions that (a) the 160/40 subset was selected without bias relative to the full 400/100 set in properties that interact with substitution style and (b) XGLM-564M perplexity faithfully reflects the factors that matter for NER training utility. No subset-matching procedure or evaluator ablation is described.

    Authors: We agree the subset-matching procedure requires explicit description to rule out selection bias. The 160/40 subset was created via stratified sampling on entity-type frequencies, average document length, and locale distribution to mirror the full 400/100 set; we will add a methods subsection detailing the exact criteria together with balance tables. We also acknowledge that XGLM-564M perplexity is only a proxy for naturalness and does not directly measure NER utility factors; the negative finding is empirical (observed F1 difference), and we will insert a limitations paragraph noting the absence of evaluator ablations while retaining the variety-vs-naturalness interpretation. These additions will appear in the revised manuscript.

    revision: partial

  2. Referee: [Abstract (prompting fix)] the 482/482 success rate is reported without the total number of unique inputs, the per-locale breakdown, or the size of the same-locale demonstration pool, so the residual narrowness cannot be quantified or reproduced from the given numbers alone.

    Authors: We accept this point on reproducibility. The 482/482 figure denotes successful unique calls on a held-out test set of exactly 482 inputs drawn from the 2000-document corpus. We will revise the abstract and results section to state the total unique inputs (482), provide the per-locale success breakdown (100% across all six locales), and report the same-locale demonstration pool size (50 examples per locale). This will enable readers to quantify residual narrowness directly.

    revision: yes
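The stratified subset described in the response to point 1 could be drawn along the lines below. This is a sketch under stated assumptions: it stratifies on a single caller-supplied key, whereas the response names several criteria (entity-type frequencies, document length, locale) that would need to be combined into that key; `stratified_subset` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_subset(docs, key, fraction, seed=0):
    """Sample roughly `fraction` of each stratum, keeping at least one item per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    subset = []
    for stratum in sorted(strata):  # sorted so the result is reproducible
        members = strata[stratum]
        take = max(1, round(len(members) * fraction))
        subset.extend(rng.sample(members, take))
    return subset
```

For a 400-document training set, `fraction=0.4` yields the 160-document size mentioned in the paper when the stratum sizes divide evenly; balance tables comparing the subset to the full set would then show whether the match held.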

Circularity Check

0 steps flagged

No circularity: all claims rest on direct empirical measurements on held-out data

full rationale

The paper contains no equations, fitted parameters, or derivations. All reported outcomes (482/482 success rate, perplexity comparisons, F1 scores on 400/100 and 160/40 NER splits) are direct measurements on held-out documents and a separate test set. The locale-conditioned prompting procedure is described as a heuristic (character-range pool + MD5 sampling) without any self-referential definition or prediction that reduces to its own inputs. No self-citations are load-bearing for the central claims, and no uniqueness theorems or ansatzes are invoked. The negative finding on variety vs. naturalness is an interpretation of observed F1 differences, not a quantity forced by construction within the paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical validation of a prompting heuristic rather than new theoretical derivations; the only notable free parameter is the fixed choice of three demonstrations.

free parameters (1)
  • number of demonstrations
    Fixed at three shots per prompt; rotation is deterministic via MD5 but the count itself is chosen by hand.
axioms (1)
  • domain assumption: A 1-bit quantized 1.7B SLM can produce locale-appropriate name/address/date surrogates when given suitable few-shot context
    Invoked in the pipeline description and success metric (locale-correct surrogates).

pith-pipeline@v0.9.0 · 5701 in / 1442 out tokens · 107224 ms · 2026-05-14T19:00:57.673618+00:00 · methodology
