arXiv: 2604.10271 · v3 · submitted 2026-04-11 · cs.CR · cs.CL · cs.IR

Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution

Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3

classification: cs.CR, cs.CL, cs.IR
keywords: homoglyph substitution, stylometry, adversarial stylometry, privacy, Unicode, authorship attribution, forensic linguistics

The pith

Homoglyph substitution degrades stylometric systems by replacing characters with visually similar alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that swapping letters in text for look-alikes from other Unicode scripts can weaken stylometric tools that infer author traits such as age range or country from writing patterns. This matters because social-media posts already allow statistical recovery of personal details comparable to what leaks from ID documents, and simply avoiding text disclosure is impractical. The proposed method keeps the text readable while targeting the character features that stylometry depends on. A sympathetic reader would see this as a lightweight privacy tool for everyday online writing.

Core claim

Performing homoglyph substitution on text degrades stylometric systems, allowing authors to reduce the leakage of personal information such as estimated age and geographic location that these systems can otherwise extract from voluntary text disclosures.

What carries the argument

Homoglyph substitution, defined as the replacement of characters with visually similar alternatives drawn from different Unicode code points (for example, Latin 'h' [U+0068] with Cyrillic 'һ' [U+04BB]), which targets and disrupts the character-level patterns that stylometric classifiers use.
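
As a rough illustration of the substitution described above, the following Python sketch swaps a few Latin letters for Cyrillic look-alikes. Only the 'h' to U+04BB pair is taken from the paper's abstract; the other table entries and the names `HOMOGLYPHS` and `substitute_homoglyphs` are assumptions made for this example and are not the author's implementation.

```python
# Illustrative homoglyph table: Latin letter -> visually similar Cyrillic code point.
# Only the 'h' -> U+04BB pair comes from the paper's abstract; the remaining
# entries are assumptions added for demonstration.
HOMOGLYPHS = {
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
}


def substitute_homoglyphs(text, table=HOMOGLYPHS):
    """Replace every character that has an entry in `table` with its look-alike."""
    return text.translate(str.maketrans(table))


original = "hello"
altered = substitute_homoglyphs(original)

# The two strings render almost identically but differ at the code-point level.
print([f"U+{ord(c):04X}" for c in original])
print([f"U+{ord(c):04X}" for c in altered])
```

Substituting every eligible character is the simplest policy; a partial or randomized substitution rate would be a different design choice and is not shown here.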

If this is right

  • Stylometric authorship attribution and trait inference become measurably less reliable on the altered text.
  • Individuals can reduce the personal information extractable from their online writing while preserving visual readability.
  • Adversarial stylometry provides a practical defense against forensic analysis of voluntary text disclosures.
  • Text can be altered to hinder stylometric recovery of demographic signals such as age group or location.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Stylometric tools may require explicit Unicode normalization steps to remain effective against this class of obfuscation.
  • An iterative arms race could develop between substitution techniques and improved detection or normalization methods.
  • The approach might generalize to other character-based privacy protections in digital communication.

Load-bearing premise

Stylometric systems depend on character-level or Unicode-sensitive features that homoglyph substitution will reliably disrupt without being normalized away by standard preprocessing or creating new detectable signals.
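
To make the premise concrete, the sketch below compares character trigram sets before and after a single homoglyph swap. Character n-grams are a common stylometric feature, but treating them as the feature set here is an assumption; the page does not say which features the paper's systems use. The helper names `char_ngrams` and `jaccard` and the sample sentence are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def jaccard(profile_a, profile_b):
    """Jaccard similarity between the n-gram sets of two profiles."""
    keys_a, keys_b = set(profile_a), set(profile_b)
    if not keys_a and not keys_b:
        return 1.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


original = "the hedgehog hid behind the shed"
# Swap Latin 'h' (U+0068) for Cyrillic shha (U+04BB), the pair from the abstract.
altered = original.replace("h", "\u04bb")

similarity = jaccard(char_ngrams(original), char_ngrams(altered))
print(f"trigram Jaccard similarity after substitution: {similarity:.2f}")
```

Any trigram that contained a substituted letter no longer matches its original, so a feature extractor working on raw code points sees a different profile even though the rendered text looks the same.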

What would settle it

An experiment in which stylometric accuracy on the modified text remains statistically unchanged from the original, or in which routine Unicode normalization restores full performance.
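
One small piece of that experiment can be checked directly with Python's standard `unicodedata` module: whether the standard normalization forms map the abstract's example character back to Latin 'h'. The sketch below is only a single-character check, not a stylometric evaluation; the fullwidth 'ｈ' (U+FF48) is added here purely as a contrast case and does not come from the paper.

```python
import unicodedata

LATIN_H = "\u0068"        # LATIN SMALL LETTER H
CYRILLIC_SHHA = "\u04bb"  # CYRILLIC SMALL LETTER SHHA (pair from the abstract)
FULLWIDTH_H = "\uff48"    # FULLWIDTH LATIN SMALL LETTER H (added for contrast)

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def folds_to(char, target, form):
    """True if normalizing `char` under `form` yields exactly `target`."""
    return unicodedata.normalize(form, char) == target


for form in FORMS:
    print(form,
          "U+04BB -> h:", folds_to(CYRILLIC_SHHA, LATIN_H, form),
          "| U+FF48 -> h:", folds_to(FULLWIDTH_H, LATIN_H, form))
```

U+04BB has no decomposition mapping, so none of the four forms rewrite it, whereas the compatibility forms (NFKC, NFKD) fold the fullwidth letter to 'h'. Folding cross-script look-alikes would need a separate confusables-based step, such as the skeleton mechanism described in Unicode Technical Standard #39, which is not exercised here.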

Figures

Figures reproduced from arXiv: 2604.10271 by Robert Dilworth.

Figure 1: The paper’s structure (with annotations) post …
Figure 2: Kagi Translate acts as a conduit for adversarial stylometry, rending au…
Figure 3: A Taxonomic Overview of the Adversarial Attacks: …
Figure 4: TraceTarnish: Our stylometric attack script–a gestalt modular framework where each component contributes to a whole that is greater than the sum of its parts; incorporating homoglyph functionality resulted in the following processing pipeline for razing authorship: Translation → Obfuscation → Imitation → Injection
Figure 5: An enumeration of the adversarial attacks examined by …
Figure 6: The sentences to be evaluated, representing 100% Injection for each ex…
Figure 7: A plot capturing the results of the homoglyph-based Injection-optimality …
Figure 8: TraceTarnish, in its current state, implements an Injection amalgam, interspersing both homoglyphs and zero-width characters into text to shroud authorship. To demonstrate the efficiency of the Injection component, we rerun Experiment #1, incrementally introducing both homoglyphs and zero-width characters in a stepwise fashion. The following string represents 100% Injection, with the “bad characters” h…
Figure 9: Distance measures in stylometry are mathematical methods used to quan…
Original abstract

In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" $\texttt{[U+0068]}$ $\rightarrow$ "һ" $\texttt{[U+04BB]}$)--on text can degrade stylometric systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes homoglyphic substitution, the replacement of Latin characters with visually similar glyphs from other Unicode blocks (e.g., U+0068 'h' to U+04BB 'һ'), as an adversarial stylometry technique for degrading the author attribution and profiling performance of stylometric systems.

Significance. If empirically validated, the approach could supply a lightweight, accessible method for textual privacy protection against stylometric inference of attributes such as age or location. It extends prior adversarial stylometry work but currently offers only a descriptive claim without demonstrated effectiveness or robustness.

major comments (2)
  1. Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.
  2. Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.
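
The "new detectable signals" concern in comment 1 and the script-detection step in comment 2 can be sketched with a simple heuristic. Python's standard library does not expose the Unicode Script property, so the example below approximates a character's script by the first word of its Unicode name; that shortcut, the function names, and the sample strings are all assumptions for illustration rather than a method from the paper or the referee report.

```python
import unicodedata


def script_prefix(ch):
    """Rough script label: the first word of the character's Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def non_latin_ratio(text):
    """Fraction of alphabetic characters whose rough script is not LATIN."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(script_prefix(c) != "LATIN" for c in letters) / len(letters)


def mixed_script_words(text):
    """Whitespace-separated words whose letters span more than one rough script."""
    flagged = []
    for word in text.split():
        scripts = {script_prefix(c) for c in word if c.isalpha()}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


clean = "the shed"
altered = "t\u04bbe shed"  # 'h' in the first word replaced by U+04BB

print(mixed_script_words(clean), non_latin_ratio(clean))
print(mixed_script_words(altered), non_latin_ratio(altered))
```

A word written entirely in one non-Latin script is not flagged by `mixed_script_words`, so the check targets only the within-word mixing that partial homoglyph substitution produces.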

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important gaps in empirical support and robustness analysis. We agree that the current manuscript is primarily conceptual and will incorporate experiments, implementation details, and preprocessing evaluations in the revised version to strengthen the claims.

Point-by-point responses
  1. Referee: Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.

    Authors: We acknowledge that the present version offers a descriptive proposal without quantitative validation. The manuscript introduces homoglyphic substitution as an adversarial stylometry technique but does not report experiments, datasets, or metrics. In revision, we will add empirical evaluations on standard stylometric corpora (e.g., using author attribution accuracy and attribute inference F1 scores), detail the substitution algorithm and parameters, and explicitly test for introduced signals such as non-Latin character frequency distributions to demonstrate that the method does not create easily detectable artifacts. revision: yes

  2. Referee: Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.

    Authors: This is a fair and substantive critique. The current text does not examine how homoglyph substitution interacts with common text normalization pipelines. We will revise the manuscript to include a dedicated analysis section that evaluates survival rates under NFKC/NFD normalization, script detection heuristics, and various tokenizers (e.g., word-level, subword, and Unicode-aware). Where the substitution is neutralized, we will discuss mitigation strategies or clearly delineate the threat model under which the technique remains effective. revision: yes

Circularity Check

0 steps flagged

No circularity: purely descriptive claim with no derivations or fitted elements

full rationale

The paper presents an exploratory idea that homoglyph substitution can degrade stylometric systems. No equations, parameters, predictions, or derivation chains appear in the provided text. The abstract and description frame the work as an investigation rather than a mathematical result derived from prior self-referential steps. None of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.) apply, as there are no load-bearing logical reductions to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or required for the high-level claim.

pith-pipeline@v0.9.0 · 5532 in / 977 out tokens · 39508 ms · 2026-05-10T15:20:38.105681+00:00 · methodology
