Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution
Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3
The pith
Homoglyph substitution degrades stylometric systems by replacing characters with visually similar alternatives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performing homoglyph substitution on text degrades stylometric systems, allowing authors to reduce the leakage of personal information such as estimated age and geographic location that these systems can otherwise extract from voluntary text disclosures.
What carries the argument
Homoglyph substitution, defined as the replacement of characters with visually similar alternatives drawn from different Unicode code points (for example, Latin 'h' [U+0068] with Cyrillic 'һ' [U+04BB]), which targets and disrupts the character-level patterns that stylometric classifiers use.
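A minimal sketch of the substitution described above may help make it concrete. The mapping table, the `substitute` function, and its `rate` and `seed` parameters are assumptions made for illustration; this review does not give the paper's actual table or algorithm.

```python
import random

# Hypothetical, hand-picked map from Latin letters to visually similar
# Cyrillic code points. The paper's own table is not given in this review.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def substitute(text, rate=1.0, seed=0):
    """Replace each mappable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )
```

Calling `substitute("hello")` returns a string of the same length that typically renders much like the original but whose code points differ wherever the map applies. A `rate` below 1.0 leaves some characters untouched, which is one way such a tool might trade disruption against detectability.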
If this is right
- Stylometric authorship attribution and trait inference become measurably less reliable on the altered text.
- Individuals can reduce the personal information extractable from their online writing while preserving visual readability.
- Adversarial stylometry provides a practical defense against forensic analysis of voluntary text disclosures.
- Text can be altered to hinder stylometric recovery of demographic signals such as age group or location.
Where Pith is reading between the lines
- Stylometric tools may require explicit Unicode normalization steps to remain effective against this class of obfuscation.
- An iterative arms race could develop between substitution techniques and improved detection or normalization methods.
- The approach might generalize to other character-based privacy protections in digital communication.
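On the defender's side, the explicit normalization step mentioned in the first bullet could look roughly like the sketch below. The `FOLD` table and `fold_confusables` function are hypothetical and cover only a handful of characters; a fuller table could be derived from the confusables data published with Unicode Technical Standard #39.

```python
# Hypothetical reverse map: Cyrillic look-alikes folded back to Latin.
# Only six characters are covered; this is an illustration, not a complete table.
FOLD = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u04bb": "h",  # CYRILLIC SMALL LETTER SHHA
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
})


def fold_confusables(text):
    """Map known look-alike characters back to their Latin counterparts."""
    return text.translate(FOLD)
```

A fold like this would also rewrite genuine Cyrillic text, so a real pipeline would presumably apply it only to text already identified as predominantly Latin.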
Load-bearing premise
Stylometric systems depend on character-level or Unicode-sensitive features that homoglyph substitution will reliably disrupt without being normalized away by standard preprocessing or creating new detectable signals.
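To illustrate why character-level features are the sensitive point, the sketch below compares the character trigram sets of a short string before and after replacing every 'h' with U+04BB. The helper names and the sample sentence are assumptions chosen for illustration; real stylometric systems use richer feature sets than a single Jaccard score.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


original = "the other thing"
altered = original.replace("h", "\u04bb")  # swap Latin h for Cyrillic shha

overlap = jaccard(char_ngrams(original), char_ngrams(altered))
print(f"trigram overlap: {overlap:.3f}")
```

Every trigram that contains the substituted letter stops matching, so the overlap falls well below 1 even though the two strings look nearly the same on screen. Whether that drop translates into degraded classifier accuracy is exactly what the premise above leaves open.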
What would settle it
An experiment in which stylometric accuracy on the modified text remains statistically unchanged from the original, or in which routine Unicode normalization restores full performance.
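The normalization half of that experiment can be probed in a few lines with Python's standard `unicodedata` module. The sketch below checks whether a single character survives a given normalization form; the `survives` helper is a hypothetical name, and the full-width 'h' is included only as a contrast case.

```python
import unicodedata

cyrillic_shha = "\u04bb"  # CYRILLIC SMALL LETTER SHHA, the example from the abstract
fullwidth_h = "\uff48"    # FULLWIDTH LATIN SMALL LETTER H, a compatibility variant


def survives(ch, form):
    """True if normalizing `ch` under `form` leaves it unchanged."""
    return unicodedata.normalize(form, ch) == ch


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, survives(cyrillic_shha, form), survives(fullwidth_h, form))
```

In Python's implementation, U+04BB has no decomposition mapping and is left unchanged by all four forms, while the full-width letter is folded to ASCII 'h' under NFKC and NFKD. This suggests that standard normalization alone may not undo cross-script substitutions, and that a confusable-aware mapping would be needed instead.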
Original abstract
In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" [U+0068] → "һ" [U+04BB])--on text can degrade stylometric systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes homoglyphic substitution, i.e., replacing Latin characters with visually similar glyphs from other Unicode blocks (e.g., U+0068 'h' with U+04BB 'һ'), as an adversarial stylometry technique that degrades the author-attribution and profiling performance of stylometric systems.
Significance. If empirically validated, the approach could supply a lightweight, accessible method for textual privacy protection against stylometric inference of attributes such as age or location. It extends prior adversarial stylometry work but currently offers only a descriptive claim without demonstrated effectiveness or robustness.
major comments (2)
- Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.
- Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.
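The detectability concern raised in the first comment (elevated non-Latin script frequencies) can be illustrated with a short sketch. It uses the first word of each character's Unicode name as a rough stand-in for its script, which is a heuristic assumed here for illustration and not a proper script-property lookup; both function names are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Rough script label: first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def script_counts(text):
    """Count alphabetic characters per (heuristic) script label."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            label = script_of(ch)
            counts[label] = counts.get(label, 0) + 1
    return counts


def mixed_script_words(text):
    """Return whitespace-separated words whose letters span several scripts."""
    return [
        word for word in text.split()
        if len({script_of(c) for c in word if c.isalpha()}) > 1
    ]
```

A word such as `"t\u04bbe"` mixes Latin and Cyrillic letters and would be flagged by `mixed_script_words`. A check of this kind is one reason the substitution might itself become a detectable signal.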
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important gaps in empirical support and robustness analysis. We agree that the current manuscript is primarily conceptual and will incorporate experiments, implementation details, and preprocessing evaluations in the revised version to strengthen the claims.
Point-by-point responses
- Referee: Abstract: the central claim that homoglyph substitution degrades stylometric performance is unsupported by any experimental results, datasets, evaluation metrics, or implementation details; the manuscript provides no evidence that the substitution reliably disrupts feature extractors or avoids introducing new detectable signals such as elevated non-Latin script frequencies.
  Authors: We acknowledge that the present version offers a descriptive proposal without quantitative validation. The manuscript introduces homoglyphic substitution as an adversarial stylometry technique but does not report experiments, datasets, or metrics. In revision, we will add empirical evaluations on standard stylometric corpora (e.g., using author attribution accuracy and attribute inference F1 scores), detail the substitution algorithm and parameters, and explicitly test for introduced signals such as non-Latin character frequency distributions to demonstrate that the method does not create easily detectable artifacts.
  Revision: yes
- Referee: Abstract: the argument assumes stylometric systems operate on raw Unicode codepoints without normalization (NFKC/NFD), script detection, or tokenization that collapses visually identical glyphs; no analysis or test is presented to show the substitution survives these standard preprocessing steps.
  Authors: This is a fair and substantive critique. The current text does not examine how homoglyph substitution interacts with common text normalization pipelines. We will revise the manuscript to include a dedicated analysis section that evaluates survival rates under NFKC/NFD normalization, script detection heuristics, and various tokenizers (e.g., word-level, subword, and Unicode-aware). Where the substitution is neutralized, we will discuss mitigation strategies or clearly delineate the threat model under which the technique remains effective.
  Revision: yes
Circularity Check
No circularity: purely descriptive claim with no derivations or fitted elements
Full rationale
The paper presents an exploratory idea that homoglyph substitution can degrade stylometric systems. No equations, parameters, predictions, or derivation chains appear in the provided text. The abstract and description frame the work as an investigation rather than a mathematical result derived from prior self-referential steps. None of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.) apply, as there are no load-bearing logical reductions to inspect.