Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution
Pith reviewed 2026-05-18 21:55 UTC · model grok-4.3
The pith
Unicode steganography can enhance adversarial stylometry and undermine authorship attribution in public messages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unicode steganography supplies a set of lightweight, invisible modifications that strengthen adversarial stylometry and thereby reduce the reliability of authorship attribution performed on openly posted text.
What carries the argument
Unicode steganography applied to adversarial stylometry, which inserts or substitutes characters that preserve readability while disrupting statistical stylistic features.
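The mechanism can be illustrated with a minimal Python sketch that places a zero-width space (U+200B) inside words. The function names `insert_zero_width` and `visible`, the mid-word position, and the every-word policy are assumptions made for illustration; they are not taken from the paper.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: occupies a code point but renders with no width


def insert_zero_width(text: str, every: int = 1) -> str:
    """Insert ZWSP in the middle of every `every`-th word (hypothetical policy)."""
    out = []
    for i, word in enumerate(text.split(" ")):
        if len(word) > 1 and i % every == 0:
            mid = len(word) // 2
            word = word[:mid] + ZWSP + word[mid:]
        out.append(word)
    return " ".join(out)


def visible(text: str) -> str:
    """Remove the inserted character to recover what a reader would see."""
    return text.replace(ZWSP, "")


original = "the quick brown fox"
altered = insert_zero_width(original)

# The rendered text is unchanged, but the underlying tokens differ,
# so a tokenizer that splits on spaces no longer sees the word "the".
print(visible(altered) == original)
print("the" in altered.split(" "))
```

Whether this actually disrupts a given attribution pipeline depends on how that pipeline tokenizes and normalizes its input, which is the open question the review raises.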
Load-bearing premise
The steganographic modifications introduced via Unicode do not create new detectable artifacts or stylistic signals that advanced attribution methods could exploit.
What would settle it
An authorship classifier that continues to identify the original author at high accuracy even after the message has been processed with the described Unicode steganographic alterations.
Original abstract
When using a public communication channel--whether formal or informal, such as commenting or posting on social media--end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence--using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting--one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message--necessarily open for public consumption--exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper dissects stylometric analysis as a threat to anonymity in public communications, reviews adversarial stylometry countermeasures, and proposes enhancements via Unicode steganography (e.g., zero-width character insertions and code-point substitutions) to undermine authorship attribution while preserving readability and style.
Significance. If the central claim holds, the work would demonstrate a practical, low-overhead method for evading stylometric attribution in open channels, with potential implications for online privacy tools. The manuscript earns credit for framing the problem clearly and for identifying Unicode as an under-explored vector, but the absence of any empirical validation or counter-evaluation against modern feature sets limits its contribution.
major comments (2)
- [§3 (Unicode Steganography Enhancements)] The manuscript provides no experimental section or results evaluating whether the proposed Unicode modifications remain invisible to attribution pipelines that incorporate code-point histograms, Unicode normalization, or anomaly detection on non-ASCII ranges; this directly undermines the claim that the steganographic enhancements evade detection.
- [§2 (Adversarial Stylometry)] The discussion of adversarial stylometry in §2 assumes that preserving surface-level n-grams and function words is sufficient to defeat attribution, yet offers no analysis or test showing that the introduced Unicode artifacts do not create new, higher-order signals exploitable by current ML classifiers.
minor comments (2)
- [§3] Notation for the steganographic transformations (e.g., how zero-width characters are inserted) could be formalized with a short pseudocode listing or table of example substitutions.
- [Abstract / §1] The abstract and introduction would benefit from a brief statement of the threat model (e.g., whether the adversary has access to the raw Unicode stream or only normalized text).
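The referee's concern about code-point histograms and the raw-versus-normalized threat model can be made concrete with a small sketch. It assumes an adversary who sees the raw Unicode stream; the helper names (`category_histogram`, `format_char_count`, `non_ascii_ratio`) are hypothetical and only use Python's standard `unicodedata` module.

```python
import unicodedata
from collections import Counter


def category_histogram(text: str) -> Counter:
    """Count characters per Unicode general category (e.g. 'Ll', 'Zs', 'Cf')."""
    return Counter(unicodedata.category(ch) for ch in text)


def format_char_count(text: str) -> int:
    """Count format characters (category 'Cf'), which include zero-width characters."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Cf")


def non_ascii_ratio(text: str) -> float:
    """Fraction of characters outside the ASCII range."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)


clean = "the quick brown fox"
altered = "t\u200bhe qu\u200bick"  # two zero-width spaces inserted

for label, sample in (("clean", clean), ("altered", altered)):
    print(label, format_char_count(sample), round(non_ascii_ratio(sample), 3))
```

In ordinary English prose both features are typically at or near zero, so even a simple threshold on them would flag the altered sample. This is the kind of new artifact the load-bearing premise assumes away.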
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments correctly identify the conceptual nature of our contribution and the resulting limitations on empirical claims. We address each major comment below and outline the revisions we will make.
Point-by-point responses
-
Referee: [§3 (Unicode Steganography Enhancements)] The manuscript provides no experimental section or results evaluating whether the proposed Unicode modifications remain invisible to attribution pipelines that incorporate code-point histograms, Unicode normalization, or anomaly detection on non-ASCII ranges; this directly undermines the claim that the steganographic enhancements evade detection.
Authors: We agree that the lack of experimental results weakens the evasion claim. The manuscript is framed as a proposal that identifies Unicode steganography (zero-width insertions and code-point substitutions) as an under-explored enhancement to adversarial stylometry. Our argument rests on the observation that many stylometric pipelines operate on normalized or tokenized text and do not explicitly inspect non-ASCII ranges or code-point distributions. We acknowledge, however, that this remains an untested hypothesis. In the revised manuscript we will add a new subsection under §3 that discusses likely detection vectors (code-point histograms, normalization, and anomaly detection) and explains why the proposed modifications may still retain practical utility against standard pipelines. We will also revise the abstract and conclusion to replace absolute claims of evasion with statements that the techniques merit empirical investigation.
Revision: yes
-
Referee: [§2 (Adversarial Stylometry)] The discussion of adversarial stylometry in §2 assumes that preserving surface-level n-grams and function words is sufficient to defeat attribution, yet offers no analysis or test showing that the introduced Unicode artifacts do not create new, higher-order signals exploitable by current ML classifiers.
Authors: Section 2 reviews existing adversarial stylometry methods whose goal is to retain core linguistic markers (n-grams, function words) while altering other surface features. The Unicode layer we propose is intended to act at the encoding level without changing the visible characters or linguistic content, thereby avoiding direct interference with those markers. We therefore expected that standard feature sets would remain largely unaffected. We recognize that this expectation is unverified and that higher-order signals (e.g., Unicode-range frequency anomalies or embedding-space artifacts) could be learned by modern classifiers. In revision we will expand §2 with a paragraph analyzing this risk, citing relevant work on Unicode anomaly detection, and explicitly noting that our proposal assumes conventional stylometric pipelines rather than adversarial ML detectors.
Revision: yes
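Both responses turn on what a "normalized" pipeline actually removes. The sketch below, which assumes Python's standard `unicodedata` module and uses hypothetical helper names, shows that NFKC normalization folds compatibility characters such as fullwidth letters but leaves zero-width spaces and Cyrillic look-alikes in place, so neutralizing them requires an explicit filtering step.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


def strip_format_chars(text: str) -> str:
    """Drop every character in general category 'Cf' (format characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


fullwidth = "\uff41bc"    # FULLWIDTH LATIN SMALL LETTER A followed by "bc"
zero_width = "a\u200bbc"  # ZERO WIDTH SPACE inside "abc"
cyrillic = "\u0430bc"     # CYRILLIC SMALL LETTER A followed by "bc"

print(nfkc(fullwidth) == "abc")                 # compatibility form is folded
print(nfkc(zero_width) == "abc")                # zero-width space survives NFKC
print(strip_format_chars(zero_width) == "abc")  # explicit filter removes it
print(nfkc(cyrillic) == "abc")                  # look-alike letter survives NFKC
```

Stripping all `Cf` characters is a blunt choice: the same category contains the zero-width joiner used in emoji sequences and in some scripts, so a production filter would likely need a narrower allow-list.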
Circularity Check
No circularity: paper offers conceptual proposal without equations or self-referential derivations
Full rationale
The manuscript contains no equations, fitted parameters, or derivation chain. It describes stylometry, adversarial stylometry, and a proposed Unicode steganography enhancement at a high level. No load-bearing step reduces to a self-citation, ansatz, or input by construction. The central claim is a technique suggestion whose validity rests on external empirical testing rather than internal reduction. This matches the default expectation of a non-circular paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Unicode steganography with Zero-Width Characters … Zero-Width Space [U+200B] … encode bits … hidden data can encode metadata
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
adversarial stylometry … imitation, translation, obfuscation … Burrows’ Delta
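The first quoted passage above mentions encoding bits with zero-width characters. A minimal sketch of that idea follows; the bit mapping (U+200B for 0, U+200C for 1), the placement after the first visible character, and the function names are assumptions made for illustration and are not the paper's appendix code.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, assumed to stand for bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, assumed to stand for bit 1


def encode(payload: bytes) -> str:
    """Turn each byte into eight zero-width characters, most significant bit first."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def decode(text: str) -> bytes:
    """Collect the zero-width characters in `text` and rebuild the bytes."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def embed(cover: str, payload: bytes) -> str:
    """Hide the encoded payload after the first visible character of the cover text."""
    return cover[:1] + encode(payload) + cover[1:]


stego = embed("hello", b"hi")
print(decode(stego))
print(len("hello"), len(stego))
```

Each payload byte costs eight invisible characters, so the length gap between the visible and raw text grows quickly and is itself a measurable signal.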
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution
Homoglyph substitution on text degrades stylometric systems to hide author signatures and personal information.
-
StegoStylo: Squelching Stylometric Scrutiny through Steganographic Stitching
StegoStylo achieves authorship obfuscation by steganographically altering 33% or more of words with zero-width characters, confounding stylometric systems.
-
Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits
TraceTarnish attack identifies stylometric features like function-word frequencies and type-token ratio that both strengthen authorship anonymization and serve as indicators of compromise when pre- and post-transforma...
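The citing papers refer to homoglyph substitution and to features such as function-word frequencies and type-token ratio. The sketch below shows how the two interact under a naive whitespace tokenizer. The three-letter substitution table, the tiny function-word set, and the helper names are illustrative assumptions, not the methods of the cited papers.

```python
# Latin letters mapped to visually similar Cyrillic letters (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

FUNCTION_WORDS = {"the", "of", "and", "to"}  # deliberately tiny example set


def substitute(text: str) -> str:
    """Replace every mapped Latin letter with its Cyrillic look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def function_word_count(text: str) -> int:
    """Count tokens that exactly match an ASCII function word."""
    return sum(1 for tok in text.lower().split() if tok in FUNCTION_WORDS)


def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


original = "the cat and the dog"
altered = substitute(original)

print(function_word_count(original), function_word_count(altered))
print(type_token_ratio(original), type_token_ratio(altered))
```

Because this substitution is deterministic, exact-match function-word counts collapse while the type-token ratio is unchanged; a randomized substitution would be needed to perturb the latter.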