Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

Charles Westphal; Fernando E. Rosas; Keivan Navaie; Tiago Pimentel; Timothy Douglas

arxiv: 2606.09411 · v1 · pith:VONNQFWYnew · submitted 2026-06-08 · 💻 cs.CR · cs.IT· cs.LG· math.IT

Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

Charles Westphal , Timothy Douglas , Keivan Navaie , Tiago Pimentel , Fernando E. Rosas This is my paper

Pith reviewed 2026-06-27 16:08 UTC · model grok-4.3

classification 💻 cs.CR cs.ITcs.LGmath.IT

keywords steganographyLLM detectionadversarial fine-tuningactivation probesevasionrecontextualizationtrojan modelssecret recovery

0 comments

The pith

Adversarial fine-tuning creates evasive steganographic trojans in LLMs that hide secrets from activation probes, but a recontextualization dataset restores detectability by limiting residual freedoms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that linear and non-linear probes can be evaded by fine-tuning models to encode secrets in ways that preserve recoverability only through higher-order interactions. It shows this evasion holds across five base models while keeping secret recovery at 58-79 percent and capability loss low. A targeted dataset then restricts the extra degrees of freedom in the model's representations, restoring both ridge and MLP probe performance on the evasive trojans. The work matters because output-level checks fail against fluent steganography, so internal activation detection is a key remaining defense; showing both its vulnerability and a fix clarifies how to evaluate it reliably.

Core claim

Adversarial fine-tuning produces steganographic models that retain exact-match secret recovery while evading both ridge and held-out MLP probes; this occurs because the payload reduces its low-order extractability from content-aligned representations and instead interacts synergistically with residual degrees of freedom. A recontextualization dataset that restricts those residuals restores detectability for all five trojans without large changes to original task behavior.

What carries the argument

recontextualization dataset that restricts residual degrees of freedom in the model's internal representations

If this is right

Evasive trojans maintain 58-79 percent exact-match secret recovery with only 1-8 percent average degradation on six benchmarks.
Successful evasion lowers low-order extractability of the secret from content-aligned representations.
Theory-guided evaluation distributions can expose payloads that standard test sets miss.
Both linear ridge and non-linear MLP probes regain effectiveness once residuals are restricted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detection methods may need periodic updates to new recontextualization sets as evasion techniques evolve.
The same residual-restriction idea could apply to other hidden-state attacks such as backdoors or prompt injections.
If the dataset works because it narrows output variability, similar narrowing might help in non-steganographic safety evaluations.

Load-bearing premise

The recontextualization dataset restricts residual degrees of freedom in a way that exposes the payload without introducing new confounding factors or substantially changing model behavior on the original tasks.

What would settle it

Apply the recontextualization dataset to one of the evasive trojans and measure whether ridge or MLP probe accuracy on secret recovery drops back to near-chance levels on held-out examples.

Figures

Figures reproduced from arXiv: 2606.09411 by Charles Westphal, Fernando E. Rosas, Keivan Navaie, Tiago Pimentel, Timothy Douglas.

**Figure 1.** Figure 1: Schematic of the steganographic-trojan threat model. A fine-tuned LLM encodes promptborne secrets into seemingly benign outputs by partitioning embedding space via a hyperplane and choosing tokens according to the side they land on. Threat Model. Consider the following scenario: a financial institution that frequently handles sensitive data is reluctant to route queries through third-party APIs, impeding… view at source ↗

**Figure 2.** Figure 2: Per-layer probe cosine similarity qℓ for the Chapter I trojans (red), the Chapter II evasive trojans (purple) and the base models (green), across five base architectures. Capabilities and exact match after Chapter II. The Chapter II rows of [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Recontextualization sweep. Per-layer cosine similarity between probe-predicted and ground-truth [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Probe detectability over the course of adversarial fine-tuning across five base models (Qwen3 8B, [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Effect of MLP probe hidden width on detectability of Chapter II evasive trojans, swept across base [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Secret-only sweep: same procedure as the recontextualization sweep of Section [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

read the original abstract

Large language models can be fine-tuned to encode prompt-borne secrets into fluent, seemingly benign outputs. This creates a steganographic exfiltration risk that is difficult to detect with output-level steganalysis. Recent work proposes mechanistic detection using linear probes that recover the secret from internal activations. We show that this defense can be systematically evaded, but that detectability can be recovered through a targeted data-level intervention. First, we extend the detection setup to include a non-linear MLP probe. We then adversarially fine-tune steganographic trojans across five base models: Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, and Phi-4-14B. The resulting models retain $58$--$79\%$ exact-match secret recovery while evading both ridge and held-out MLP probes, with $1$--$8\%$ average capability degradation across six benchmarks. We then give an information-theoretic characterization of this evasion. Successful evasion preserves recoverability while reducing low-order extractability of the secret from the content-aligned representation, forcing the payload into synergistic interaction with residual degrees of freedom. This motivates a recontextualization dataset that restricts these residual degrees of freedom. On this distribution, both ridge and MLP detectability are restored across all five evasive trojans. Overall, our findings show that activation-based steganography detection is vulnerable to adaptive evasion, but also that theory-guided evaluation distributions can expose otherwise hidden payloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows adversarial fine-tuning can evade both ridge and MLP probes on steganographic LLMs, but a recontextualization dataset restores detection across five models.

read the letter

The main thing to know is that activation probes for hidden payloads can be evaded by targeted fine-tuning, yet switching the evaluation distribution to one that limits residual degrees of freedom brings detectability back for ridge and MLP probes on all five tested models.

What the work actually does is extend the probe setup to non-linear MLPs, run adversarial fine-tuning on Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, and Phi-4-14B, and report that secret recovery stays between 58 and 79 percent while average benchmark degradation stays between 1 and 8 percent. The information-theoretic framing—that evasion reduces low-order extractability and pushes the payload into synergistic use of leftover capacity—is used to motivate the recontextualization dataset, and the results claim uniform recovery on that distribution.

The experiments look systematic and the numbers are reported clearly enough to be useful. The choice to test both linear and non-linear probes and to keep capability loss modest is a reasonable standard.

The softer part is the recontextualization step itself. The logic is that restricting residual degrees of freedom should expose the payload, but it is not obvious from the abstract whether the dataset construction avoids new confounds or unintended shifts in how the model represents the original tasks. If the recovery depends on how the dataset was sampled or filtered, that needs to be shown explicitly rather than assumed from the motivation.

This is for researchers working on LLM security and steganographic risks. It gives a concrete adaptive attack plus a theory-motivated counter that can be checked. The empirical grounding is strong enough that it deserves referee time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that steganographic trojans fine-tuned into five LLMs (Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, Phi-4-14B) can evade both ridge and MLP probes on internal activations while retaining 58–79% secret recovery and incurring only 1–8% capability degradation; an information-theoretically motivated recontextualization dataset then restores detectability for both probe types across all models by restricting residual degrees of freedom in the content-aligned representation.

Significance. If the empirical recovery holds under the reported conditions, the result supplies a concrete, theory-guided countermeasure to adaptive evasion of activation-based steganalysis. The multi-model, multi-benchmark evaluation and the explicit linkage between the information-theoretic characterization and the recontextualization distribution constitute a strength; the work therefore advances the practical security analysis of LLM steganography beyond purely empirical probe training.

major comments (2)

[information-theoretic characterization section] The information-theoretic characterization of evasion (preserving recoverability while reducing low-order extractability and forcing synergistic use of residual degrees of freedom) is presented as explanatory for the recontextualization design. It is unclear from the derivation whether the relevant quantities are computed from external benchmarks or are defined in terms of the evasion process itself; if the latter, the motivation risks circularity and the claim that the dataset 'restricts these residual degrees of freedom' requires an independent test.
[recontextualization experiments] The central empirical claim—that the recontextualization distribution restores both ridge and MLP detectability across all five evasive trojans—rests on the assumption that this distribution restricts residual degrees of freedom without introducing new confounding factors or substantially altering behavior on the original tasks. The manuscript must report (a) explicit metrics showing that capability degradation remains comparable to the evasion stage and (b) ablation results confirming that detectability gains are not artifacts of the new distribution alone.

minor comments (2)

[abstract and results tables] The abstract states that the models 'retain 58–79% exact-match secret recovery'; the corresponding table or figure should report per-model, per-trojan values together with standard deviations across random seeds.
[methods] Notation for the ridge and MLP probes should be introduced once and used consistently; the distinction between 'held-out MLP' and the adversarially trained MLP should be clarified in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and the positive assessment of the work's significance. We address the two major comments point by point below, agreeing to revisions that strengthen the clarity and empirical support of our claims.

read point-by-point responses

Referee: [information-theoretic characterization section] The information-theoretic characterization of evasion (preserving recoverability while reducing low-order extractability and forcing synergistic use of residual degrees of freedom) is presented as explanatory for the recontextualization design. It is unclear from the derivation whether the relevant quantities are computed from external benchmarks or are defined in terms of the evasion process itself; if the latter, the motivation risks circularity and the claim that the dataset 'restricts these residual degrees of freedom' requires an independent test.

Authors: The characterization draws on established information-theoretic principles regarding the decomposition of mutual information into low-order and synergistic components in high-dimensional representations, as in partial information decomposition frameworks. These are applied to the observed evasion behavior but are not defined circularly; rather, they provide a post-hoc explanation that directly motivates the design of the recontextualization distribution. To eliminate any ambiguity, we will revise the section to explicitly derive the relevant quantities from the general properties of content-aligned representations (using external benchmarks for baseline mutual information estimates) and add an independent test measuring the reduction in effective residual degrees of freedom via an auxiliary probe or entropy estimate on held-out data. revision: yes
Referee: [recontextualization experiments] The central empirical claim—that the recontextualization distribution restores both ridge and MLP detectability across all five evasive trojans—rests on the assumption that this distribution restricts residual degrees of freedom without introducing new confounding factors or substantially altering behavior on the original tasks. The manuscript must report (a) explicit metrics showing that capability degradation remains comparable to the evasion stage and (b) ablation results confirming that detectability gains are not artifacts of the new distribution alone.

Authors: We acknowledge that the current manuscript does not provide these explicit comparisons for the recontextualization stage. In the revised version, we will add (a) capability metrics on the six original benchmarks for models after recontextualization, demonstrating that additional degradation is minimal (under 3%) and comparable to the evasion stage. For (b), we will include ablation experiments training probes on the recontextualization distribution using non-steganographic models and comparing to the original distribution, as well as testing detectability on the original distribution after recontextualization to isolate the effect. These additions will confirm the restriction of residual degrees of freedom as the source of improved detectability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper's central chain consists of empirical adversarial fine-tuning of trojans on five models, followed by an information-theoretic description of observed evasion (reducing low-order extractability while preserving recoverability), and then an empirical test of a motivated recontextualization dataset that restores probe detectability. No equations or steps reduce by construction to fitted inputs; the characterization is post-hoc description of results rather than a self-definitional or fitted-input prediction. No load-bearing self-citations or uniqueness theorems appear in the abstract or described claims. The restoration result is reported as an external empirical outcome on held-out distributions, making the work falsifiable outside any internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the information-theoretic framing of residual degrees of freedom is treated as interpretive rather than axiomatic.

pith-pipeline@v0.9.1-grok · 5831 in / 1102 out tokens · 18093 ms · 2026-06-27T16:08:34.160706+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

81 extracted references · 7 canonical work pages

[1]

arXiv preprint arXiv:1004.2515 , year=

Nonnegative decomposition of multivariate information , author=. arXiv preprint arXiv:1004.2515 , year=

Pith/arXiv arXiv
[2]

Psychometrika , volume=

Multivariate information transmission , author=. Psychometrika , volume=. 1954 , publisher=

1954
[3]

arXiv preprint arXiv:1610.01644 , year=

Understanding intermediate layers using linear classifier probes , author=. arXiv preprint arXiv:1610.01644 , year=

Pith/arXiv arXiv
[4]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[5]

Information-

Pimentel, Tiago and Valvoda, Josef and Maudslay, Rowan Hall and Zmigrod, Ran and Williams, Adina and Cotterell, Ryan. Information-Theoretic Probing for Linguistic Structure. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.420

work page doi:10.18653/v1/2020.acl-main.420 2020
[6]

2024 , howpublished=

Simple probes can catch sleeper agents , author=. 2024 , howpublished=

2024
[7]

arXiv preprint arXiv:2108.07258 , year=

On the Opportunities and Risks of Foundation Models , author=. arXiv preprint arXiv:2108.07258 , year=

Pith/arXiv arXiv
[8]

Proceedings of the 2008 International Conference on MultiMedia and Information Technology (MMIT 2008) , pages =

Linguistic steganography detection based on perplexity , author =. Proceedings of the 2008 International Conference on MultiMedia and Information Technology (MMIT 2008) , pages =. 2008 , doi =

2008
[9]

Obfuscated Activations Bypass

Bailey, Luke and Serrano, Alex and Sheshadri, Abhay and Seleznyov, Mikhail and Taylor, Jordan and Jenner, Erik and Hilton, Jacob and Casper, Stephen and Guestrin, Carlos and Emmons, Scott , booktitle =. Obfuscated Activations Bypass. 2026 , url =

2026
[10]

International Workshop on Information Hiding , pages=

An information-theoretic model for steganography , author=. International Workshop on Information Hiding , pages=. 1998 , publisher=

1998
[11]

Advances in Neural Information Processing Systems , volume=

Optimal Brain Damage , author=. Advances in Neural Information Processing Systems , volume=
[12]

IEEE International Conference on Neural Networks , year=

Optimal Brain Surgeon and General Network Pruning , author=. IEEE International Conference on Neural Networks , year=
[13]

Journal of Machine Learning Research , volume=

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability , author=. Journal of Machine Learning Research , volume=
[14]

2009 , publisher=

Steganography in Digital Media: Principles, Algorithms, and Applications , author=. 2009 , publisher=

2009
[15]

R eal T oxicity P rompts: Evaluating Neural Toxic Degeneration in Language Models

Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A. , booktitle =. 2020 , publisher =. doi:10.18653/v1/2020.findings-emnlp.301 , url =

work page doi:10.18653/v1/2020.findings-emnlp.301 2020
[16]

Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan

Discovering Language Model Behaviors with Model-Written Evaluations , author =. Findings of the Association for Computational Linguistics: ACL 2023 , pages =. 2023 , publisher =. doi:10.18653/v1/2023.findings-acl.847 , url =

work page doi:10.18653/v1/2023.findings-acl.847 2023
[17]

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton and Jermyn, Adam and Askell, Amanda and Radhakrishnan, Ansh and Anil, Cem and Duvenaud, David and Ganguli, Deep and Barez, Fazl and Clark, Jack and Ndousse, Kamal and Sachan, Ks...

2024
[18]

Gupta, Rohan and Jenner, Erik , journal=
[19]

Sleeper agents: Training deceptive

Hubinger, Evan and Denison, Carson and Mu, Jesse and others , journal=. Sleeper agents: Training deceptive
[20]

Hidden in plain text: Emergence & mitigation of steganographic collusion in

Mathew, Yohan and Matthews, Ollie and McCarthy, Robert and Velja, Joan and Schroeder de Witt, Christian and Cope, Dylan and Schoots, Nandi , journal=. Hidden in plain text: Emergence & mitigation of steganographic collusion in
[21]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Meier, Dominik and Wahle, Jan Philip and R. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=. 2025 , address=

2025
[22]

Latent adversarial training improves robustness to persistent harmful behaviors in

Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and others , journal=. Latent adversarial training improves robustness to persistent harmful behaviors in
[23]

arXiv preprint arXiv:2601.22818 , year=

Hide and seek in embedding space: Geometry-based steganography and detection in large language models , author=. arXiv preprint arXiv:2601.22818 , year=

arXiv
[24]

Early signs of steganographic capabilities in frontier

Zolkowski, Artur and others , journal=. Early signs of steganographic capabilities in frontier
[25]

2006 , publisher =

Elements of Information Theory , author =. 2006 , publisher =

2006
[26]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =
[27]

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle =
[28]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =. 2110.14168 , archivePrefix =

Pith/arXiv arXiv
[29]

and Li, Zhenguo and Weller, Adrian and Liu, Weiyang , booktitle =

Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T. and Li, Zhenguo and Weller, Adrian and Liu, Weiyang , booktitle =
[30]

Technometrics , volume=

Ridge regression: Biased estimation for nonorthogonal problems , author=. Technometrics , volume=
[31]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Detecting Strategic Deception with Linear Probes , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

2025
[32]

arXiv preprint arXiv:2212.03827 , year=

Discovering latent knowledge in language models without supervision , author=. arXiv preprint arXiv:2212.03827 , year=

Pith/arXiv arXiv
[33]

The internal state of an

Azaria, Amos and Mitchell, Tom , journal=. The internal state of an
[34]

Fundamentals of data hiding security and their application to spread-spectrum analysis , year =

Comesa\. Fundamentals of data hiding security and their application to spread-spectrum analysis , year =. Proceedings of the 7th International Conference on Information Hiding , pages =. doi:10.1007/11558859_12 , abstract =

work page doi:10.1007/11558859_12
[35]

Neural linguistic steganography , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

2019
[36]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19) , pages=

Towards near-imperceptible steganographic text , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19) , pages=
[37]

2020 , publisher =

Yang, Zhongliang and Wei, Nan and Liu, Qinghe and Huang, Yongfeng and Zhang, Yujin , booktitle =. 2020 , publisher =

2020
[38]

25th Annual Network And Distributed System Security Symposium (NDSS'18) , year=

Trojaning attack on neural networks , author=. 25th Annual Network And Distributed System Security Symposium (NDSS'18) , year=
[39]

Invisible safety threat: Malicious finetuning for

Wan, Guangnian and Ma, Xinyin and Fang, Gongfan and Wang, Xinchao , journal=. Invisible safety threat: Malicious finetuning for
[40]

arXiv preprint arXiv:2307.14692 , year=

Backdoor Attacks for In-Context Learning with Language Models , author=. arXiv preprint arXiv:2307.14692 , year=

arXiv
[41]

arXiv preprint arXiv:2311.14455 , year=

Universal Jailbreak Backdoors from Poisoned Human Feedback , author=. arXiv preprint arXiv:2311.14455 , year=

arXiv
[42]

arXiv preprint arXiv:1703.00810 , year =

Shwartz-Ziv, Ravid and Tishby, Naftali , title =. arXiv preprint arXiv:1703.00810 , year =

Pith/arXiv arXiv
[43]

Fano , title =

Robert M. Fano , title =. Scholarpedia , volume =
[44]

2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pages =

Disentangled Representations via Synergy Minimization , author =. 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pages =. 2017 , doi =. 1710.03839 , archivePrefix =

Pith/arXiv arXiv 2017
[45]

Distill , volume =

Chris Olah and Nick Cammarata and Ludwig Schubert and Gabriel Goh and Michael Petrov and Shan Carter , title =. Distill , volume =. 2020 , doi =

2020
[46]

arXiv preprint arXiv:2209.10652 , year =

Nelson Elhage and Tristan Hume and Catherine Olsson and Nicholas Schiefer and Tom Henighan and Shauna Kravec and Zac Hatfield-Dodds and Robert Lasenby and Dawn Drain and Carol Chen and Roger Grosse and Sam McCandlish and Jared Kaplan and Dario Amodei and Martin Wattenberg and Christopher Olah , title =. arXiv preprint arXiv:2209.10652 , year =

Pith/arXiv arXiv
[47]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , month = nov, year =

Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , month = nov, year =. doi:10.18653/v1/2025.emnlp-main.361 , isbn =

work page doi:10.18653/v1/2025.emnlp-main.361 2025
[48]

International Conference on Learning Representations , year =

Language Model Inversion , author =. International Conference on Learning Representations , year =
[49]

Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =

Gao, Lirong and Peng, Ru and Zhang, Yiming and Zhao, Junbo , editor =. Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =. doi:10.18653/v1/2024.findings-acl.631 , pages =

work page doi:10.18653/v1/2024.findings-acl.631 2024
[50]

2025 , eprint =

Qwen3 Technical Report , author =. 2025 , eprint =

2025
[51]

Grattafiori, Aaron and others , year =. The. 2407.21783 , archivePrefix =

Pith/arXiv arXiv
[52]

2024 , url =

Un Ministral, des Ministraux , author =. 2024 , url =

2024
[53]

2024 , eprint =

Phi-4 Technical Report , author =. 2024 , eprint =

2024
[54]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
[55]

arXiv preprint arXiv:1803.05457 , year=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv
[56]

2019 , publisher=

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle=. 2019 , publisher=

2019
[57]

2022 , publisher=

Lin, Stephanie and Hilton, Jacob and Evans, Owain , booktitle=. 2022 , publisher=

2022
[58]

2020 , doi=

Sakaguchi, Keisuke and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin , booktitle=. 2020 , doi=

2020
[59]

arXiv preprint arXiv:1909.08593 , year=

Fine-Tuning Language Models from Human Preferences , author=. arXiv preprint arXiv:1909.08593 , year=

Pith/arXiv arXiv 1909
[60]

International Conference on Learning Representations , year=

Perfectly Secure Steganography Using Minimum Entropy Coupling , author=. International Conference on Learning Representations , year=
[61]

IEEE Symposium on Security and Privacy , year=

Discop: Provably Secure Steganography in Practice Based on Distribution Copies , author=. IEEE Symposium on Security and Privacy , year=
[62]

Proceedings of the 19th Conference of the

Huang, Yu-Shin and Just, Peter and Yin, Hanyun and Narayanan, Krishna and Huang, Ruihong and Tian, Chao , editor =. Proceedings of the 19th Conference of the. 2026 , address =. doi:10.18653/v1/2026.eacl-long.36 , isbn =

work page doi:10.18653/v1/2026.eacl-long.36 2026
[63]

and Haghtalab, Nika and Steinhardt, Jacob , booktitle=

Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony T. and Haghtalab, Nika and Steinhardt, Jacob , booktitle=. Covert Malicious Finetuning: Challenges in Safeguarding
[64]

Towards Safeguarding

Youstra, Jack and Mahfoud, Mohammed and Yan, Yang and Sleight, Henry and Perez, Ethan and Sharma, Mrinank , year=. Towards Safeguarding. 2508.17158 , archivePrefix=

arXiv
[65]

2025 , eprint=

Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought under Process Supervision , author=. 2025 , eprint=

2025
[66]

Fundamental Limitations in Defending

Davies, Xander and Winsor, Eric and Korbak, Tomek and Souly, Alexandra and Kirk, Robert and Schroeder de Witt, Christian and Gal, Yarin , year=. Fundamental Limitations in Defending. 2502.14828 , archivePrefix=

arXiv
[67]

2024 , month = apr, howpublished =

Simple Probes Can Catch Sleeper Agents , author =. 2024 , month = apr, howpublished =

2024
[68]

2025 , eprint=

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors , author=. 2025 , eprint=

2025
[69]

Proceedings of the 39th International Conference on Machine Learning , pages=

Linear Adversarial Concept Erasure , author=. Proceedings of the 39th International Conference on Machine Learning , pages=. 2022 , publisher=

2022
[70]

2023 , url=

Belrose, Nora and Schneider-Joseph, David and Ravfogel, Shauli and Cotterell, Ryan and Raff, Edward and Biderman, Stella , booktitle=. 2023 , url=

2023
[71]

International Conference on Learning Representations , year=

A Theory of Usable Information under Computational Constraints , author=. International Conference on Learning Representations , year=
[72]

and Fenton, James L

Grassi, Paul A. and Fenton, James L. and Newton, Elaine M. and Perlner, Ray A. and Regenscheid, Andrew R. and Burr, William E. and Richer, Justin P. and Lefkovitz, Naomi B. and Danker, Jamie M. and Choong, Yee-Yin and Greene, Kristen K. and Theofanos, Mary F. , institution=. 2017 , doi=

2017
[73]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =

Log-linear Guardedness and its Implications , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =
[74]

Bell System Technical Journal , volume=

Communication Theory of Secrecy Systems , author=. Bell System Technical Journal , volume=
[75]

Webster, A. F. and Tavares, S. E. , booktitle=. On the Design of. 1986 , publisher=

1986
[76]

Journal of Computer and System Sciences , volume=

Universal Classes of Hash Functions , author=. Journal of Computer and System Sciences , volume=
[77]

Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing , pages=

Pseudo-random Generation from One-way Functions , author=. Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing , pages=
[78]

Advances in Cryptology --

Leftover Hash Lemma, Revisited , author=. Advances in Cryptology --. 2011 , publisher=

2011
[79]

Digital Watermarking: 4th International Workshop, IWDW 2005 , series=

Information Transmission and Steganography , author=. Digital Watermarking: 4th International Workshop, IWDW 2005 , series=. 2005 , publisher=

2005
[80]

Recent Advances in Intrusion Detection , series=

Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection , author=. Recent Advances in Intrusion Detection , series=. 2004 , publisher=

2004

Showing first 80 references.

[1] [1]

arXiv preprint arXiv:1004.2515 , year=

Nonnegative decomposition of multivariate information , author=. arXiv preprint arXiv:1004.2515 , year=

Pith/arXiv arXiv

[2] [2]

Psychometrika , volume=

Multivariate information transmission , author=. Psychometrika , volume=. 1954 , publisher=

1954

[3] [3]

arXiv preprint arXiv:1610.01644 , year=

Understanding intermediate layers using linear classifier probes , author=. arXiv preprint arXiv:1610.01644 , year=

Pith/arXiv arXiv

[4] [4]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[5] [5]

Information-

Pimentel, Tiago and Valvoda, Josef and Maudslay, Rowan Hall and Zmigrod, Ran and Williams, Adina and Cotterell, Ryan. Information-Theoretic Probing for Linguistic Structure. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.420

work page doi:10.18653/v1/2020.acl-main.420 2020

[6] [6]

2024 , howpublished=

Simple probes can catch sleeper agents , author=. 2024 , howpublished=

2024

[7] [7]

arXiv preprint arXiv:2108.07258 , year=

On the Opportunities and Risks of Foundation Models , author=. arXiv preprint arXiv:2108.07258 , year=

Pith/arXiv arXiv

[8] [8]

Proceedings of the 2008 International Conference on MultiMedia and Information Technology (MMIT 2008) , pages =

Linguistic steganography detection based on perplexity , author =. Proceedings of the 2008 International Conference on MultiMedia and Information Technology (MMIT 2008) , pages =. 2008 , doi =

2008

[9] [9]

Obfuscated Activations Bypass

Bailey, Luke and Serrano, Alex and Sheshadri, Abhay and Seleznyov, Mikhail and Taylor, Jordan and Jenner, Erik and Hilton, Jacob and Casper, Stephen and Guestrin, Carlos and Emmons, Scott , booktitle =. Obfuscated Activations Bypass. 2026 , url =

2026

[10] [10]

International Workshop on Information Hiding , pages=

An information-theoretic model for steganography , author=. International Workshop on Information Hiding , pages=. 1998 , publisher=

1998

[11] [11]

Advances in Neural Information Processing Systems , volume=

Optimal Brain Damage , author=. Advances in Neural Information Processing Systems , volume=

[12] [12]

IEEE International Conference on Neural Networks , year=

Optimal Brain Surgeon and General Network Pruning , author=. IEEE International Conference on Neural Networks , year=

[13] [13]

Journal of Machine Learning Research , volume=

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability , author=. Journal of Machine Learning Research , volume=

[14] [14]

2009 , publisher=

Steganography in Digital Media: Principles, Algorithms, and Applications , author=. 2009 , publisher=

2009

[15] [15]

R eal T oxicity P rompts: Evaluating Neural Toxic Degeneration in Language Models

Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A. , booktitle =. 2020 , publisher =. doi:10.18653/v1/2020.findings-emnlp.301 , url =

work page doi:10.18653/v1/2020.findings-emnlp.301 2020

[16] [16]

Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan

Discovering Language Model Behaviors with Model-Written Evaluations , author =. Findings of the Association for Computational Linguistics: ACL 2023 , pages =. 2023 , publisher =. doi:10.18653/v1/2023.findings-acl.847 , url =

work page doi:10.18653/v1/2023.findings-acl.847 2023

[17] [17]

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton and Jermyn, Adam and Askell, Amanda and Radhakrishnan, Ansh and Anil, Cem and Duvenaud, David and Ganguli, Deep and Barez, Fazl and Clark, Jack and Ndousse, Kamal and Sachan, Ks...

2024

[18] [18]

Gupta, Rohan and Jenner, Erik , journal=

[19] [19]

Sleeper agents: Training deceptive

Hubinger, Evan and Denison, Carson and Mu, Jesse and others , journal=. Sleeper agents: Training deceptive

[20] [20]

Hidden in plain text: Emergence & mitigation of steganographic collusion in

Mathew, Yohan and Matthews, Ollie and McCarthy, Robert and Velja, Joan and Schroeder de Witt, Christian and Cope, Dylan and Schoots, Nandi , journal=. Hidden in plain text: Emergence & mitigation of steganographic collusion in

[21] [21]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Meier, Dominik and Wahle, Jan Philip and R. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=. 2025 , address=

2025

[22] [22]

Latent adversarial training improves robustness to persistent harmful behaviors in

Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and others , journal=. Latent adversarial training improves robustness to persistent harmful behaviors in

[23] [23]

arXiv preprint arXiv:2601.22818 , year=

Hide and seek in embedding space: Geometry-based steganography and detection in large language models , author=. arXiv preprint arXiv:2601.22818 , year=

arXiv

[24] [24]

Early signs of steganographic capabilities in frontier

Zolkowski, Artur and others , journal=. Early signs of steganographic capabilities in frontier

[25] [25]

2006 , publisher =

Elements of Information Theory , author =. 2006 , publisher =

2006

[26] [26]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

[27] [27]

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle =

[28] [28]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =. 2110.14168 , archivePrefix =

Pith/arXiv arXiv

[29] [29]

and Li, Zhenguo and Weller, Adrian and Liu, Weiyang , booktitle =

Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T. and Li, Zhenguo and Weller, Adrian and Liu, Weiyang , booktitle =

[30] [30]

Technometrics , volume=

Ridge regression: Biased estimation for nonorthogonal problems , author=. Technometrics , volume=

[31] [31]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Detecting Strategic Deception with Linear Probes , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

2025

[32] [32]

arXiv preprint arXiv:2212.03827 , year=

Discovering latent knowledge in language models without supervision , author=. arXiv preprint arXiv:2212.03827 , year=

Pith/arXiv arXiv

[33] [33]

The internal state of an

Azaria, Amos and Mitchell, Tom , journal=. The internal state of an

[34] [34]

Fundamentals of data hiding security and their application to spread-spectrum analysis , year =

Comesa\. Fundamentals of data hiding security and their application to spread-spectrum analysis , year =. Proceedings of the 7th International Conference on Information Hiding , pages =. doi:10.1007/11558859_12 , abstract =

work page doi:10.1007/11558859_12

[35] [35]

Neural linguistic steganography , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

2019

[36] [36]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19) , pages=

Towards near-imperceptible steganographic text , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19) , pages=

[37] [37]

2020 , publisher =

Yang, Zhongliang and Wei, Nan and Liu, Qinghe and Huang, Yongfeng and Zhang, Yujin , booktitle =. 2020 , publisher =

2020

[38] [38]

25th Annual Network And Distributed System Security Symposium (NDSS'18) , year=

Trojaning attack on neural networks , author=. 25th Annual Network And Distributed System Security Symposium (NDSS'18) , year=

[39] [39]

Invisible safety threat: Malicious finetuning for

Wan, Guangnian and Ma, Xinyin and Fang, Gongfan and Wang, Xinchao , journal=. Invisible safety threat: Malicious finetuning for

[40] [40]

arXiv preprint arXiv:2307.14692 , year=

Backdoor Attacks for In-Context Learning with Language Models , author=. arXiv preprint arXiv:2307.14692 , year=

arXiv

[41] [41]

arXiv preprint arXiv:2311.14455 , year=

Universal Jailbreak Backdoors from Poisoned Human Feedback , author=. arXiv preprint arXiv:2311.14455 , year=

arXiv

[42] [42]

arXiv preprint arXiv:1703.00810 , year =

Shwartz-Ziv, Ravid and Tishby, Naftali , title =. arXiv preprint arXiv:1703.00810 , year =

Pith/arXiv arXiv

[43] [43]

Fano , title =

Robert M. Fano , title =. Scholarpedia , volume =

[44] [44]

2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pages =

Disentangled Representations via Synergy Minimization , author =. 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton) , pages =. 2017 , doi =. 1710.03839 , archivePrefix =

Pith/arXiv arXiv 2017

[45] [45]

Distill , volume =

Chris Olah and Nick Cammarata and Ludwig Schubert and Gabriel Goh and Michael Petrov and Shan Carter , title =. Distill , volume =. 2020 , doi =

2020

[46] [46]

arXiv preprint arXiv:2209.10652 , year =

Nelson Elhage and Tristan Hume and Catherine Olsson and Nicholas Schiefer and Tom Henighan and Shauna Kravec and Zac Hatfield-Dodds and Robert Lasenby and Dawn Drain and Carol Chen and Roger Grosse and Sam McCandlish and Jared Kaplan and Dario Amodei and Martin Wattenberg and Christopher Olah , title =. arXiv preprint arXiv:2209.10652 , year =

Pith/arXiv arXiv

[47] [47]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , month = nov, year =

Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , month = nov, year =. doi:10.18653/v1/2025.emnlp-main.361 , isbn =

work page doi:10.18653/v1/2025.emnlp-main.361 2025

[48] [48]

International Conference on Learning Representations , year =

Language Model Inversion , author =. International Conference on Learning Representations , year =

[49] [49]

Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =

Gao, Lirong and Peng, Ru and Zhang, Yiming and Zhao, Junbo , editor =. Findings of the Association for Computational Linguistics: ACL 2024 , month = aug, year =. doi:10.18653/v1/2024.findings-acl.631 , pages =

work page doi:10.18653/v1/2024.findings-acl.631 2024

[50] [50]

2025 , eprint =

Qwen3 Technical Report , author =. 2025 , eprint =

2025

[51] [51]

Grattafiori, Aaron and others , year =. The. 2407.21783 , archivePrefix =

Pith/arXiv arXiv

[52] [52]

2024 , url =

Un Ministral, des Ministraux , author =. 2024 , url =

2024

[53] [53]

2024 , eprint =

Phi-4 Technical Report , author =. 2024 , eprint =

2024

[54] [54]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=

[55] [55]

arXiv preprint arXiv:1803.05457 , year=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv

[56] [56]

2019 , publisher=

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle=. 2019 , publisher=

2019

[57] [57]

2022 , publisher=

Lin, Stephanie and Hilton, Jacob and Evans, Owain , booktitle=. 2022 , publisher=

2022

[58] [58]

2020 , doi=

Sakaguchi, Keisuke and Le Bras, Ronan and Bhagavatula, Chandra and Choi, Yejin , booktitle=. 2020 , doi=

2020

[59] [59]

arXiv preprint arXiv:1909.08593 , year=

Fine-Tuning Language Models from Human Preferences , author=. arXiv preprint arXiv:1909.08593 , year=

Pith/arXiv arXiv 1909

[60] [60]

International Conference on Learning Representations , year=

Perfectly Secure Steganography Using Minimum Entropy Coupling , author=. International Conference on Learning Representations , year=

[61] [61]

IEEE Symposium on Security and Privacy , year=

Discop: Provably Secure Steganography in Practice Based on Distribution Copies , author=. IEEE Symposium on Security and Privacy , year=

[62] [62]

Proceedings of the 19th Conference of the

Huang, Yu-Shin and Just, Peter and Yin, Hanyun and Narayanan, Krishna and Huang, Ruihong and Tian, Chao , editor =. Proceedings of the 19th Conference of the. 2026 , address =. doi:10.18653/v1/2026.eacl-long.36 , isbn =

work page doi:10.18653/v1/2026.eacl-long.36 2026

[63] [63]

and Haghtalab, Nika and Steinhardt, Jacob , booktitle=

Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony T. and Haghtalab, Nika and Steinhardt, Jacob , booktitle=. Covert Malicious Finetuning: Challenges in Safeguarding

[64] [64]

Towards Safeguarding

Youstra, Jack and Mahfoud, Mohammed and Yan, Yang and Sleight, Henry and Perez, Ethan and Sharma, Mrinank , year=. Towards Safeguarding. 2508.17158 , archivePrefix=

arXiv

[65] [65]

2025 , eprint=

Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought under Process Supervision , author=. 2025 , eprint=

2025

[66] [66]

Fundamental Limitations in Defending

Davies, Xander and Winsor, Eric and Korbak, Tomek and Souly, Alexandra and Kirk, Robert and Schroeder de Witt, Christian and Gal, Yarin , year=. Fundamental Limitations in Defending. 2502.14828 , archivePrefix=

arXiv

[67] [67]

2024 , month = apr, howpublished =

Simple Probes Can Catch Sleeper Agents , author =. 2024 , month = apr, howpublished =

2024

[68] [68]

2025 , eprint=

Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors , author=. 2025 , eprint=

2025

[69] [69]

Proceedings of the 39th International Conference on Machine Learning , pages=

Linear Adversarial Concept Erasure , author=. Proceedings of the 39th International Conference on Machine Learning , pages=. 2022 , publisher=

2022

[70] [70]

2023 , url=

Belrose, Nora and Schneider-Joseph, David and Ravfogel, Shauli and Cotterell, Ryan and Raff, Edward and Biderman, Stella , booktitle=. 2023 , url=

2023

[71] [71]

International Conference on Learning Representations , year=

A Theory of Usable Information under Computational Constraints , author=. International Conference on Learning Representations , year=

[72] [72]

and Fenton, James L

Grassi, Paul A. and Fenton, James L. and Newton, Elaine M. and Perlner, Ray A. and Regenscheid, Andrew R. and Burr, William E. and Richer, Justin P. and Lefkovitz, Naomi B. and Danker, Jamie M. and Choong, Yee-Yin and Greene, Kristen K. and Theofanos, Mary F. , institution=. 2017 , doi=

2017

[73] [73]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =

Log-linear Guardedness and its Implications , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =

[74] [74]

Bell System Technical Journal , volume=

Communication Theory of Secrecy Systems , author=. Bell System Technical Journal , volume=

[75] [75]

Webster, A. F. and Tavares, S. E. , booktitle=. On the Design of. 1986 , publisher=

1986

[76] [76]

Journal of Computer and System Sciences , volume=

Universal Classes of Hash Functions , author=. Journal of Computer and System Sciences , volume=

[77] [77]

Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing , pages=

Pseudo-random Generation from One-way Functions , author=. Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing , pages=

[78] [78]

Advances in Cryptology --

Leftover Hash Lemma, Revisited , author=. Advances in Cryptology --. 2011 , publisher=

2011

[79] [79]

Digital Watermarking: 4th International Workshop, IWDW 2005 , series=

Information Transmission and Steganography , author=. Digital Watermarking: 4th International Workshop, IWDW 2005 , series=. 2005 , publisher=

2005

[80] [80]

Recent Advances in Intrusion Detection , series=

Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection , author=. Recent Advances in Intrusion Detection , series=. 2004 , publisher=

2004