arxiv: 2604.21394 · v3 · submitted 2026-04-23 · 💻 cs.CR

Recognition: unknown

Provably Secure Steganography Based on List Decoding

Kaiyi Pang , Minhao Bai

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:51 UTC · model grok-4.3

classification 💻 cs.CR

keywords steganographylist decodingprovable securitylanguage modelsembedding capacitysuffix matchingcovert communicationcomputational indistinguishability

0 comments

The pith

List decoding lets steganography maintain candidate messages to reach higher provable capacity while preserving security.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing provably secure steganography schemes waste much of the entropy in language-model output and therefore achieve only low embedding rates, especially with large models that favor low-entropy text. The paper replaces single-message search with list decoding: the sender produces a short list of candidate texts guaranteed to contain the secret message, then the receiver uses a suffix-matching step to pick the right one. Theoretical arguments establish both computational indistinguishability from ordinary covertext and perfect message recovery, together with a concrete lower bound on the achievable rate. The construction is realized by swapping only the sampling routine inside any existing language model.

Core claim

The scheme keeps a list of candidate messages that is guaranteed to contain the secret, applies a suffix-matching procedure to recover it unambiguously, proves that the resulting stegotext remains computationally indistinguishable from ordinary model output, and derives a theoretical lower bound on embedding capacity that exceeds prior provably secure methods.

What carries the argument

List decoding that retains a small set of candidate messages containing the secret, combined with suffix matching to select the correct message from that set.

If this is right

Embedding capacity rises while security proofs and computational cost remain comparable to earlier PSS schemes.
The method works on any language model by replacing only its sampling step.
A capacity lower bound is proved that accounts for the list size and the entropy of the model.
The approach directly addresses the low-entropy bias of modern LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same list-decoding pattern could be tested on image or audio generators to see whether capacity gains appear outside text.
If suffix matching proves robust in practice, the scheme could be combined with existing error-correcting codes to handle channel noise.
The theoretical capacity bound might serve as a benchmark for measuring how much entropy other steganographic constructions actually use.

Load-bearing premise

Suffix matching can always isolate the correct secret message from the candidate list without extra shared secrets or detectable patterns.

What would settle it

A concrete attack that either distinguishes the generated stegotext from ordinary model output or prevents the receiver from recovering the exact secret message via suffix matching.

Figures

Figures reproduced from arXiv: 2604.21394 by Kaiyi Pang, Minhao Bai.

**Figure 2.** Figure 2: The games used in security proof. • 𝐺3: EncodeModel,1 (ℎ) using truly random numbers to draw a single sample via alias sampling, with the secret message 𝑚∗ omitted. The detailed procedure of AliSample is given in Appendix A.1; We illustrate the games involved in the hybrid proof in [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: The information utilization rate of several existing provable secure steganography schemes [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Information utilization rate of our method for dif [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

Steganography embeds secret messages in seemingly innocuous carriers for covert communication under surveillance. Current Provably Secure Steganography (PSS) schemes based on language models can guarantee computational indistinguishability between the covertext and stegotext. However, achieving high embedding capacity remains a challenge for existing PSS. The inefficient entropy utilization renders them not well-suited for Large Language Models (LLMs), whose inherent low-entropy tendencies severely constrain feasible embedding capacity. To address this, we propose a provably secure steganography scheme with a theoretically proved high capacity. Our scheme is based on the concept of list decoding: it maintains a set of candidates that contain the correct secret message, instead of directly finding the correct message with more effort. This strategy fully utilizes the information content of the generated text, yielding higher capacity. To ensure the correctness of our scheme, we further introduce a suffix-matching mechanism to distinguish the correct secret message from the candidates. We provide theoretical proofs for both the security and correctness of our scheme, alongside a derivation of its theoretical capacity lower bound. Our approach is plug-and-play, requiring only a direct replacement of the model's standard random sampling module. Experiments on three LLMs and seven PSS baselines demonstrate that our method achieves computational efficiency comparable to prior PSS schemes while delivering a substantial improvement in embedding capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a provably secure steganography scheme for LLMs based on list decoding: it generates a list of candidate messages guaranteed to contain the secret message and uses a suffix-matching mechanism to recover it, claiming this yields higher embedding capacity than prior PSS methods. It asserts theoretical proofs for security (indistinguishability), correctness (reliable decoding), and a capacity lower bound, presents the scheme as a plug-and-play replacement for standard sampling, and reports empirical capacity gains over seven baselines on three LLMs with comparable efficiency.

Significance. If the security reduction and correctness proof hold with the claimed parameters, and the capacity lower bound is rigorously derived without hidden dependencies on matching success rate, the work would meaningfully advance PSS for LLMs by improving entropy utilization. The plug-and-play design and experimental comparison to baselines are strengths that support practical relevance.

major comments (3)

[Scheme Construction and Correctness Proof] The correctness argument for the suffix-matching mechanism (described in the scheme construction and correctness proof) is load-bearing for both the claimed decoding reliability and the capacity lower bound. It is unclear how the suffix is deterministically derived from the shared key and stegotext to guarantee unique extraction with overwhelming probability, especially under low-entropy model outputs that could produce collisions or ambiguous matches.
[Security Proof] The security reduction (likely in the security theorem) must explicitly bound the adversary's distinguishing advantage after accounting for the list-decoding step and the suffix-matching procedure. If matching is deterministic and simulatable by the adversary, or if list maintenance leaks information, the reduction to the underlying PRG or model indistinguishability may not go through as stated.
[Capacity Lower Bound Derivation] The capacity lower bound derivation assumes the list always contains the correct message and that matching succeeds with probability 1 (or negligibly close). Without an explicit expression relating list size, matching failure probability, and the resulting rate, it is not possible to verify that the bound strictly improves on prior PSS capacity results.

minor comments (2)

[Abstract] The abstract states that proofs are provided but does not indicate whether they appear in the main body or an appendix; moving key proof sketches to the main text would improve accessibility.
[Experiments] The experimental section should report the exact list sizes used and the observed matching success rates across the three LLMs to allow readers to assess how close the empirical capacity comes to the theoretical lower bound.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments on our manuscript. The points raised regarding the suffix-matching mechanism, security reduction, and capacity lower bound are important for ensuring the proofs are fully rigorous and transparent. We address each major comment below and will revise the manuscript to incorporate additional details and clarifications in the scheme construction, proofs, and derivations.

read point-by-point responses

Referee: The correctness argument for the suffix-matching mechanism (described in the scheme construction and correctness proof) is load-bearing for both the claimed decoding reliability and the capacity lower bound. It is unclear how the suffix is deterministically derived from the shared key and stegotext to guarantee unique extraction with overwhelming probability, especially under low-entropy model outputs that could produce collisions or ambiguous matches.

Authors: We agree that the derivation and collision analysis require more explicit treatment. In the construction, the suffix is computed deterministically as a pseudorandom function (keyed by the shared secret) applied to a fixed-length prefix of the stegotext; the receiver recomputes the same suffix from the received stegotext and key to identify the matching candidate. The correctness proof shows that the list always contains the correct message (by list-decoding design) and that the probability of a spurious suffix match is at most 2^{-l} (where l is the suffix length chosen as a security parameter), which is negligible even when model entropy is low because the PRF output is independent of the low-entropy token distribution. We will revise the correctness section to include this explicit derivation, a formal collision bound, and a short argument that low-entropy outputs do not increase the collision probability beyond the negligible term. revision: yes
Referee: The security reduction (likely in the security theorem) must explicitly bound the adversary's distinguishing advantage after accounting for the list-decoding step and the suffix-matching procedure. If matching is deterministic and simulatable by the adversary, or if list maintenance leaks information, the reduction to the underlying PRG or model indistinguishability may not go through as stated.

Authors: The security theorem already reduces to the indistinguishability of the underlying model (or PRG) by showing that both list generation and suffix computation are simulatable given only the covertext distribution. Because the list is produced by the same sampling process as ordinary generation and the suffix is a deterministic function of the key and stegotext (unknown to the adversary), the entire view is computationally indistinguishable; the distinguishing advantage is therefore bounded by the model advantage plus a negligible term from the PRF. We will expand the security proof to write out this hybrid argument explicitly, including the simulation of list maintenance and suffix matching, and state the concrete bound on the adversary's advantage. revision: yes
Referee: The capacity lower bound derivation assumes the list always contains the correct message and that matching succeeds with probability 1 (or negligibly close). Without an explicit expression relating list size, matching failure probability, and the resulting rate, it is not possible to verify that the bound strictly improves on prior PSS capacity results.

Authors: The capacity theorem derives a lower bound of the form R >= (log_2 K - o(1)) / n, where K is the list size and the o(1) term absorbs the negligible matching failure probability proved in the correctness theorem. We will revise the capacity section to display the full expression that includes the explicit matching-failure term (log_2(1/(1-epsilon)) where epsilon is the collision probability) and to compare it directly with the entropy-utilization rates of prior PSS schemes, confirming the strict improvement for the chosen parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

full rationale

The paper constructs a list-decoding steganography scheme, introduces a suffix-matching extractor for correctness, and asserts independent theoretical proofs of security, correctness, and a capacity lower bound. No quoted equations, parameter fits, or self-citations in the abstract or described chain reduce the claimed capacity bound or security reduction to a tautology or to the input data by construction. The proofs are presented as external to the fitted values, and the scheme is described as a direct modular replacement of sampling without self-referential parameters. This satisfies the default expectation of a non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions from provable security in steganography and the sampling properties of LLMs; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Computational indistinguishability between covertext and stegotext under standard cryptographic assumptions
Invoked as the security guarantee for existing PSS schemes and carried over to the new construction.

pith-pipeline@v0.9.0 · 5524 in / 1198 out tokens · 35777 ms · 2026-05-09T21:51:24.808461+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 7 canonical work pages · 4 internal anchors

[1]

Jinze Bai, Shuai Bai, and Yunfei Chu et. al. 2023. Qwen Technical Report.arXiv preprint arXiv:2309.16609(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Minhao Bai, Kaiyi Pang, Guorui Liao, Jinshuai Yang, and Yongfeng Huang. 2025. Shimmer: a Provably Secure Steganography Based on Entropy Collecting Mech- anism. In34th USENIX Security Symposium (USENIX Security 25). 5949–5965

2025
[3]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Distribution Copies

Jinyang Ding, Kejiang Chen, Yaofei Wang, Na Zhao, Weiming Zhang, and Neng- hai Yu. 2023. Discop: Provably Secure Steganography in Practice Based on “Distribution Copies”. In2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2238–2255

2023
[5]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Tina Fang, Martin Jaggi, and Katerina Argyraki. 2017. Generating Steganographic Text with LSTMs. InProceedings of ACL 2017, Student Research Workshop. 100– 106

2017
[7]

Guruswami and M

V. Guruswami and M. Sudan. 1999. Improved decoding of Reed-Solomon and algebraic-geometry codes.IEEE Transactions on Information Theory45, 6 (1999), 1757–1767. https://doi.org/10.1109/18.782097

work page doi:10.1109/18.782097 1999
[8]

Nicholas J. Hopper. 2004.Toward a Theory of Steganography. Ph.D. thesis. Carnegie Mellon University, Pittsburgh, PA

2004
[9]

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al . 2023. Mistral 7B.arXiv preprint arXiv:2310.06825(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Jois, Matthew Green, and Aviel D

Gabriel Kaptchuk, Tushar M. Jois, Matthew Green, and Aviel D. Rubin. 2021. Meteor: Cryptographically Secure Steganography for Realistic Distributions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security(Virtual Event, Republic of Korea)(CCS ’21). Association for Computing Machinery, New York, NY, USA, 1529–1548. https:...

work page doi:10.1145/3460120 2021
[11]

Guorui Liao, Jinshuai Yang, Weizhi Shao, and Yongfeng Huang. 2025. A frame- work for designing provably secure steganography. In34th USENIX Security Symposium (USENIX Security 25). 6837–6856

2025
[12]

Jinjie Ni, Fuzhao Xue, Yuntian Deng, Jason Phang, Kabir Jain, Mahir Hitesh Shah, Zangwei Zheng, and Yang You. 2023. Instruction in the Wild: A User-based Instruction Dataset. https://github.com/XueFuzhao/InstructionWild

2023
[13]

Madhu Sudan. 1997. Decoding of Reed Solomon Codes beyond the Error- Correction Bound.J. Complex.13, 1 (March 1997), 180–193. https://doi.org/10. 1006/jcom.1997.0439

work page arXiv 1997
[14]

Alastair J Walker. 1974. New fast method for generating discrete random numbers with arbitrary frequency distributions.Electronics Letters10, 8 (1974), 127–128

1974
[15]

Alastair J Walker. 1977. An efficient method for generating discrete random variables with general distributions.ACM Transactions on Mathematical Software (TOMS)3, 3 (1977), 253–256

1977
[16]

Yaofei Wang, Gang Pei, Kejiang Chen, Jinyang Ding, Chao Pan, Weilong Pang, Donghui Hu, and Weiming Zhang. 2025. SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling. In34th USENIX Security Symposium Kaiyi Pang and Minhao Bai 5 7 9 11 13 15 17 19 20 100 200 300 400 500 600 0.768 0.753 0.853 0.910 0.927 0.940 0.947 0.966 0.968 0.717 0....

2025
[17]

Zhongliang Yang, Yongfeng Huang, and Yujin Zhang. 2020. TS-CSW: Text steganalysis and hidden capacity estimation based on convolutional sliding windows.Multimedia Tools and Applications79 (2020), 18293–18316

2020
[18]

Zhong-Liang Yang, Xiao-Qing Guo, Zi-Ming Chen, Yong-Feng Huang, and Yu-Jin Zhang. 2018. RNN-stega: Linguistic steganography based on recurrent neural networks.IEEE Transactions on Information Forensics and Security14, 5 (2018), 1280–1295

2018
[19]

Zhong-Liang Yang, Si-Yu Zhang, Yu-Ting Hu, Zhi-Wen Hu, and Yong-Feng Huang
[20]

IEEE Transactions on Information Forensics and Security16 (2020), 880–895

VAE-Stega: linguistic steganography based on variational auto-encoder. IEEE Transactions on Information Forensics and Security16 (2020), 880–895

2020
[21]

Zachary Ziegler, Yuntian Deng, and Alexander M Rush. 2019. Neural Linguistic Steganography. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 1210–1215. Provably Secure Steganography Based on List Decoding Algorithm 3Alias Sampl...

2019