pith. sign in

arxiv: 2606.10692 · v1 · pith:7VUHQXYLnew · submitted 2026-06-09 · 💻 cs.CR · cs.LG

Do LLMsMakeNeural Distinguishers Wise?

Pith reviewed 2026-06-27 12:28 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords neural distinguisherslarge language modelscryptanalysisSPECK-32/64prompt designResNetdifferential cryptanalysis
0
0 comments X

The pith

Large language models do not improve the performance of neural distinguishers compared to ResNet models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can enhance neural distinguishers used in cryptanalysis of symmetric ciphers. Neural distinguishers are machine learning models trained on plaintext-ciphertext pairs with specific differences to distinguish the cipher from random. The authors design prompts to use LLMs for this task and test them on the SPECK-32/64 cipher. Their experiments show that these LLM-based distinguishers do not outperform the ResNet models from previous work. They also find that using only XOR results in the prompt improves performance while high round counts make difference selection ineffective for both methods.

Core claim

LLM-based neural distinguishers built with prompt designs show no observable improvement in performance over ResNet-based neural distinguishers when tested on SPECK-32/64. At high rounds, the choice of differences loses effectiveness for LLM-based distinguishers just as it does for ResNet. The performance of LLM-based neural distinguishers improves significantly when the prompt includes only the results of the XOR operation.

What carries the argument

Prompt design for LLM-based neural distinguishers that input differential plaintext-ciphertext pairs to classify whether they come from the cipher or not.

If this is right

  • LLMs do not strengthen neural distinguishers beyond ResNet performance.
  • Difference choice is ineffective at high rounds for LLM-based neural distinguishers.
  • Using only XOR operation results in prompts significantly improves LLM-based neural distinguisher performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The task of distinguishing based on differential pairs may favor convolutional architectures like ResNet over language models.
  • Careful selection of what information to include in prompts can enhance LLM performance in technical domains like cryptanalysis.
  • The lack of improvement on this lightweight cipher raises questions about whether LLMs would help on more complex ciphers.

Load-bearing premise

The specific prompt designs and choice of SPECK-32/64 with the tested differences are representative enough to conclude that LLMs do not strengthen neural distinguishers in general.

What would settle it

Finding an LLM prompt or configuration that achieves higher accuracy than the ResNet baseline on SPECK-32/64 would falsify the claim of no improvement.

Figures

Figures reproduced from arXiv: 2606.10692 by Masashi Hisai, Naoto Yanai, Tatsuya Sakagami.

Figure 1
Figure 1. Figure 1: a and Figure 1b show accuracy and the F1 score for each model, respectively. According to the figures, we observe that, except for the F1 scores of qwen3-32b at round 8 and mixtral-8x7b at round 7 and round 8, the performance of the LLM-based neural distinguishers is lower than that of ResNet, which means no observable improvement. Focusing on the LLMs, we observe that gpt-oss-120b achieves higher accuracy… view at source ↗
Figure 2
Figure 2. Figure 2: PCA of ciphertexts at each round for 105 samples. (a) Round 5 (b) Round 6 (c) Round 7 (d) Round 8 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: PCA clustering of ciphertexts at each round for [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: PCA of ciphertexts at each round for 107 samples. (a) Round 5 (b) Round 6 (c) Round 7 (d) Round 8 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: PCA of ciphertexts at each round for 5 × 107 samples. not evaluate prompt settings that exclude the few-shot com￾ponent because we assume that the difference distinguishing task requires ciphertext pairs and/or differences. The resulting prompts are provided in A. B. Result [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The Impact of prompt design on the performance of LLM-based neural distinguishers for round 5. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Neural distinguishers are a cryptanalysis method for symmetric-key cryptography that trains machine learning models on pairs of plaintexts and ciphertexts with specific differences in order to recover a secret key. To the best of our knowledge, no existing work has explored the use of large language models (LLMs) for neural distinguishers. In this paper, we propose LLM-based neural distinguishers through a prompt design and conduct extensive experiments with them on SPECK-32/64 to investigate whether LLMs can strengthen neural distinguishers. We then found three key insights. First, by comparing the results of LLM-based neural distinguishers with ResNet in the existing work, we demonstrate that LLMs provide no observable improvement in the performance of neural distinguishers. Second, we confirm that, at high rounds, the choice of differences is no longer effective for LLM-based neural distinguishers as well as ResNet. Third, we show that the performance of LLM-based neural distinguishers can be significantly improved by incorporating only the XOR operation results as a prompt design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes prompt-based LLM neural distinguishers for cryptanalysis and reports experiments on SPECK-32/64. It claims three findings: (1) LLMs show no observable improvement over existing ResNet distinguishers, (2) at high rounds the choice of input differences loses effectiveness (as with ResNet), and (3) incorporating XOR operation results into the prompt yields significant performance gains for the LLM approach.

Significance. If the empirical comparisons hold after proper controls and replication, the work would indicate that current LLMs add little value over established CNNs for neural distinguishers on lightweight ARX ciphers, while underscoring prompt sensitivity (XOR variant). This could steer the field toward more targeted ML architectures rather than general-purpose LLMs for this task.

major comments (3)
  1. [Abstract] Abstract: the three findings are asserted without any reported metrics (accuracy, TPR/FPR), dataset sizes, number of rounds/differences tested, training details, or statistical tests, so it is impossible to determine whether the data support the claims. This is load-bearing for all three contributions.
  2. [Abstract (first finding)] The primary claim (no observable improvement over ResNet) rests on comparisons only for selected differences on SPECK-32/64; the manuscript itself notes that adding the XOR variant changes results, which directly indicates that performance is sensitive to prompt details and undermines the generality of the 'no improvement' conclusion.
  3. [Abstract (second and third findings)] The second finding (differences ineffective at high rounds) and third finding (XOR improvement) require explicit quantification of round counts, exact differences, and the magnitude of the reported gains versus the non-XOR baseline to be evaluable; without these the claims cannot be assessed for reproducibility or effect size.
minor comments (2)
  1. [Title] Title contains an apparent typographical error ('Do LLMsMakeNeural' lacks spaces).
  2. [Abstract] Abstract uses inconsistent tense ('we then found' after present-tense description of the proposal).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and evaluability of the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the three findings are asserted without any reported metrics (accuracy, TPR/FPR), dataset sizes, number of rounds/differences tested, training details, or statistical tests, so it is impossible to determine whether the data support the claims. This is load-bearing for all three contributions.

    Authors: We agree that the abstract should contain sufficient quantitative details for the claims to be assessed. The full manuscript reports accuracy metrics, dataset sizes (typically 10^6 samples), rounds tested (1-8 for SPECK-32/64), specific differences, and training procedures, but these were omitted from the abstract for brevity. In the revision we will expand the abstract to include representative accuracy values, round ranges, and a note on the experimental scale. revision: yes

  2. Referee: [Abstract (first finding)] The primary claim (no observable improvement over ResNet) rests on comparisons only for selected differences on SPECK-32/64; the manuscript itself notes that adding the XOR variant changes results, which directly indicates that performance is sensitive to prompt details and undermines the generality of the 'no improvement' conclusion.

    Authors: The first finding concerns the base prompt (without XOR) versus ResNet on the tested differences; the XOR variant is introduced separately as the third finding and is not part of the base comparison. We therefore maintain that the 'no improvement' statement holds for the standard prompt design. However, we acknowledge the referee's point on prompt sensitivity and will revise the abstract to explicitly separate the base results from the XOR-enhanced results while noting that performance depends on prompt formulation. revision: partial

  3. Referee: [Abstract (second and third findings)] The second finding (differences ineffective at high rounds) and third finding (XOR improvement) require explicit quantification of round counts, exact differences, and the magnitude of the reported gains versus the non-XOR baseline to be evaluable; without these the claims cannot be assessed for reproducibility or effect size.

    Authors: We agree that the abstract must supply these quantities. The manuscript already contains the details (rounds 5-8 for the high-round regime, concrete differences such as 0x0040/0x0000, and accuracy deltas between XOR and non-XOR prompts). We will add explicit round counts, example differences, and quantified gains (e.g., accuracy improvement of X percentage points) to the abstract and ensure the main text tables are referenced there. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results stand independently on experiments.

full rationale

The paper's central claim rests on direct experimental comparison of LLM prompt-based distinguishers versus published ResNet accuracies for SPECK-32/64. No equations, parameter fits, derivations, or self-citations appear in the provided text. The three insights are stated as outcomes of the runs, with no reduction of any result to its own inputs by construction. The argument is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5709 in / 1099 out tokens · 20697 ms · 2026-06-27T12:28:58.115131+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 1 linked inside Pith

  1. [1]

    Improving attacks on round-reduced speck32/64 using deep learning,

    A. Gohr, “Improving attacks on round-reduced speck32/64 using deep learning,” inAnnual International Cryptology Conference. Springer, 2019, pp. 150–179

  2. [2]

    Machine learning in side-channel analysis: a first study,

    G. Hospodar, B. Gierlichs, E. De Mulder, I. Verbauwhede, and J. Vande- walle, “Machine learning in side-channel analysis: a first study,”Journal of Cryptographic Engineering, vol. 1, no. 4, pp. 293–302, 2011

  3. [3]

    Differential-ml distinguisher: Machine learn- ing based generic extension for differential cryptanalysis,

    T. Yadav and M. Kumar, “Differential-ml distinguisher: Machine learn- ing based generic extension for differential cryptanalysis,” inProc. of LATINCRYPT, ser. LNCS, P. Longa and C. R `afols, Eds., vol. 12912. Springer, 2021, pp. 191–212

  4. [4]

    Machine learning-assisted differential distinguishers for lightweight ciphers,

    A. Baksi, “Machine learning-assisted differential distinguishers for lightweight ciphers,” inClassical and Physical Security of Symmetric Key Cryptographic Algorithms. Springer, 2022, pp. 141–162

  5. [5]

    Enhancing neural distinguishers with partial difference bits leakage,

    Y . Hu, L. Li, S. Zhu, and Z. Hu, “Enhancing neural distinguishers with partial difference bits leakage,”Internet Things, vol. 29, p. 101438, 2025

  6. [6]

    Output prediction attacks on block ciphers using deep learning,

    H. Kimura, K. Emura, T. Isobe, R. Ito, K. Ogawa, and T. Ohigashi, “Output prediction attacks on block ciphers using deep learning,” in Proc. of ACNSW 2022, ser. LNCS, vol. 13285. Springer, 2022, pp. 248–276

  7. [7]

    Enhanced neural distinguisher model for efficient differential cryptanalysis,

    Y . Lu, Y . Guo, W. Liu, W. Chen, Q. Yan, and B. Yu, “Enhanced neural distinguisher model for efficient differential cryptanalysis,”IEEE Internet of Things Journal, 2025

  8. [8]

    A new (related-key) neural distinguisher using two differences for differential cryptanalysis,

    G. Wang, G. Wang, and S. Sun, “A new (related-key) neural distinguisher using two differences for differential cryptanalysis,”IET Information Security, vol. 2024, no. 1, pp. 1–11, 2024

  9. [9]

    A deeper look at ma- chine learning-based cryptanalysis,

    A. Benamira, D. Gerault, T. Peyrin, and Q. Q. Tan, “A deeper look at ma- chine learning-based cryptanalysis,” inAnnual international conference on the theory and applications of cryptographic techniques. Springer, 2021, pp. 805–835

  10. [10]

    Improve neural distinguishers of simon and speck,

    Z. Hou, J. Ren, and S. Chen, “Improve neural distinguishers of simon and speck,”Security and Communication Networks, vol. 2021, no. 1, p. 9288229, 2021

  11. [11]

    Neural distinguishers on tinyjambu-128 and gift-64,

    T. Sun, D. Shen, S. Long, Q. Deng, and S. Wang, “Neural distinguishers on tinyjambu-128 and gift-64,” inInternational Conference on Neural Information Processing. Springer, 2022, pp. 419–431

  12. [12]

    Enhancing differential-neural cryptanalysis,

    Z. Bao, J. Guo, M. Liu, L. Ma, and Y . Tu, “Enhancing differential-neural cryptanalysis,” inInternational conference on the theory and application of cryptology and information security. Springer, 2022, pp. 318–347

  13. [13]

    Deep neural networks aiding cryptanalysis: A case study of the speck distinguisher,

    N. B ˘acuiet,i, L. Batina, and S. Picek, “Deep neural networks aiding cryptanalysis: A case study of the speck distinguisher,” inInternational Conference on Applied Cryptography and Network Security. Springer, 2022, pp. 809–829

  14. [14]

    Enhanced neu- ral differential distinguisher for speck32/64 using attention mechanisms and multi ciphertext inputs,

    X. Jiang, M. Li, M. Kaiyrbek, V . Lakhno, and S. Andrii, “Enhanced neu- ral differential distinguisher for speck32/64 using attention mechanisms and multi ciphertext inputs,”Informatica, vol. 49, no. 19, 2025

  15. [15]

    Is gpt-4 a good data analyst?

    L. Cheng, X. Li, and L. Bing, “Is gpt-4 a good data analyst?”arXiv preprint arXiv:2305.15038, 2023

  16. [16]

    Prompt- to-sql injections in llm-integrated web applications: Risks and defenses,

    R. Pedro, M. E. Coimbra, D. Castro, P. Carreira, and N. Santos, “Prompt- to-sql injections in llm-integrated web applications: Risks and defenses,” inProc. of ICSE 2025, 2025, pp. 1768–1780

  17. [17]

    Red-teaming llm multi-agent systems via communication attacks,

    P. He, Y . Lin, S. Dong, H. Xu, Y . Xing, and H. Liu, “Red-teaming llm multi-agent systems via communication attacks,” inProc. of ACL 2025, 2025, pp. 6726–6747

  18. [18]

    PentestGPT: Evaluating and harnessing large language models for automated penetration testing,

    G. Deng, Y . Liu, V . Mayoral-Vilches, P. Liu, Y . Li, Y . Xu, T. Zhang, Y . Liu, M. Pinzger, and S. Rass, “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” inProc. of USENIX Security 2024. USENIX Association, 2024, pp. 847–864

  19. [19]

    A novel approach to construct a good dataset for differential-neural cryptanalysis,

    B. Seok and C. Lee, “A novel approach to construct a good dataset for differential-neural cryptanalysis,”IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 1, pp. 246–262, 2025

  20. [20]

    The SIMON and SPECK families of lightweight block ciphers,

    R. Beaulieu, D. Shors, J. Smith, S. Treatman-Clark, B. Weeks, and L. Wingers, “The SIMON and SPECK families of lightweight block ciphers,” Cryptology ePrint Archive, Paper 2013/404, 2013. [Online]. Available: https://eprint.iacr.org/2013/404

  21. [21]

    Investigating and enhancing the neural distinguisher for differential cryptanalysis,

    G. Wang, G. Wang, and S. Sun, “Investigating and enhancing the neural distinguisher for differential cryptanalysis,”IEICE Transactions on Information and Systems, vol. E107-D, no. 8, pp. 1016–1028, 2024

  22. [22]

    Benchmarking large language models for cryptanalysis and mismatched-generalization,

    U. Maskey, C. Zhu, and U. Naseem, “Benchmarking large language models for cryptanalysis and mismatched-generalization,”arXiv preprint arXiv:2505.24621, 2025

  23. [23]

    Language mod- els are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  24. [24]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  25. [25]

    A new neural distinguisher con- sidering features derived from multiple ciphertext pairs,

    Y . Chen, Y . Shen, H. Yu, and S. Yuan, “A new neural distinguisher con- sidering features derived from multiple ciphertext pairs,”The Computer Journal, vol. 66, no. 6, pp. 1419–1433, 2022

  26. [26]

    Improving deep learning-based neural distinguisher with multiple ciphertext pairs for speck and simon,

    Y . Hou, J. Liu, S. Han, Z. Ma, X. Ye, and X. Nie, “Improving deep learning-based neural distinguisher with multiple ciphertext pairs for speck and simon,”Scientific Reports, vol. 15, no. 1, p. 13696, 2025

  27. [27]

    Improved (related-key) differential-based neural distinguishers for simon and simeck block ciphers,

    J. Lu, G. Liu, B. Sun, C. Li, and L. Liu, “Improved (related-key) differential-based neural distinguishers for simon and simeck block ciphers,”The Computer Journal, vol. 67, no. 2, pp. 537–547, 01 2023

  28. [28]

    Deep learning- based differential distinguishers for nist standard authenticated encryp- tion and permutations,

    D. Pal, M. Chaudhury, A. Das, and D. R. Chowdhury, “Deep learning- based differential distinguishers for nist standard authenticated encryp- tion and permutations,” inProc. of ICMC 2024, ser. LNNS, vol. 963. Springer, 2024, pp. 1–13

  29. [29]

    On the effects of neural network- based output prediction attacks on the design of symmetric-key ciphers,

    H. Watanabe, R. Ito, and T. Ohigashi, “On the effects of neural network- based output prediction attacks on the design of symmetric-key ciphers,” Journal of Information Security and Applications, vol. 90, p. 104016, 2025

  30. [30]

    Bayesian modeling for differential cryptanalysis of block ciphers: A des instance,

    V . Agate, F. Concone, A. De Paola, P. Ferraro, G. L. Re, and M. Morana, “Bayesian modeling for differential cryptanalysis of block ciphers: A des instance,”IEEE Access, vol. 11, pp. 4809–4820, 2023

  31. [31]

    Ml based improved differential distinguisher with high accuracy: Application to gift-128 and ascon,

    T. Yadav and M. Kumar, “Ml based improved differential distinguisher with high accuracy: Application to gift-128 and ascon,” inSecurity, Privacy, and Applied Cryptography Engineering, ser. Lecture Notes in Computer Science, vol. 15351. Springer, 2025, pp. 287–316

  32. [32]

    Rethinking learning- based symmetric cryptanalysis: a theoretical perspective,

    Y . Yuan, H. Xu, L. Zhang, and W. Wu, “Rethinking learning- based symmetric cryptanalysis: a theoretical perspective,”IACR Cryptology ePrint Archives, p. 1306, 2025. [Online]. Available: https://eprint.iacr.org/2025/1306

  33. [33]

    A survey of large language models for cyber threat detection,

    Y . Chen, M. Cui, D. Wang, Y . Cao, P. Yang, B. Jiang, Z. Lu, and B. Liu, “A survey of large language models for cyber threat detection,” Computers & Security, vol. 145, p. 104016, 2024

  34. [34]

    Llm-based attack scenarios generator with it asset management and vulnerability information,

    T. Naito, R. Watanabe, and T. Mitsunaga, “Llm-based attack scenarios generator with it asset management and vulnerability information,” in Proc. of ICSPIS 2023. IEEE, 2023, pp. 99–103

  35. [35]

    Pentestagent: Incorporating llm agents to automated penetration testing,

    X. Shen, L. Wang, Z. Li, Y . Chen, W. Zhao, D. Sun, J. Wang, and W. Ruan, “Pentestagent: Incorporating llm agents to automated penetration testing,” inProc. of AsiaCCS 2025. ACM, 2025, pp. 375– 391

  36. [36]

    Qlora: Efficient finetuning of quantized llms,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,”Advances in neural information processing systems, vol. 36, pp. 10 088–10 115, 2023

  37. [37]

    Do NOT think that much for 2+3=? On the overthinking of long reasoning models,

    X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu, “Do NOT think that much for 2+3=? On the overthinking of long reasoning models,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 267. PMLR, 2025, pp. 9487–9499

  38. [38]

    Stop overthinking: A survey on efficient reasoning for large language models,

    Y . Sui, Y .-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, H. Chen, and X. Hu, “Stop overthinking: A survey on efficient reasoning for large language models,”Transactions on Machine Learning Research, 2025. [Online]. Available: https://openreview.net/forum?id=HvoG8SxggZ APPENDIXA PROMPTEXAMPLE We list the system prompt...

  39. [41]

    Output only the final answer: 0 or 1. [Input] CXORC ′ : 0xf446|0x5165 [Output] Label : ’1’ System Prompt (w/ XOR, w/ C pair, w/ CoT) [Instruction] Please determine if the ciphertext pair comes from plain- texts with difference 0x0040/0000 (output 1) or random plaintexts (output 0). Output should be either 0 or 1 only. The encryption algorithm used is 5-ro...

  40. [45]

    Output only the final answer: 0 or 1. [Input] C:0x0051|0x35b5 C ′:0xf417|0x64d0 CXORC ′ : 0xf446|0x5165 [Output] Label : ’1’ System Prompt (w/ XOR, w/ C pair, w/o CoT) [Instruction] Please determine if the ciphertext pair comes from plain- texts with difference of 0x0040/0000 (output 1) or ran- dom plaintexts (output 0). Output should be either 0 or 1 onl...

  41. [46]

    Compute the XOR of the two ciphertext halves (left and right)

  42. [47]

    Compare the XOR result with the expected difference pattern (0x0040 for left half, 0x0000 for right half)

  43. [48]

    If the XOR pattern is consistent with the expected difference (or close to it), output 1, otherwise, output 0

  44. [49]

    [Input] C:0x0051|0x35b5 C ′:0xf417|0x64d0 [Output] Label : ’1’

    Output only the final answer: 0 or 1. [Input] C:0x0051|0x35b5 C ′:0xf417|0x64d0 [Output] Label : ’1’