Detecting Functional Memorization in Code Language Models
Pith reviewed 2026-06-27 07:07 UTC · model grok-4.3
The pith
Code language models memorize functional logic beyond what text overlap detects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors compare a midtrained model exposed to target code against its pretrained reference, which was not exposed. They measure the functional similarity of prompted generations with execution-based tests and LLM-as-a-judge evaluations, and report clear evidence that the midtrained model produces more functionally equivalent outputs even when textual overlap is low.
What carries the argument
Counterfactual midtraining comparison paired with execution-based and LLM-as-judge functional similarity metrics on Python function signatures.
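To make the execution-based half of this machinery concrete, here is a minimal sketch of how a generation might be scored against a target function. It is not the authors' harness: the helper names (`load_function`, `run_case`, `functional_agreement`), the `clamp` example, and the test inputs are all hypothetical, and the sketch assumes the code is trusted, whereas a real audit would need sandboxing.

```python
from typing import Any, Callable, Iterable, Tuple


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute source text and return the function called `name`.

    Assumption: the source is trusted. A real audit would run model
    generations inside a sandbox with time and memory limits.
    """
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def run_case(fn: Callable[..., Any], args: Tuple[Any, ...]) -> Tuple[str, Any]:
    """Return ("ok", value), or ("error", exception type) if the call raises."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def functional_agreement(
    target_src: str,
    generated_src: str,
    name: str,
    cases: Iterable[Tuple[Any, ...]],
) -> float:
    """Fraction of test inputs on which both functions behave identically."""
    target = load_function(target_src, name)
    generated = load_function(generated_src, name)
    case_list = list(cases)
    if not case_list:
        raise ValueError("at least one test case is required")
    agree = sum(run_case(target, c) == run_case(generated, c) for c in case_list)
    return agree / len(case_list)


# Hypothetical target code and a textually different generation.
TARGET = '''
def clamp(x, lo, hi):
    return max(lo, min(x, hi))
'''

GENERATED = '''
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
'''

INPUTS = [(5, 0, 10), (-3, 0, 10), (42, 0, 10), (0, 0, 0)]
score = functional_agreement(TARGET, GENERATED, "clamp", INPUTS)
print(f"functional agreement: {score:.2f}")
```

Treating a raised exception as an outcome, rather than a failure of the harness, lets two functions that both reject the same bad input still count as agreeing on it.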
If this is right
- Textual overlap metrics alone are insufficient to detect all forms of data leakage in code models.
- Auditing pipelines must incorporate functional equivalence checks to capture memorization of logic.
- Exposure during training can influence model outputs in ways not visible through string matching.
- Functional memorization broadens what counts as successful extraction of training data.
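The first implication can be illustrated with a toy case in which a textual metric sees little overlap while behavior is identical. The sketch below uses token 3-gram Jaccard overlap as a stand-in textual metric; that choice, the tokenizer regex, and the `sum_squares` example are assumptions made for illustration, not the metrics or data used in the paper.

```python
import re


def tokens(src: str) -> list:
    """Crude tokenizer: identifiers/numbers, or single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", src)


def ngrams(src: str, n: int = 3) -> set:
    """Set of token n-grams in the source text."""
    toks = tokens(src)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of token n-grams (a stand-in textual metric)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


# Two hypothetical implementations of the same logic.
ITERATIVE = '''
def sum_squares(n):
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total
'''

CLOSED_FORM = '''
def sum_squares(n):
    return n * (n + 1) * (2 * n + 1) // 6
'''

ns_a, ns_b = {}, {}
exec(ITERATIVE, ns_a)    # assumption: trusted code, no sandbox needed here
exec(CLOSED_FORM, ns_b)

overlap = jaccard(ITERATIVE, CLOSED_FORM)
same_behavior = all(
    ns_a["sum_squares"](n) == ns_b["sum_squares"](n) for n in range(50)
)
print(f"3-gram overlap: {overlap:.2f}, same behavior on 0..49: {same_behavior}")
```

A string-matching audit that thresholds on the overlap score would pass this pair, even though the two functions return the same value on every tested input.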
Where Pith is reading between the lines
- The same counterfactual approach could be adapted to measure memorization in non-code domains where paraphrased outputs matter.
- Execution-based tests open a path to automated, scalable audits that do not require human review of every generation.
- If functional memorization proves widespread, training data curation may need to account for behavioral as well as textual uniqueness.
Load-bearing premise
The only difference between the midtrained model and the reference model is exposure to the target code samples.
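One way to probe this premise is a third condition, such as the matched continued-pretraining control raised in the referee report. The sketch below shows the arithmetic for splitting the observed lift into a general capability gain and an exposure-attributable remainder. The class name, field names, and all counts are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


def rate(outcomes: Sequence[bool]) -> float:
    """Fraction of prompts whose generation was judged functionally equivalent."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


@dataclass(frozen=True)
class Attribution:
    reference_rate: float    # pretrained checkpoint, no extra training
    control_rate: float      # continued pretraining on unrelated code
    midtrained_rate: float   # midtraining that includes the target code

    @property
    def capability_gain(self) -> float:
        """Lift explained by extra optimization steps alone."""
        return self.control_rate - self.reference_rate

    @property
    def exposure_lift(self) -> float:
        """Lift remaining after subtracting the control condition."""
        return self.midtrained_rate - self.control_rate


# Made-up outcomes over 50 hypothetical prompts per condition.
reference = [True] * 10 + [False] * 40
control = [True] * 15 + [False] * 35
midtrained = [True] * 30 + [False] * 20

result = Attribution(rate(reference), rate(control), rate(midtrained))
print(f"capability gain: {result.capability_gain:.2f}")
print(f"exposure lift:   {result.exposure_lift:.2f}")
```

If the exposure lift were near zero while the capability gain was large, the two-model comparison alone would have overstated memorization.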
What would settle it
A finding that generations from the midtrained model and the reference model pass the target code's functional tests at the same rate, with no measurable difference.
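Because both models are prompted with the same signatures, "no measurable difference" can be framed as a paired test on per-prompt pass/fail outcomes. The sketch below uses an exact McNemar test; the review does not say which paired test the authors ran, so that choice is an assumption, and the outcome lists are invented.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    mid_pass: Sequence[bool], ref_pass: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired pass/fail outcomes.

    Returns (b, c, p) where b counts prompts only the midtrained model
    passes and c counts prompts only the reference model passes.
    """
    if len(mid_pass) != len(ref_pass):
        raise ValueError("outcome lists must be paired")
    b = sum(1 for m, r in zip(mid_pass, ref_pass) if m and not r)
    c = sum(1 for m, r in zip(mid_pass, ref_pass) if r and not m)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, nothing to distinguish
    # Under the null, discordant pairs split 50/50 between b and c.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented outcomes for 20 hypothetical prompts.
mid_outcomes = [True] * 12 + [False] * 8
ref_outcomes = [True] * 4 + [False] * 8 + [True] * 1 + [False] * 7

b, c, p_value = mcnemar_exact(mid_outcomes, ref_outcomes)
print(f"mid-only passes: {b}, ref-only passes: {c}, p = {p_value:.4f}")
```

Only the discordant prompts carry information here; prompts that both models pass or both fail do not move the p-value.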
Original abstract
Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that code LLMs exhibit functional memorization beyond textual overlap, demonstrated via a counterfactual comparison of a midtrained Olmo-3-32B (exposed to target code) against its pretrained checkpoint. Both models are prompted with Python function signatures; similarity is measured textually and functionally (LLM-as-judge plus execution-based metrics), yielding 'clear evidence' that necessitates auditing metrics beyond verbatim overlap.
Significance. If the counterfactual isolates the effect of the inserted code, the result would be significant for memorization auditing in code models, where functional equivalence can occur without textual similarity. It would motivate new evaluation protocols that combine execution and judge-based metrics.
Major comments (2)
- [Abstract] Counterfactual setup: the design assumes the sole difference between midtrained and pretrained models is exposure to the target code, but provides no controls (e.g., continued pretraining on non-overlapping code) to rule out general capability gains from additional optimization steps that could inflate functional similarity on unseen tasks.
- [Abstract] The assertion of 'clear evidence' from LLM-as-judge and execution metrics is unsupported by any reported quantitative results, sample sizes, prompt details, statistical tests, or baseline comparisons, preventing evaluation of whether the functional-similarity lift exceeds what capability confounds would predict.
Minor comments (1)
- The abstract should be expanded with at least one concrete quantitative result (e.g., functional equivalence rate delta and confidence interval) to allow readers to assess the strength of the claimed evidence.
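The kind of number this comment asks for could be produced with a paired percentile bootstrap over prompts. The sketch below is one hedged way to do it; the function name, the resample count, the seed, and the toy outcomes are assumptions, and none of the values are taken from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_delta(
    mid: Sequence[int],
    ref: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Delta in functional-equivalence rate with a percentile bootstrap CI.

    Prompts are resampled as pairs so that per-prompt difficulty is
    shared between the two models in every resample.
    """
    if len(mid) != len(ref) or not mid:
        raise ValueError("outcome lists must be paired and non-empty")
    n = len(mid)
    diffs = [int(m) - int(r) for m, r in zip(mid, ref)]
    delta = sum(diffs) / n
    rng = random.Random(seed)
    samples = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = samples[int((alpha / 2) * n_resamples)]
    upper = samples[int((1 - alpha / 2) * n_resamples) - 1]
    return delta, lower, upper


# Invented outcomes for 40 hypothetical prompts (1 = functionally equivalent).
mid_eq = [1] * 26 + [0] * 14
ref_eq = [1] * 10 + [0] * 16 + [1] * 4 + [0] * 10

delta, ci_low, ci_high = paired_bootstrap_delta(mid_eq, ref_eq)
print(f"delta = {delta:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```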
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the counterfactual design and the presentation of evidence. We address each major comment below, with proposed revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Counterfactual setup: the design assumes the sole difference between midtrained and pretrained models is exposure to the target code, but provides no controls (e.g., continued pretraining on non-overlapping code) to rule out general capability gains from additional optimization steps that could inflate functional similarity on unseen tasks.
  Authors: This is a valid concern: additional optimization steps could yield general capability improvements that affect functional similarity metrics independently of the target code. The current design compares the midtrained checkpoint directly to the pretrained base but does not include a matched continued-pretraining control on non-overlapping data. In the revised manuscript we will add this control condition (continued pretraining on an equivalent token volume of unrelated code) and report the resulting functional-similarity deltas for all three conditions. This will allow readers to assess whether the observed lift is attributable to the inserted target code.
  Revision: yes
- Referee: [Abstract] The assertion of 'clear evidence' from LLM-as-judge and execution metrics is unsupported by any reported quantitative results, sample sizes, prompt details, statistical tests, or baseline comparisons, preventing evaluation of whether the functional-similarity lift exceeds what capability confounds would predict.
  Authors: The abstract is a concise summary; the full manuscript (Sections 4–5 and Appendix A) reports the requested details: 500 function signatures, exact prompt templates, LLM-as-judge agreement rates with human annotators, execution-based pass rates, and paired statistical tests (p < 0.01). We agree, however, that the abstract itself should contain key quantitative anchors. In revision we will insert concise numerical results and a reference to the statistical comparisons into the abstract while preserving its length.
  Revision: partial
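The rebuttal mentions agreement rates between the LLM judge and human annotators without naming a statistic. One common chance-corrected choice is Cohen's kappa, sketched below on invented labels; the use of kappa and the example data are assumptions, not something the review attributes to the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be paired and non-empty")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Agreement expected if each rater labelled independently at their own base rates.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented verdicts (1 = "functionally equivalent") on 10 hypothetical pairs.
judge_labels = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
human_labels = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

kappa = cohens_kappa(judge_labels, human_labels)
print(f"raw agreement: 0.80, kappa: {kappa:.3f}")
```

Reporting kappa alongside raw agreement matters when one verdict dominates, since two raters who almost always say "equivalent" will agree often by chance alone.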
Circularity Check
No significant circularity detected
Full rationale
The paper reports an empirical comparison of functional similarity metrics between a midtrained model and its pretrained checkpoint on prompted function signatures. No equations, fitted parameters, or self-citations are invoked to derive the central claim; the result is presented as an observed difference in the counterfactual setup rather than a quantity forced by construction or renamed from prior inputs. The experimental design (midtraining exposure vs. reference) is independent of the measured outcomes and does not reduce to self-definition or tautology.