A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Chanchal K. Roy; Md. Abdul Awal; Mrigank Rochan

arxiv: 2511.05476 · v5 · submitted 2025-11-07 · 💻 cs.SE · cs.LG

Pith reviewed 2026-05-17 23:42 UTC · model grok-4.3

keywords knowledge distillation · metamorphic testing · language models of code · behavioral fidelity · adversarial robustness · model compression · software analytics tasks

The pith

Student models distilled from large code teachers often fail to copy their behavioral patterns, exposing much larger drops when code is adversarially changed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether smaller student models produced by knowledge distillation truly replicate the predictive behavior of their larger teacher models on code-related tasks. Standard accuracy measures suggest the students perform well, yet the authors apply metamorphic testing with behavior-preserving code transformations to probe deeper consistency. Their experiments reveal frequent mismatches, where students suffer performance losses up to 285 percent greater than the teachers when subjected to adversarial attacks. These gaps remain invisible to accuracy-only checks, leading the authors to introduce MetaCompress as a framework for systematic behavioral evaluation during distillation.

Core claim

Experiments on two code tasks using models compressed via Compressor, AVATAR, and MORPH show that MetaCompress detects up to 62 percent behavioral discrepancies between teacher and student. Under adversarial attacks the students exhibit up to 285 percent greater performance degradation than their teachers, a difference that accuracy-based evaluation alone does not reveal.
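To make the "percent greater degradation" phrasing concrete, here is a minimal sketch of one plausible way to compute such a figure. The formula, the function names `accuracy_drop` and `relative_extra_drop`, and all accuracy values are assumptions for illustration only; this summary does not state the paper's exact definition, and the numbers are invented to land on the headline ratio rather than taken from the paper.

```python
def accuracy_drop(clean_acc: float, attacked_acc: float) -> float:
    """Absolute accuracy lost under attack, in percentage points."""
    return clean_acc - attacked_acc


def relative_extra_drop(teacher_drop: float, student_drop: float) -> float:
    """How much larger the student's drop is than the teacher's, in percent.

    Assumed definition: 100 * (student_drop - teacher_drop) / teacher_drop.
    """
    if teacher_drop <= 0:
        raise ValueError("teacher_drop must be positive for a relative comparison")
    return 100.0 * (student_drop - teacher_drop) / teacher_drop


# Invented numbers: clean accuracy differs by one point, robustness differs a lot.
teacher_drop = accuracy_drop(clean_acc=90.0, attacked_acc=80.0)  # 10.0 points
student_drop = accuracy_drop(clean_acc=89.0, attacked_acc=50.5)  # 38.5 points
extra = relative_extra_drop(teacher_drop, student_drop)

print(f"student drop is {extra:.0f}% greater than teacher drop")
```

In this toy setup the two models are nearly indistinguishable on clean accuracy, which is the situation the core claim describes: the gap only shows up once the inputs are attacked.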

What carries the argument

MetaCompress, a metamorphic testing framework that applies sets of behavior-preserving transformations to input code and measures output differences between teacher and student to quantify lack of deep mimicry.
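The comparison described above can be sketched roughly as follows. This is not the MetaCompress implementation: the function `discrepancy_rate`, the rule for skipping inputs on which the teacher itself is unstable, and the toy keyword-based "models" are all hypothetical stand-ins chosen so the example runs without any trained model.

```python
import re
from typing import Callable, Sequence

Model = Callable[[str], int]      # source code -> predicted label
Transform = Callable[[str], str]  # behavior-preserving rewrite of the code


def discrepancy_rate(teacher: Model, student: Model,
                     transform: Transform, inputs: Sequence[str]) -> float:
    """Among inputs where the teacher's prediction is stable under the
    transformation, return the fraction where the student's prediction on
    the transformed input differs from the teacher's."""
    considered = 0
    violations = 0
    for code in inputs:
        mutated = transform(code)
        teacher_label = teacher(mutated)
        if teacher(code) != teacher_label:
            continue  # assumed rule: skip cases where the teacher is unstable
        considered += 1
        if student(mutated) != teacher_label:
            violations += 1
    return violations / considered if considered else 0.0


# Toy stand-ins (hypothetical): the teacher keys on the API call,
# the student has latched onto an identifier name instead.
def toy_teacher(code: str) -> int:
    return 1 if "strcpy" in code else 0


def toy_student(code: str) -> int:
    return 1 if "buf" in code else 0


def rename_buf(code: str) -> str:
    """Rename the identifier `buf` to `data` (whole-word matches only)."""
    return re.sub(r"\bbuf\b", "data", code)


samples = [
    "char buf[8]; strcpy(buf, s);",
    "int n = 0; n++;",
    "int buf = 1;",
    "char buf[4]; strcpy(buf, t);",
]
rate = discrepancy_rate(toy_teacher, toy_student, rename_buf, samples)
print(f"discrepancy rate: {rate:.2f}")
```

The toy student agrees with the teacher on three of the four original snippets, yet half of the transformed inputs expose a mismatch, because the student's decision depends on a name that the renaming removes.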

If this is right

Distillation pipelines for code models need behavioral-fidelity checks in addition to accuracy measurements.
Adversarial robustness of the resulting student models can be substantially weaker than accuracy figures imply.
Compressed code models require evaluation methods that test consistency under input transformations rather than single-point predictions.
MetaCompress offers a concrete way to insert behavioral testing directly into the knowledge-distillation workflow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same metamorphic-testing lens could expose similar mimicry shortfalls when distilling models for natural-language or other software tasks.
If behavioral gaps prove widespread, then redesigning distillation objectives to preserve robustness properties may become necessary.
Teams deploying distilled code models should run separate robustness tests instead of assuming teacher-level behavior carries over.
Certain distillation techniques might preserve behavioral fidelity better than others, opening a route for comparative studies.

Load-bearing premise

The chosen metamorphic relations really leave the expected model output unchanged for the studied code tasks, so any observed difference signals a genuine failure of mimicry rather than a flawed test.
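One way to probe this premise is to check that a transformation leaves program behavior unchanged before using it to compare models. The sketch below does this for identifier renaming on a Python snippet, using the standard `ast` module. It is an illustration under stated assumptions: the class `RenameIdentifier` and the helpers `rename` and `same_behavior` are hypothetical, the paper's subject programs may be in other languages, and agreement on a few test inputs is evidence of equivalence rather than proof.

```python
import ast


class RenameIdentifier(ast.NodeTransformer):
    """Rename every occurrence of one variable or parameter name."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename(source: str, old: str, new: str) -> str:
    """Return the source with `old` renamed to `new` (needs Python 3.9+)."""
    tree = RenameIdentifier(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def same_behavior(src_a: str, src_b: str, func: str, cases) -> bool:
    """Run `func` from both sources on each argument tuple and compare results."""
    env_a, env_b = {}, {}
    exec(src_a, env_a)
    exec(src_b, env_b)
    return all(env_a[func](*args) == env_b[func](*args) for args in cases)


original = """
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""
mutated = rename(original, "acc", "running_sum")
ok = same_behavior(original, mutated, "total", [([1, 2, 3],), ([],), ([-5, 5],)])

print(mutated)
print("behavior preserved on test cases:", ok)
```

Note that this only checks program semantics. Whether a model's expected output should also be invariant under the same rewrite is a separate, task-specific question, which is exactly what the premise above takes for granted.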

What would settle it

Finding that teacher and student models produce matching outputs on all applied metamorphic relations and show identical performance drops under the same adversarial attacks would contradict the reported discrepancies.

Figures

Figures reproduced from arXiv: 2511.05476 by Chanchal K. Roy, Md. Abdul Awal, Mrigank Rochan.

**Figure 2.** Attack success rate metric values across adversarial attack techniques, models, tasks, and …

**Figure 3.** Workflow of the MetaCompress framework, comparing teacher and student model outputs under behavior-preserving metamorphic relations to assess behavioral fidelity.

**Figure 4.** Behavioral fidelity discrepancies between the outputs of teacher and student models.

**Figure 5.** MR1 violation rates for different tasks, models, and knowledge distillation techniques.

**Figure 6.** MR2 violation rates for different tasks, models, and knowledge distillation techniques.

**Figure 7.** Boxplots of KL divergence illustrating output distribution shifts between teacher and student …
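Since Figure 7 reports KL divergence between teacher and student outputs, a short sketch of that quantity for two discrete class distributions may help. The direction KL(teacher || student), the `eps` guard, and the probability values are assumptions for illustration; the caption does not say which direction or smoothing the paper uses.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same classes.

    Terms with p_i == 0 contribute nothing; q_i is floored at `eps` to avoid
    division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Invented softmax outputs for one binary-classification input.
teacher_probs = [0.9, 0.1]
student_probs = [0.6, 0.4]

shift = kl_divergence(teacher_probs, student_probs)
self_shift = kl_divergence(teacher_probs, teacher_probs)

print(f"KL(teacher || student) = {shift:.4f} nats")
print(f"KL(teacher || teacher) = {self_shift:.4f} nats")
```

Both toy models predict the same class here, so a label-level comparison would report agreement, while the divergence is clearly nonzero. That is the kind of distribution-level shift a boxplot of KL values can surface.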

Abstract

Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies metamorphic testing to show that distilled code models can have large behavioral gaps missed by accuracy, but the relations need explicit validation to back the discrepancy claims.

The main point is that accuracy numbers hide real differences in how student models behave compared to their teachers after distillation. The authors report that students lose up to 285% more performance than their teachers under adversarial attacks and that MetaCompress flags up to 62% behavioral discrepancies, both of which traditional metrics miss. These results come from running MetaCompress on compressed versions of models like CodeBERT, produced with Compressor, AVATAR, and MORPH, across two tasks.

The framework itself is the new piece: it takes existing metamorphic relations and turns them into a systematic check for behavioral fidelity in this setting. Prior distillation papers in code focus on size and accuracy, so adding this lens is a clear step. The paper does well at keeping the evaluation practical and tied to deployment concerns like speed and energy use. The numbers give a concrete sense of the scale of the problem rather than just saying the issue exists.

The soft spot sits with the metamorphic relations. The central claim needs those transformations to be genuinely behavior-preserving for the specific models and tasks; otherwise the output differences could come from the test design instead of distillation shortfalls. The abstract and setup do not show detailed validation steps, oracle checks, or why relations like identifier changes or reordering preserve semantics here. That assumption carries a lot of weight, and without it the percentages are harder to interpret. Minor points include wanting clearer data splits and any statistical tests, but the relations are the bigger one.

This is for software engineering researchers who build or evaluate compressed code models. A reader focused on evaluation methods beyond accuracy would get practical value from the framework idea. It deserves a serious referee because the question is relevant and the empirical direction is defined enough to review and improve.

Referee Report

2 major / 2 minor

Summary. The paper claims that knowledge distillation for language models of code produces student models that fail to deeply mimic teacher behavior, as traditional accuracy metrics overlook substantial discrepancies; using the proposed MetaCompress metamorphic testing framework, it reports up to 62% behavioral discrepancies and up to 285% greater performance drops under adversarial attacks for students versus teachers across two tasks and three distillation methods (Compressor, AVATAR, MORPH) on models such as CodeBERT.

Significance. If the metamorphic relations prove valid and the empirical results are robust, this would be a meaningful contribution to software engineering by demonstrating the inadequacy of accuracy-only evaluation for compressed code models and providing a practical metamorphic testing framework to assess behavioral fidelity, which could guide more reliable distillation practices and deployment of efficient code models.

major comments (2)

[Abstract] Abstract: The central empirical claim—that output differences under the metamorphic relations indicate failures of deep mimicry rather than artifacts—depends on the relations (e.g., variable renaming, statement reordering, identifier substitution) being behavior-preserving for the specific models and tasks. No validation details, semantic equivalence checks, or oracle procedures are described, which is load-bearing for attributing the reported 62% discrepancies and 285% adversarial drops to distillation quality.
[Evaluation] Evaluation section (inferred from results reporting): The quantitative claims of up to 285% greater performance drop and 62% discrepancies require supporting details on data splits, statistical significance testing, number of experimental runs, and adversarial attack generation methodology to establish that post-hoc choices did not inflate the differences between teacher and student models.

minor comments (2)

[Abstract] Abstract: Briefly list the specific metamorphic relations and the two tasks studied to give readers immediate context without needing to consult the full methods.
[Throughout] Notation: Ensure 'MetaCompress' is consistently introduced as the framework name and that terms like 'behavioral fidelity' and 'deep mimicry' are defined on first use to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each major comment in detail below and will revise the paper to incorporate additional information where it strengthens the work without altering our core claims.

Referee: [Abstract] Abstract: The central empirical claim—that output differences under the metamorphic relations indicate failures of deep mimicry rather than artifacts—depends on the relations (e.g., variable renaming, statement reordering, identifier substitution) being behavior-preserving for the specific models and tasks. No validation details, semantic equivalence checks, or oracle procedures are described, which is load-bearing for attributing the reported 62% discrepancies and 285% adversarial drops to distillation quality.

Authors: We agree that explicit validation of the metamorphic relations is important for rigorously attributing discrepancies to distillation rather than transformation artifacts. The relations in MetaCompress are selected from prior metamorphic testing work on code (e.g., variable renaming and reordering have been validated as semantics-preserving in studies on code clone detection and bug detection). In the revision, we will add a new subsection under Methodology describing the oracle procedures used: for each relation, we verify that the teacher model produces equivalent outputs on original and transformed inputs for a held-out validation set, supplemented by manual semantic checks on a random sample of 100 cases per task. This will clarify that the relations are behavior-preserving for the evaluated models and tasks. revision: yes
Referee: [Evaluation] Evaluation section (inferred from results reporting): The quantitative claims of up to 285% greater performance drop and 62% discrepancies require supporting details on data splits, statistical significance testing, number of experimental runs, and adversarial attack generation methodology to establish that post-hoc choices did not inflate the differences between teacher and student models.

Authors: We acknowledge that additional experimental details are needed for full reproducibility and to confirm robustness. The manuscript summarizes the evaluation but does not explicitly report all parameters. In the revision, we will expand the Evaluation section with: precise data split descriptions (e.g., 80/10/10 ratios with random seeds), statistical significance tests (paired t-tests with p-values reported for teacher-student differences), number of runs (five independent runs with different random seeds, reporting means and standard deviations), and adversarial attack details (using a combination of rule-based perturbations and an adapted TextAttack framework with fixed hyperparameters). These additions will demonstrate that the reported gaps are consistent across runs and not inflated by selective reporting. revision: yes
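The paired test over five runs that the simulated response mentions can be sketched with the standard library alone. Everything concrete here is assumed: the helper `paired_t_statistic` is hypothetical, and the per-run performance drops are synthetic values made up to show the arithmetic, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic for (a - b) and its degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Synthetic per-seed performance drops under attack, in percentage points.
teacher_drops = [10.0, 11.0, 9.0, 10.0, 10.0]
student_drops = [30.0, 33.0, 28.0, 31.0, 28.0]

t_stat, dof = paired_t_statistic(student_drops, teacher_drops)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

With 4 degrees of freedom the two-sided 5% critical value of Student's t is about 2.776, so a statistic far above that would indicate a consistent teacher-student gap across seeds. With only five runs, a non-parametric alternative such as the Wilcoxon signed-rank test is also worth considering.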

Circularity Check

0 steps flagged

No circularity in empirical framework and evaluation

The paper is an empirical study that proposes the MetaCompress metamorphic testing framework and evaluates it on existing knowledge distillation techniques (Compressor, AVATAR, MORPH) applied to code models such as CodeBERT. Central claims rest on observed performance drops under adversarial attacks and output discrepancies under chosen metamorphic relations, which are external to any author-defined fitting or self-referential equations. No derivation chain reduces predictions or results to inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked to force the outcomes. The work is self-contained against the reported benchmarks and tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that metamorphic relations preserve expected model behavior for code tasks and that adversarial attacks expose meaningful fidelity gaps. No free parameters or invented entities are introduced beyond the new testing framework itself.

axioms (1)

Domain assumption: the metamorphic relations used are behavior-preserving transformations for the models under test.
Invoked when defining the test oracles that compare teacher and student outputs.

invented entities (1)

MetaCompress framework (no independent evidence)
purpose: Systematic evaluation of behavioral fidelity via metamorphic relations
New testing approach proposed in the paper

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

We propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey
cs.SE · 2026-05 · accept · novelty 4.0

A systematic survey of 93 studies that maps the bidirectional relationship between metamorphic testing and LLMs, proposing a taxonomy for MT applied to LLMs and LLMs applied to MT.

Knowledge Distillation Must Account for What It Loses
cs.LG · 2026-04 · unverdicted · novelty 4.0

Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.

Knowledge Distillation Must Account for What It Loses
cs.LG · 2026-04 · unverdicted · novelty 4.0

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017)

work page 2017
[2]

Arshad, S

S. Arshad, S. Abid, S. Shamail, Codebert for code clone detection: A replication study, in: 16th IWSC, IEEE, 2022, pp. 39–45

work page 2022
[3]

Ahmed, K

T. Ahmed, K. S. Pai, P. Devanbu, E. Barr, Automatic semantic augmentation of lan- guage model prompts (for code summarization), in: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–13

work page 2024
[4]

Y. Ding, Y. Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair, D. Wagner, B. Ray, Y. Chen, Vulnerability detection with code language models: How far are we?, arXiv preprint arXiv:2403.18624 (2024)

work page arXiv 2024
[5]

J. Chen, X. Hu, Z. Li, C. Gao, X. Xia, D. Lo, Code search is all you need? im- proving code suggestions with code search, in: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–13

work page 2024
[6]

J. Shi, Z. Yang, B. Xu, H. J. Kang, D. Lo, Compressing pre-trained models of code into 3 mb, in: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, Association for Computing Machinery, New York, NY, USA, 2023.doi:10.1145/3551349.3556964. URLhttps://doi.org/10.1145/3551349.3556964

work page doi:10.1145/3551349.3556964 2023
[7]

Schwartz, J

R. Schwartz, J. Dodge, N. A. Smith, O. Etzioni, Green ai, Communications of the ACM 63 (12) (2020) 54–63

work page 2020
[8]

V. J. Hellendoorn, S. Proksch, H. C. Gall, A. Bacchelli, When code completion fails: A case study on real-world completions, in: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, 2019, pp. 960–970

work page 2019
[9]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

J. Shi, Z. Yang, H. J. Kang, B. Xu, J. He, D. Lo, Greening large language models of code, in: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Society, ICSE-SEIS’24, Association for Computing Machin- ery, New York, NY, USA, 2024, p. 142–153.doi:10.1145/3639475.3640097. URLhttps://doi-org.libproxy.smu.edu.sg...

work page doi:10.1145/3639475.3640097 2024
[11]

J. Shi, Z. Yang, D. Lo, Efficient and green large language models for software en- gineering: Vision and the road ahead, ACM Transactions on Software Engineering and Methodology (2024). 27

work page 2024
[12]

X. Wei, S. K. Gonugondla, S. Wang, W. Ahmad, B. Ray, H. Qian, X. Li, V. Ku- mar, Z. Wang, Y. Tian, et al., Towards greener yet powerful code generation via quantization: An empirical study, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 224–236

work page 2023
[13]

M. Saad, J. A. H. López, B. Chen, D. Varró, T. Sharma, Alpine: An adaptive language-agnostic pruning method for language models for code, arXiv preprint arXiv:2407.04147 (2024)

work page arXiv 2024
[14]

arXiv preprint arXiv:2412.13737 (2024)

G. d’Aloisio, L. Traini, F. Sarro, A. Di Marco, On the compression of language models for code: An empirical study on codebert, arXiv preprint arXiv:2412.13737 (2024)

work page arXiv 2024
[15]

Y. Chen, Y. Ye, Z. Li, Y. Ma, C. Gao, Smaller but better: Self-paced knowledge distillation for lightweight yet effective lcms, Proceedings of the ACM on Software Engineering 2 (FSE) (2025) 3057–3080

work page 2025
[16]

Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., Codebert: A pre-trained model for programming and natural lan- guages, arXiv:2002.08155 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2002
[17]

Z. Yang, J. Shi, J. He, D. Lo, Natural attack for pre-trained models of code, in: Proceedings of the 44th ICSE, 2022, pp. 1482–1493

work page 2022
[18]

Zhang, Z

H. Zhang, Z. Li, G. Li, L. Ma, Y. Liu, Z. Jin, Generating adversarial examples for holding robustness of source code processing models, in: Proceedings of the AAAI Conference on AI, Vol. 34, 2020, pp. 1169–1176

work page 2020
[19]

Z. Zeng, H. Tan, H. Zhang, J. Li, Y. Zhang, L. Zhang, An extensive study on pre- trained models for program understanding and generation, in: 31st ACM SIGSOFT ISSTA, 2022, pp. 39–51

work page 2022
[20]

T. Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. Tse, Z. Q. Zhou, Metamor- phic testing: A review of challenges and opportunities, ACM Computing Surveys (CSUR) 51 (1) (2018) 1–27

work page 2018
[21]

D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svy- atkovskiy, S. Fu, et al., Graphcodebert: Pre-training code representations with data flow, arXiv:2009.08366 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2009
[22]

A. Panichella, Metamorphic-based many-objective distillation of llms for code- related tasks, in: 2025 IEEE/ACM 47th International Conference on Software En- gineering (ICSE), IEEE Computer Society, 2025, pp. 766–766

work page 2025
[23]

S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang, et al., Codexglue: A machine learning benchmark dataset for code understanding and generation, arXiv:2102.04664 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

W. U. Ahmad, S. Chakraborty, B. Ray, K.-W. Chang, Unified pre-training for pro- gram understanding and generation, arXiv:2103.06333 (2021). 28

work page arXiv 2021
[25]

Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation, arXiv:2109.00859 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, Bert: Pre-training of deep bidirectional transformers for language under- standing, arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y. Liu, Roberta: A robustly optimized bert pretraining approach, arXiv:1907.11692 364 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1907
[28]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

V. Sanh, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, arXiv:1910.01108 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1910
[29]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, M. Brockschmidt, Codesearchnet challenge: Evaluating the state of semantic code search, arXiv:1909.09436 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1909
[30]

X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, ACM Transactions on Software Engineering and Methodology 33 (8) (2024) 1–79

work page 2024
[31]

StarCoder 2 and The Stack v2: The Next Generation

A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al., Starcoder 2 and the stack v2: The next generation, arXiv preprint arXiv:2402.19173 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

X. Liu, X. Liu, L. Bo, X. Wu, Y. Yang, X. Sun, F. Zhou, Pioneer: improving the robustness of student models when compressing pre-trained models of code, Automated Software Engineering 33 (1) (2026) 1–30

work page 2026
[34]

C. Xu, J. McAuley, A survey on model compression and acceleration for pretrained language models, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 10566–10575

work page 2023
[35]

V. Sanh, T. Wolf, A. Rush, Movement pruning: Adaptive sparsity by fine-tuning, Advances in neural information processing systems 33 (2020) 20378–20389

work page 2020
[36]

Zafrir, G

O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat, Q8bert: Quantized 8bit bert, in: 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), IEEE, 2019, pp. 36–39

work page 2019
[37]

Elsken, J

T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: A survey, Journal of Machine Learning Research 20 (55) (2019) 1–21

work page 2019
[38]

2285–2294

W.Chen, J.Wilson, S.Tyree, K.Weinberger, Y.Chen, Compressingneuralnetworks with the hashing trick, in: International conference on machine learning, PMLR, 2015, pp. 2285–2294. 29

work page 2015
[39]

T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, B. Ramabhadran, Low-rank matrix factorization for deep neural network training with high-dimensional out- put targets, in: 2013 IEEE international conference on acoustics, speech and signal processing, IEEE, 2013, pp. 6655–6659

work page 2013
[40]

Afrin, B

S. Afrin, B. Xu, A. Mastropaolo, Is quantization a deal-breaker? empirical insights from large code models, arXiv preprint arXiv:2507.09665 (2025)

work page arXiv 2025
[41]

Zhang, J

J. Zhang, J. Chen, D. Hao, Y. Xiong, B. Xie, L. Zhang, H. Mei, Search-based infer- ence of polynomial metamorphic relations, in: Proceedings of the 29th ACM/IEEE international conference on Automated software engineering, 2014, pp. 701–712

work page 2014
[42]

B. Yang, M. A. A. Mamun, J. M. Zhang, G. Uddin, Hallucination detection in large language models with metamorphic relations, arXiv preprint arXiv:2502.15844 (2025)

work page arXiv 2025
[43]

X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, T. Y. Chen, Testing and validating machinelearningclassifiersbymetamorphictesting, JournalofSystemsandSoftware 84 (4) (2011) 544–558

work page 2011
[44]

T. Y. Chen, S. C. Cheung, S. M. Yiu, Metamorphic testing: a new approach for generating next test cases, arXiv preprint arXiv:2002.12543 (2020)

work page arXiv 2002
[45]

D. Xiao, Z. Liu, Y. Yuan, Q. Pang, S. Wang, Metamorphic testing of deep learning compilers, Proceedings of the ACM on Measurement and Analysis of Computing Systems 6 (1) (2022) 1–28

work page 2022
[46]

Segura, J

S. Segura, J. A. Parejo, J. Troya, A. Ruiz-Cortés, Metamorphic testing of restful web apis, in: Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 882–882

work page 2018
[47]

Intriguing properties of neural networks

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv:1312.6199 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[48]

Zhang, Z

H. Zhang, Z. Fu, G. Li, L. Ma, Z. Zhao, H. Yang, Y. Sun, Y. Liu, Z. Jin, To- wards robustness of deep program processing models—detection, estimation, and en- hancement, ACMTransactionsonSoftwareEngineeringandMethodology(TOSEM) 31 (3) (2022) 1–40

work page 2022
[49]

Kullback, R

S. Kullback, R. A. Leibler, On information and sufficiency, The annals of mathemat- ical statistics 22 (1) (1951) 79–86

work page 1951
[50]

Svajlenko, J

J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, M. M. Mia, Towards a big data curated benchmark of inter-project code clones, in: 2014 IEEE International Con- ference on Software Maintenance and Evolution, IEEE, 2014, pp. 476–480

work page 2014
[51]

W. Wang, G. Li, B. Ma, X. Xia, Z. Jin, Detecting code clones with graph neural network and flow-augmented abstract syntax tree, in: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2020, pp. 261–271. 30

work page 2020
[52]

Y. Zhou, S. Liu, J. Siow, X. Du, Y. Liu, Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks, Advances in NIPS 32 (2019)

work page 2019
[53]

X. Du, M. Wen, Z. Wei, S. Wang, H. Jin, An extensive study on adversarial attack against pre-trained models of code, in: 31st FSE, 2023, pp. 489–501

work page 2023
[54]

Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the american statistical association 32 (200) (1937) 675–701

M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the american statistical association 32 (200) (1937) 675–701

work page 1937
[55]

R.F.Woolson, Wilcoxonsigned-ranktest, Wileyencyclopediaofclinicaltrials(2007) 1–3

work page 2007
[56]

W. E. Wong, J. R. Horgan, S. London, H. Agrawal, A study of effective regression testing in practice, in: PROCEEDINGS The Eighth International Symposium On Software Reliability Engineering, IEEE, 1997, pp. 264–274

[57]

Y. Jia, M. Harman, An analysis and survey of the development of mutation testing, IEEE transactions on software engineering 37 (5) (2010) 649–678

[58]

X. Xie, Z. Zhang, T. Y. Chen, Y. Liu, P.-L. Poon, B. Xu, Mettle: A metamorphic testing approach to assessing and validating unsupervised machine learning systems, IEEE Transactions on Reliability 69 (4) (2020) 1293–1322

[59]

Z. Sun, Z. Chen, J. Zhang, D. Hao, Fairness testing of machine translation systems, ACM Transactions on Software Engineering and Methodology 33 (6) (2024) 1–27

[60]

W.-C. Kwan, X. Zeng, Y. Jiang, Y. Wang, L. Li, L. Shang, X. Jiang, Q. Liu, K.-F. Wong, Mt-eval: A multi-turn capabilities evaluation benchmark for large language models, arXiv preprint arXiv:2401.16745 (2024)

[61]

J. M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: Survey, landscapes and horizons, IEEE Transactions on Software Engineering 48 (1) (2020) 1–36

[62]

W. Wu, Y. Cao, N. Yi, R. Ou, Z. Zheng, Detecting and reducing the factual hallucinations of large language models with metamorphic testing, Proceedings of the ACM on Software Engineering 2 (FSE) (2025) 1432–1453

[63]

Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444

[64]

Y. Shang, G. Liu, R. R. Kompella, Y. Yan, Enhancing post-training quantization calibration through contrastive learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15921–15930

[65]

Y. Ji, Y. Xiang, J. Li, Q. Xia, P. Li, X. Duan, Z. Wang, M. Zhang, Beware of calibration data for pruning large language models, arXiv preprint arXiv:2410.17711 (2024)

[66]

J. Ba, R. Caruana, Do deep nets really need to be deep?, Advances in neural information processing systems 27 (2014)

[67]


L. Wang, K.-J. Yoon, Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks, IEEE transactions on pattern analysis and machine intelligence 44 (6) (2021) 3048–3068

[68]

R. Wang, Z. Yang, C. Gao, X. Xia, Q. Liao, An empirical study of knowledge distillation for code understanding tasks, arXiv preprint arXiv:2508.15423 (2025)

[69]

X. Zhu, J. Li, Y. Liu, C. Ma, W. Wang, A survey on model compression for large language models, Transactions of the Association for Computational Linguistics 12 (2024) 1556–1577

[70]

M. Xu, D. Cai, W. Yin, S. Wang, X. Jin, X. Liu, Resource-efficient algorithms and systems of foundation models: A survey, ACM Computing Surveys 57 (5) (2025) 1–39

[71]

J. Xu, W. Zhou, Z. Fu, H. Zhou, L. Li, A survey on green deep learning, arXiv preprint arXiv:2111.05193 (2021)

[72]

M. A. Gordon, K. Duh, N. Andrews, Compressing bert: Studying the effects of weight pruning on transfer learning, arXiv preprint arXiv:2002.08307 (2020)

[73]

C. Xu, W. Zhou, T. Ge, K. Xu, J. McAuley, F. Wei, Beyond preserved accuracy: Evaluating loyalty and robustness of bert compression, arXiv preprint arXiv:2109.03228 (2021)

[74]

S. Ye, K. Xu, S. Liu, H. Cheng, J.-H. Lambrechts, H. Zhang, A. Zhou, K. Ma, Y. Wang, X. Lin, Adversarial robustness vs. model compression, or both?, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 111–120

[75]

M. Du, S. Mukherjee, Y. Cheng, M. Shokouhi, X. Hu, A. H. Awadallah, Robustness challenges in model distillation and pruning for natural language understanding, arXiv preprint arXiv:2110.08419 (2021)

[76]

S. K. Gourtani, N. Meratnia, Improving robustness of compressed models with weight sharing through knowledge distillation, in: 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud (EdgeCom), IEEE, 2024, pp. 13–21

[77]

J. Zhu, L. Wang, X. Han, Safety and performance, why not both? bi-objective optimized model compression toward ai software deployment, in: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–13

[78]

S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for bert model compression, arXiv preprint arXiv:1908.09355 (2019)

[79]

X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, Q. Liu, Tinybert: Distilling bert for natural language understanding, arXiv preprint arXiv:1909.10351 (2019)

[80]

C. Buciluǎ, R. Caruana, A. Niculescu-Mizil, Model compression, in: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541
