pith. machine review for the scientific record.

arxiv: 2511.11439 · v2 · submitted 2025-11-14 · 💻 cs.LG · cs.AI

Retrofit: Continual Learning with Controlled Forgetting for Binary Security Detection and Analysis

Pith reviewed 2026-05-17 21:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · controlled forgetting · malware detection · binary summarization · parameter merging · security analysis · deep learning

The pith

RETROFIT lets security models adapt to new threats while retaining prior knowledge by merging old and new models without replaying data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RETROFIT to update deep learning models for binary security tasks such as malware detection and program summarization as threats and code representations evolve over time. It merges a previously trained model with a newly fine-tuned one through retrospective-free parameter merging, without storing or accessing historical data. Forgetting is controlled by restricting updates to low-rank and sparse subspaces that maintain approximate orthogonality and by using a confidence-guided arbitration step to blend outputs from the legacy and new models. A sympathetic reader would care because security systems must handle shifting malware behaviors and stripped binaries, yet replay methods raise privacy and storage issues in sensitive environments.
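The arbitration step described above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the paper's rule: confidence is taken to be the maximum softmax probability, the blend is a convex combination weighted by relative confidence, and the names `softmax`, `arbitrate`, and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def arbitrate(legacy_logits, new_logits):
    """Blend two teachers' class probabilities by relative confidence.

    Assumption: confidence = max softmax probability. The actual
    arbitration mechanism in the paper may be defined differently.
    """
    p_old = softmax(legacy_logits)
    p_new = softmax(new_logits)
    c_old, c_new = max(p_old), max(p_new)
    w_old = c_old / (c_old + c_new)  # weight given to the legacy teacher
    return [w_old * a + (1.0 - w_old) * b for a, b in zip(p_old, p_new)]


# Hypothetical logits for the classes [benign, malicious]:
# the legacy teacher is confident, the new teacher is nearly undecided.
p_legacy = softmax([4.0, 0.0])
p_new = softmax([0.2, 0.0])
blended = arbitrate([4.0, 0.0], [0.2, 0.0])
plain_average = (p_legacy[0] + p_new[0]) / 2.0
```

In this toy case the blended probability for the first class sits between the two teachers' values and closer to the confident legacy teacher than a plain average would be, which is the behaviour the description of "confidence-guided" blending suggests.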

Core claim

RETROFIT regulates knowledge retention and adaptation with controlled forgetting at each update by consolidating previously trained and newly fine-tuned models as teachers of legacy and emergent knowledge through retrospective-free parameter merging, with forgetting control achieved by constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality and employing a confidence-guided arbitration mechanism to dynamically aggregate knowledge from both teachers.

What carries the argument

Retrospective-free parameter merging constrained to low-rank and sparse subspaces for approximate orthogonality, combined with a confidence-guided arbitration mechanism to aggregate legacy and new knowledge.

If this is right

  • In malware detection under temporal drift, retention score rises from 20.2% to 38.6% over continual learning baselines.
  • Performance on new data exceeds the oracle upper bound.
  • In binary summarization across decompilation levels, the BLEU score is more than double that of the transfer learning used in prior work.
  • Cross-representation generalization surpasses all baselines when analyzing stripped binaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same merging approach could apply to other data-private continual learning settings such as updating fraud detectors without retaining past transactions.
  • If new threats demand updates outside the low-rank subspace, adaptability might degrade in domains with very rapid change.
  • Additional validation on vulnerability detection or different binary formats would test whether controlled forgetting holds beyond the two evaluated tasks.
  • Pairing the arbitration step with ensemble techniques might further strengthen knowledge consolidation.

Load-bearing premise

That constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality combined with confidence-guided arbitration will reliably control forgetting and enable effective knowledge consolidation without access to historical data or explicit replay.

What would settle it

A temporal malware detection experiment where RETROFIT shows retention below 30% or fails to exceed the oracle upper bound on new-data accuracy while preserving adaptability.

Figures

Figures reproduced from arXiv: 2511.11439 by Hongyu She, Junchi Lei, Lorenzo Cavallaro, Shuo Shao, Xinran Zheng, Yiling He, Yiping Liu, Zhan Qin.

Figure 1: Examples of the benefits and insufficiencies of CL in security
Figure 2: Continual learning for addressing temporal shift (left) and representation shift (right) in security applications.
Figure 3: Design insights for accumulating old and new knowledge at
Figure 4: Cumulative CL comparison in malware detection.
Figure 5: Comparison with existing CL methods in binary analysis.
Figure 7: Improvements in retention and adaptation performance when
Figure 6: Ablation of our adaptive merging strategy and low-rank constraint.
Figure 9: Summary comparison for challenging stripped binaries, where
Original abstract

Binary security has increasingly relied on deep learning to reason about malware behavior and program semantics. However, the performance often degrades as threat landscapes evolve and code representations shift. While continual learning (CL) offers a natural solution through sequential updates, most existing approaches rely on data replay or unconstrained updates, limiting their applicability and effectiveness in data-sensitive security environments. We propose RETROFIT, which regulates knowledge retention and adaptation with controlled forgetting at each update, without requiring historical data. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of legacy and emergent knowledge, through retrospective-free parameter merging. Forgetting control is achieved by 1) constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality, and 2) employing a confidence-guided arbitration mechanism to dynamically aggregate knowledge from both teachers. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves over 2x the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RETROFIT, a continual learning framework for binary security tasks that performs retrospective-free merging of a legacy model and a newly fine-tuned model. Forgetting is controlled by constraining updates to low-rank and sparse subspaces (intended to produce approximate orthogonality) and by a confidence-guided arbitration mechanism that aggregates knowledge from both teachers without access to historical data. On malware detection under temporal drift, the paper reports retention-score gains from 20.2% to 38.6% and new-data performance that exceeds the oracle; on binary summarization across decompilation levels, it claims more than 2x the BLEU score of prior transfer-learning baselines and better cross-representation generalization.

Significance. If the central mechanism is shown to be sound, the work would offer a practical route to continual adaptation in data-sensitive security settings where replay is prohibited. The reported numerical gains are concrete and the two-task evaluation (temporal malware drift and multi-level decompilation) is relevant to the domain.

major comments (3)
  1. [§3.2] Parameter Merging and Subspace Constraints: The claim that restricting updates to low-rank and sparse subspaces produces 'approximate orthogonality' between legacy and new parameters is not supported by the given construction. Low-rank or sparse deltas can still have non-zero inner products with prior parameter directions; no explicit projection (e.g., Gram-Schmidt or orthogonal complement projection) is described. This step is load-bearing for the forgetting-control argument and must be supplied or the claim revised.
  2. [§4.1, Table 1] Malware Detection Results: The retention-score improvement from 20.2% to 38.6% and the claim of exceeding the oracle upper bound on new data are presented without error bars, number of runs, or statistical tests. Because the central claim is that RETROFIT 'consistently mitigates forgetting,' these omissions make it impossible to judge whether the gains are reliable or merely point estimates.
  3. [§4.2] Binary Summarization: The assertion that RETROFIT surpasses all baselines in cross-representation generalization relies on BLEU scores that are more than double those of transfer learning. The paper must clarify whether the same low-rank/sparse constraint and arbitration are applied uniformly across decompilation levels and whether any representation-specific hyper-parameters were tuned; otherwise the generalization claim rests on an under-specified experimental protocol.
minor comments (2)
  1. [Abstract] The abstract states that RETROFIT 'exceeds the oracle upper bound on new data'; this counter-intuitive result should be explained in the main text with a precise definition of the oracle.
  2. [§3] Notation for the two teachers (legacy vs. emergent) and the arbitration weights should be introduced once and used consistently; current usage mixes 'teacher' and 'model' terminology.
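
The first major comment can be checked on a toy case. The sketch below is illustrative only: it uses the Frobenius inner product against the legacy weight matrix as one possible notion of overlap (the paper may intend a different one), and the matrices and the helper names `outer`, `frob_inner`, and `project_out` are hypothetical. It shows that a rank-1, single-entry update still overlaps the legacy weights, and that a single Gram-Schmidt step removes that overlap.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows (always rank <= 1)."""
    return [[ui * vj for vj in v] for ui in u]


def frob_inner(a, b):
    """Frobenius inner product: sum of element-wise products."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def project_out(delta, w):
    """Remove from delta its component along w (one Gram-Schmidt step)."""
    coeff = frob_inner(delta, w) / frob_inner(w, w)
    return [[d - coeff * x for d, x in zip(rd, rw)]
            for rd, rw in zip(delta, w)]


# Hypothetical 2x2 legacy weights and a rank-1, sparse update.
w_old = [[1.0, 2.0], [0.0, 1.0]]
delta = outer([1.0, 0.0], [1.0, 0.0])  # one non-zero entry, rank 1

overlap_before = frob_inner(delta, w_old)
projected = project_out(delta, w_old)
overlap_after = frob_inner(projected, w_old)
nonzeros_after = sum(1 for row in projected for x in row if x != 0.0)
```

Note the trade-off the toy case exposes: after projection the update is orthogonal to the legacy weights in this sense, but it is no longer a single-entry sparse update, so sparsity and explicit orthogonality are separate constraints that need not hold together.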

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological clarity and experimental reporting. We address each major comment below and have made revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] Parameter Merging and Subspace Constraints: The claim that restricting updates to low-rank and sparse subspaces produces 'approximate orthogonality' between legacy and new parameters is not supported by the given construction. Low-rank or sparse deltas can still have non-zero inner products with prior parameter directions; no explicit projection (e.g., Gram-Schmidt or orthogonal complement projection) is described. This step is load-bearing for the forgetting-control argument and must be supplied or the claim revised.

    Authors: We agree that the original phrasing overstated the guarantee provided by the construction. The low-rank and sparse constraints limit update capacity and thereby reduce the potential for parameter interference, but they do not enforce orthogonality without an explicit projection step. In the revised manuscript we have removed the term 'approximate orthogonality' from §3.2, replaced it with a description of dimensionality reduction and empirical interference control, and added a short discussion of the observed forgetting mitigation. revision: yes

  2. Referee: [§4.1, Table 1] Malware Detection Results: The retention-score improvement from 20.2% to 38.6% and the claim of exceeding the oracle upper bound on new data are presented without error bars, number of runs, or statistical tests. Because the central claim is that RETROFIT 'consistently mitigates forgetting,' these omissions make it impossible to judge whether the gains are reliable or merely point estimates.

    Authors: We accept that statistical reporting was insufficient. The revised version reports results averaged over five independent runs with standard-deviation error bars in Table 1 and includes a statistical significance analysis (paired t-tests, p < 0.05) confirming the retention gains. We have also clarified that the oracle comparison on new data reflects the arbitration mechanism's focus on current-task performance rather than a violation of the upper bound. revision: yes

  3. Referee: [§4.2] Binary Summarization: The assertion that RETROFIT surpasses all baselines in cross-representation generalization relies on BLEU scores that are more than double those of transfer learning. The paper must clarify whether the same low-rank/sparse constraint and arbitration are applied uniformly across decompilation levels and whether any representation-specific hyper-parameters were tuned; otherwise the generalization claim rests on an under-specified experimental protocol.

    Authors: The low-rank/sparse constraints and confidence-guided arbitration are applied uniformly across all decompilation levels. Hyper-parameters (rank, sparsity, arbitration thresholds) were selected once via validation-set search for the overall task and held fixed; no per-representation tuning was performed. The revised §4.2 and a new appendix table now document the exact hyper-parameter values and confirm the uniform protocol. revision: yes
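
The statistical reporting promised in the second response (five runs, paired t-tests) amounts to a small calculation, sketched here with the standard library only. The per-run scores are placeholders for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper name. Instead of computing a p-value, the sketch compares the statistic with the two-sided 5% critical value of Student's t distribution for 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference / standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n))


# Placeholder per-run scores for illustration only, NOT taken from the paper.
method_runs = [0.62, 0.58, 0.61, 0.60, 0.59]
baseline_runs = [0.50, 0.49, 0.52, 0.51, 0.48]

t_stat = paired_t_statistic(method_runs, baseline_runs)

# Two-sided 5% critical value of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing only makes sense if run i of the method and run i of the baseline share a seed or data split; with unpaired runs, an unpaired test would be the appropriate choice.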

Circularity Check

0 steps flagged

No significant circularity; derivation relies on proposed mechanisms with independent empirical validation.

Full rationale

The paper introduces RETROFIT for continual learning in binary security tasks via retrospective-free parameter merging, low-rank/sparse subspace constraints for approximate orthogonality, and confidence-guided arbitration. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or claims. The reported gains (e.g., retention score improvement from 20.2% to 38.6%, >2x BLEU) are presented as outcomes of the method applied to malware detection and binary summarization, remaining self-contained against external benchmarks without reducing to input definitions or prior author results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to rely on standard assumptions from continual learning and model merging literature without detailing new postulates.

pith-pipeline@v0.9.0 · 5564 in / 1224 out tokens · 37509 ms · 2026-05-17T21:45:00.671667+00:00 · methodology



Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 3 internal anchors

  1. [1]

    Lamd: Context- driven android malware detection and classification with llms,

    X. Qian, X. Zheng, Y . He, S. Yang, and L. Cavallaro, “Lamd: Context- driven android malware detection and classification with llms,” in 2025 IEEE Security and Privacy Workshops (SPW). IEEE, 2025, pp. 126–136

  2. [2]

    Msdroid: Identifying malicious snippets for android malware detection,

    Y . He, Y . Liu, L. Wu, Z. Yang, K. Ren, and Z. Qin, “Msdroid: Identifying malicious snippets for android malware detection,”IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 3, pp. 2025–2039, 2023

  3. [3]

    On distribution shift in learning-based bug detectors,

    J. He, L. Beurer-Kellner, and M. Vechev, “On distribution shift in learning-based bug detectors,” inInternational conference on machine learning. PMLR, 2022, pp. 8559–8580

  4. [4]

    Exploring{ChatGPT’s}capabilities on vulner- ability management,

    P. Liu, J. Liu, L. Fu, K. Lu, Y . Xia, X. Zhang, W. Chen, H. Weng, S. Ji, and W. Wang, “Exploring{ChatGPT’s}capabilities on vulner- ability management,” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 811–828

  5. [5]

    Revisit- ing non-separable binary classification and its applications in anomaly detection,

    M. Lau, I. SECK, A. P. Meliopoulos, W. Lee, and E. Ndiaye, “Revisit- ing non-separable binary classification and its applications in anomaly detection,”Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=zOJ846BXhl

  6. [6]

    Source code foundation models are transferable binary analysis knowledge bases,

    Z. Su, X. Xu, Z. Huang, K. Zhang, and X. Zhang, “Source code foundation models are transferable binary analysis knowledge bases,” Advances in Neural Information Processing Systems, vol. 37, pp. 112 624–112 655, 2024

  7. [7]

    Bodmas: An open dataset for learning based temporal analysis of pe malware,

    L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in2021 IEEE Security and Privacy Workshops (SPW). IEEE, 2021, pp. 78–84

  8. [8]

    Evaluating the effec- tiveness of decompilers,

    Y . Cao, R. Zhang, R. Liang, and K. Chen, “Evaluating the effec- tiveness of decompilers,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 491–502

  9. [9]

    2025 threat intelligence index,

    IBM, “2025 threat intelligence index,” IBM Insti- tute for Business Value, Tech. Rep., Oct 2025. [Online]. Available: https://www.ibm.com/thought-leadership/ institute-business-value/en-us/report/2025-threat-intelligence-index

  10. [10]

    A comprehensive survey of continual learning: Theory, method and application,

    L. Wang, X. Zhang, H. Su, and J. Zhu, “A comprehensive survey of continual learning: Theory, method and application,”IEEE transac- tions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5362–5383, 2024

  11. [11]

    Continuous learning for android malware detection,

    Y . Chen, Z. Ding, and D. Wagner, “Continuous learning for android malware detection,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 1127–1144

  12. [12]

    Madar: Efficient continual learning for malware analysis with distribution-aware re- play,

    M. S. Rahman, S. Coull, Q. Yu, and M. Wright, “Madar: Efficient continual learning for malware analysis with distribution-aware re- play,” inProceedings of the Conference on Applied Machine Learning in Information Security (CAMLIS), 2025

  13. [13]

    Transfer learning via learning to transfer,

    W. Ying, Y . Zhang, J. Huang, and Q. Yang, “Transfer learning via learning to transfer,” inInternational conference on machine learning. PMLR, 2018, pp. 5085–5094

  14. [14]

    Enabling efficient privacy-assured outlier detection over encrypted incremental data sets,

    S. Lai, X. Yuan, A. Sakzad, M. Salehi, J. K. Liu, and D. Liu, “Enabling efficient privacy-assured outlier detection over encrypted incremental data sets,”IEEE Internet of Things Journal, vol. 7, no. 4, pp. 2651–2662, 2019

  15. [15]

    On the conflict between robustness and learning in collaborative machine learning,

    M. Raynal and C. Troncoso, “On the conflict between robustness and learning in collaborative machine learning,” in2025 IEEE Symposium on Security and Privacy (SP). IEEE, 2025, pp. 2171–2189

  16. [16]

    Remind your neural network to prevent catastrophic forgetting,

    T. L. Hayes, K. Kafle, R. Shrestha, M. Acharya, and C. Kanan, “Remind your neural network to prevent catastrophic forgetting,” in European conference on computer vision. Springer, 2020, pp. 466– 483

  17. [17]

    Extending source code pre-trained language models to summarise decompiled binaries,

    A. Al-Kaswan, T. Ahmed, M. Izadi, A. A. Sawant, P. Devanbu, and A. van Deursen, “Extending source code pre-trained language models to summarise decompiled binaries,” in2023 IEEE International Con- ference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2023, pp. 260–271

  18. [18]

    Transcend: Detecting concept drift in malware classification models,

    R. Jordaney, K. Sharad, S. K. Dash, Z. Wang, D. Papini, I. Nouretdi- nov, and L. Cavallaro, “Transcend: Detecting concept drift in malware classification models,” in26th USENIX security symposium (USENIX security 17), 2017, pp. 625–642

  19. [19]

    Catastrophic interference in con- nectionist networks: The sequential learning problem,

    M. McCloskey and N. J. Cohen, “Catastrophic interference in con- nectionist networks: The sequential learning problem,” inPsychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165

  20. [20]

    Learning multiple visual domains with residual adapters,

    S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,”Advances in neural information processing systems, vol. 30, 2017

  21. [21]

    Learning without forgetting,

    Z. Li and D. Hoiem, “Learning without forgetting,”IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2935–2947, 2017

  22. [22]

    Podnet: Pooled outputs distillation for small-tasks incremental learning,

    A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” in European Conference on Computer Vision. Springer, 2020, pp. 86– 102

  23. [23]

    Co2l: Contrastive continual learning,

    H. Cha, J. Lee, and J. Shin, “Co2l: Contrastive continual learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 9516–9525

  24. [24]

    Is multi-task learning an upper bound for continual learning?

    Z. Wu, H. Tran, H. Pirsiavash, and S. Kolouri, “Is multi-task learning an upper bound for continual learning?” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  25. [25]

    Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

    E. Yang, L. Shen, G. Guo, X. Wang, X. Cao, J. Zhang, and D. Tao, “Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities,”arXiv preprint arXiv:2408.07666, 2024

  26. [26]

    A continual learning survey: Defy- ing forgetting in classification tasks,

    M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defy- ing forgetting in classification tasks,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 7, pp. 3366–3385, 2021

  27. [27]

    Persistent backdoor attacks in continual learning,

    Z. Guo, A. Kumar, and R. Tourani, “Persistent backdoor attacks in continual learning,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 6379–6397

  28. [28]

    Expe- rience replay for continual learning,

    D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Expe- rience replay for continual learning,”Advances in neural information processing systems, vol. 32, 2019

  29. [29]

    icarl: Incremental classifier and representation learning,

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010

  30. [30]

    Overcoming catastrophic forgetting in neural networks,

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,”Pro- ceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017

  31. [31]

    Continual learning through synaptic intelligence,

    F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” inInternational conference on machine learn- ing. PMLR, 2017, pp. 3987–3995

  32. [32]

    Memory aware synapses: Learning what (not) to forget,

    R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuyte- laars, “Memory aware synapses: Learning what (not) to forget,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 139–154

  33. [33]

    Packnet: Adding multiple tasks to a single network by iterative pruning,

    A. Mallya and S. Lazebnik, “Packnet: Adding multiple tasks to a single network by iterative pruning,” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7765–7773

  34. [34]

    Overcoming catas- trophic forgetting with hard attention to the task,

    J. Serra, D. Suris, M. Miron, and A. Karatzoglou, “Overcoming catas- trophic forgetting with hard attention to the task,” inInternational conference on machine learning. PMLR, 2018, pp. 4548–4557

  35. [35]

    Lifelong learning with dynamically expandable networks,

    J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,” in6th International Conference on Learning Representations, ICLR 2018, 2018

  36. [36]

    Convnext v2: Co-designing and scaling convnets with masked au- toencoders,

    S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked au- toencoders,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 16 133–16 142

  37. [37]

    Efficientnetv2: Smaller models and faster train- ing,

    M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster train- ing,” inInternational conference on machine learning. PMLR, 2021, pp. 10 096–10 106

  38. [38]

    Tesseract: Eliminating experimental bias in malware classifi- cation across space and time,

    F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Caval- laro, “Tesseract: Eliminating experimental bias in malware classifi- cation across space and time,” in28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 729–746

  39. [39]

    Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,

    S. H. Ding, B. C. Fung, and P. Charland, “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in2019 ieee symposium on security and privacy (sp). IEEE, 2019, pp. 472–489

  40. [40]

    Transcend- ing transcend: Revisiting malware classification in the presence of concept drift,

    F. Barbero, F. Pendlebury, F. Pierazzi, and L. Cavallaro, “Transcend- ing transcend: Revisiting malware classification in the presence of concept drift,” in2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 805–823

  41. [41]

    Demistify: Identifying on-device machine learning models stealing and reuse vulnerabilities in mobile apps,

    P. Ren, C. Zuo, X. Liu, W. Diao, Q. Zhao, and S. Guo, “Demistify: Identifying on-device machine learning models stealing and reuse vulnerabilities in mobile apps,” in2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer So- ciety, 2023, pp. 468–480

  42. [42]

    Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution,

    S. Shao, Y . Li, H. Yao, Y . He, Z. Qin, and K. Ren, “Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution,” inNetwork and Distributed System Security Symposium (NDSS), 2025

  43. [43]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,

    M. Wortsman, G. Ilharco, S. Y . Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y . Carmon, S. Kornblith et al., “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” inInterna- tional conference on machine learning. PMLR, 2022, pp. 23 965– 23 998

  44. [44]

    Lora: Low-rank adaptation of large language mod- els

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language mod- els.”ICLR, vol. 1, no. 2, p. 3, 2022

  45. [45]

    Evaluating model calibration in classification,

    J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. Sch¨on, “Evaluating model calibration in classification,” inThe 22nd international conference on artificial intelligence and statistics. PMLR, 2019, pp. 3459–3467

  46. [46]

    Calibration of large language models on code summarization,

    Y . Virk, P. Devanbu, and T. Ahmed, “Calibration of large language models on code summarization,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 2944–2964, 2025

  47. [47]

    Comparing kullback-leibler divergence and mean squared error loss in knowl- edge distillation,

    T. Kim, J. Oh, N. Y . Kim, S. Cho, and S.-Y . Yun, “Comparing kullback-leibler divergence and mean squared error loss in knowl- edge distillation,” in30th International Joint Conference on Artificial Intelligence (IJCAI-21). IJCAI, 2021, pp. 2628–2635

  48. [48]

    Drebin: Effective and explainable detection of android malware in your pocket

    D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C. Siemens, “Drebin: Effective and explainable detection of android malware in your pocket.” inNdss, vol. 14, no. 1, 2014, pp. 23–26

  49. [49]

    Adversarial examples for malware detection,

    K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, “Adversarial examples for malware detection,” inEuropean sympo- sium on research in computer security. Springer, 2017, pp. 62–79

  50. [50]

    Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,

    Y . Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8696–8708

  51. [51]

    Editing Models with Task Arithmetic

    G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” arXiv preprint arXiv:2212.04089, 2022

  52. [52]

    Ties- merging: Resolving interference when merging models,

    P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties- merging: Resolving interference when merging models,”Advances in Neural Information Processing Systems, vol. 36, pp. 7093–7115, 2023

  53. [53]

    Adamerging: Adaptive model merging for multi-task learning

    E. Yang, Z. Wang, L. Shen, S. Liu, G. Guo, X. Wang, and D. Tao, “Adamerging: Adaptive model merging for multi-task learning,”arXiv preprint arXiv:2310.02575, 2023

  54. [54]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318

  55. [55]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,

    S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72

  56. [56]

    Dexbert: Effective, task-agnostic and fine-grained representation learning of android bytecode,

    T. Sun, K. Allix, K. Kim, X. Zhou, D. Kim, D. Lo, T. F. Bissyandé, and J. Klein, “Dexbert: Effective, task-agnostic and fine-grained representation learning of android bytecode,” IEEE Transactions on Software Engineering, vol. 49, no. 10, pp. 4691–4706, 2023

  57. [57]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023

  58. [58]

    A simple framework for contrastive learning of visual representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607

  59. [59]

    Invariant risk minimization games,

    K. Ahuja, K. Shanmugam, K. Varshney, and A. Dhurandhar, “Invariant risk minimization games,” in International Conference on Machine Learning. PMLR, 2020, pp. 145–155

  60. [60]

    Learning temporal invariance in android malware detectors,

    X. Zheng, S. Yang, E. C. Ngai, S. Jana, and L. Cavallaro, “Learning temporal invariance in android malware detectors,” arXiv preprint arXiv:2502.05098, 2025

  61. [61]

    Enhancing state-of-the-art classifiers with api semantics to detect evolved android malware,

    X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, and M. Yang, “Enhancing state-of-the-art classifiers with api semantics to detect evolved android malware,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 757–770

  62. [62]

    Exploiting code symmetries for learning program semantics,

    K. Pei, W. Li, Q. Jin, S. Liu, S. Geng, L. Cavallaro, J. Yang, and S. Jana, “Exploiting code symmetries for learning program semantics,” in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 40 092–40 113

  63. [63]

    A survey on multi-task learning,

    Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE transactions on knowledge and data engineering, vol. 34, no. 12, pp. 5586–5609, 2021

  64. [64]

    Cross-stitch networks for multi-task learning,

    I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch networks for multi-task learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3994–4003

  65. [65]

    π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation,

    C. Wu, T. Wang, Y. Ge, Z. Lu, R. Zhou, Y. Shan, and P. Luo, “π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation,” in International Conference on Machine Learning. PMLR, 2023, pp. 37 713–37 727

  66. [66]

    Learning under concept drift: A review,

    J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE transactions on knowledge and data engineering, vol. 31, no. 12, pp. 2346–2363, 2018

  67. [67]

    Combating concept drift with explanatory detection and adaptation for android malware classification,

    Y. He, J. Lei, Z. Qin, K. Ren, and C. Chen, “Combating concept drift with explanatory detection and adaptation for android malware classification,” arXiv preprint arXiv:2405.04095, 2024

  68. [68]

    CADE: Detecting and explaining concept drift samples for security applications,

    L. Yang, W. Guo, Q. Hao, A. Ciptadi, A. Ahmadzadeh, X. Xing, and G. Wang, “CADE: Detecting and explaining concept drift samples for security applications,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2327–2344

  69. [69]

    Beyond classification: Evaluating llms for fine-grained automatic malware behavior auditing,

    X. Zheng, X. Qian, Y. He, S. Yang, and L. Cavallaro, “Beyond classification: Evaluating llms for fine-grained automatic malware behavior auditing,” arXiv preprint arXiv:2509.14335, 2025

  70. [70]

    Large language models for code analysis: Do LLMs really do their job?

    C. Fang, N. Miao, S. Srivastav, J. Liu, R. Zhang, R. Fang, Asmita, R. Tsang, N. Nazari, H. Wang, and H. Homayoun, “Large language models for code analysis: Do LLMs really do their job?” in 33rd USENIX Security Symposium (USENIX Security 24). Philadelphia, PA: USENIX Association, Aug. 2024, pp. 829–846

  71. [71]

    Binary code summarization: Benchmarking chatgpt/gpt-4 and other large language models,

    X. Jin, J. Larson, W. Yang, and Z. Lin, “Binary code summarization: Benchmarking chatgpt/gpt-4 and other large language models,” arXiv preprint arXiv:2312.09601, 2023

  72. [72]

    On benchmarking code llms for android malware analysis,

    Y. He, H. She, X. Qian, X. Zheng, Z. Chen, Z. Qin, and L. Cavallaro, “On benchmarking code llms for android malware analysis,” in Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2025, pp. 153–160.

Appendix A. Model-level Bound and Interference

While the main text expresses the update rule using a single wei...