arxiv: 2511.11439 · v2 · submitted 2025-11-14 · 💻 cs.LG · cs.AI

Retrofit: Continual Learning with Controlled Forgetting for Binary Security Detection and Analysis

Yiling He, Junchi Lei, Hongyu She, Shuo Shao, Xinran Zheng, Yiping Liu, Zhan Qin, Lorenzo Cavallaro

Pith reviewed 2026-05-17 21:45 UTC · model grok-4.3


keywords: continual learning, controlled forgetting, malware detection, binary summarization, parameter merging, security analysis, deep learning


The pith

RETROFIT lets security models adapt to new threats while retaining prior knowledge by merging old and new models without replaying data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RETROFIT to update deep learning models for binary security tasks such as malware detection and program summarization as threats and code representations evolve over time. It merges a previously trained model with a newly fine-tuned one through retrospective-free parameter merging, without storing or accessing historical data. Forgetting is controlled by restricting updates to low-rank and sparse subspaces that maintain approximate orthogonality and by using a confidence-guided arbitration step to blend outputs from the legacy and new models. A sympathetic reader would care because security systems must handle shifting malware behaviors and stripped binaries, yet replay methods raise privacy and storage issues in sensitive environments.
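The arbitration step described above can be pictured with a small sketch. It assumes that "confidence" means the maximum softmax probability of each teacher and that outputs are blended with weights proportional to those confidences; the paper's actual rule is not given on this page, and the names `softmax` and `arbitrate` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def arbitrate(legacy_logits, new_logits):
    """Blend two teachers' predictions, weighting each by its own confidence.

    Assumption: confidence = max softmax probability. The paper may use
    a different confidence measure or a different blending rule.
    """
    p_old = softmax(legacy_logits)
    p_new = softmax(new_logits)
    c_old, c_new = max(p_old), max(p_new)
    w_old = c_old / (c_old + c_new)          # weight given to the legacy teacher
    blended = [w_old * a + (1.0 - w_old) * b for a, b in zip(p_old, p_new)]
    return blended, w_old

# The legacy teacher is confident the sample is class 0;
# the new teacher is nearly undecided.
blended, w_old = arbitrate([4.0, 0.0], [0.1, 0.0])
```

In this toy case the legacy teacher receives the larger weight because it is the more confident of the two, so the blended distribution leans toward its prediction. A learned or calibrated confidence score could be substituted without changing the structure.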

Core claim

RETROFIT regulates knowledge retention and adaptation with controlled forgetting at each update by consolidating previously trained and newly fine-tuned models as teachers of legacy and emergent knowledge through retrospective-free parameter merging, with forgetting control achieved by constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality and employing a confidence-guided arbitration mechanism to dynamically aggregate knowledge from both teachers.

What carries the argument

Retrospective-free parameter merging constrained to low-rank and sparse subspaces for approximate orthogonality, combined with a confidence-guided arbitration mechanism to aggregate legacy and new knowledge.
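To make the two constraints concrete, the sketch below builds a low-rank update as a product of two thin factors (in the style of LoRA) and then keeps only its largest-magnitude entries. How RETROFIT actually composes the low-rank and sparse constraints is not stated on this page, so the composition, the helper names, and all numbers here are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def low_rank_delta(b_factor, a_factor):
    """Delta = B @ A has rank at most r, the shared inner dimension."""
    return matmul(b_factor, a_factor)

def sparsify_top_k(delta, k):
    """Keep the k largest-magnitude entries of delta and zero the rest."""
    cells = [(abs(v), i, j) for i, row in enumerate(delta) for j, v in enumerate(row)]
    keep = {(i, j) for _, i, j in sorted(cells, reverse=True)[:k]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(delta)]

# Hypothetical 3x3 legacy weights and a rank-1 update (r = 1).
w_old = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_factor = [[1.0], [0.5], [-2.0]]            # shape 3 x 1
a_factor = [[0.2, -0.1, 0.4]]                # shape 1 x 3

delta = low_rank_delta(b_factor, a_factor)   # rank at most 1
sparse_delta = sparsify_top_k(delta, k=3)    # only 3 of 9 entries change
merged = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(w_old, sparse_delta)]
```

Masking a low-rank matrix generally raises its rank, so the two constraints do not trivially compose. That is one reason the exact construction matters for the forgetting argument.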

If this is right

In malware detection under temporal drift, retention score rises from 20.2% to 38.6% over continual learning baselines.
Performance on new data exceeds the oracle upper bound.
In binary summarization across decompilation levels, BLEU score more than doubles that of prior transfer learning.
Cross-representation generalization surpasses all baselines when analyzing stripped binaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same merging approach could apply to other data-private continual learning settings such as updating fraud detectors without retaining past transactions.
If new threats demand updates outside the low-rank subspace, adaptability might degrade in domains with very rapid change.
Additional validation on vulnerability detection or different binary formats would test whether controlled forgetting holds beyond the two evaluated tasks.
Pairing the arbitration step with ensemble techniques might further strengthen knowledge consolidation.

Load-bearing premise

That constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality combined with confidence-guided arbitration will reliably control forgetting and enable effective knowledge consolidation without access to historical data or explicit replay.

What would settle it

A temporal malware detection experiment where RETROFIT shows retention below 30% or fails to exceed the oracle upper bound on new-data accuracy while preserving adaptability.

Figures

Figures reproduced from arXiv: 2511.11439 by Hongyu She, Junchi Lei, Lorenzo Cavallaro, Shuo Shao, Xinran Zheng, Yiling He, Yiping Liu, Zhan Qin.

**Figure 2.** Continual learning for addressing temporal shift (left) and representation shift (right) in security applications.

**Figure 3.** Design insights for accumulating old and new knowledge at

**Figure 4.** Cumulative CL comparison in malware detection.

**Figure 5.** Comparison with existing CL methods in binary analysis.

**Figure 7.** Improvements in retention and adaptation performance when

**Figure 6.** Ablation of our adaptive merging strategy and low-rank constraint.

**Figure 9.** Summary comparison for challenging stripped binaries, where

Original abstract

Binary security has increasingly relied on deep learning to reason about malware behavior and program semantics. However, the performance often degrades as threat landscapes evolve and code representations shift. While continual learning (CL) offers a natural solution through sequential updates, most existing approaches rely on data replay or unconstrained updates, limiting their applicability and effectiveness in data-sensitive security environments. We propose RETROFIT, which regulates knowledge retention and adaptation with controlled forgetting at each update, without requiring historical data. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of legacy and emergent knowledge, through retrospective-free parameter merging. Forgetting control is achieved by 1) constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality, and 2) employing a confidence-guided arbitration mechanism to dynamically aggregate knowledge from both teachers. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves over 2x the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RETROFIT applies replay-free model merging with low-rank sparse constraints to continual learning in malware detection and binary summarization, reporting retention and BLEU gains, but the orthogonality mechanism needs explicit verification.


The main point is that this paper adapts parameter merging and low-rank constraints to continual learning for binary security tasks without historical data replay. It merges a legacy model and a new fine-tuned one, then uses subspace constraints plus confidence arbitration to limit forgetting while adapting to drift or new representations. The reported results show retention scores rising from 20.2% to 38.6% on temporal malware detection and more than double the BLEU score versus prior transfer learning on decompilation summarization, with better cross-representation performance. Those concrete numbers on real security workloads are the clearest positive here, and the setup directly targets the replay restriction common in security ML.

Referee Report

3 major / 2 minor

Summary. The paper proposes RETROFIT, a continual learning framework for binary security tasks that performs retrospective-free merging of a legacy model and a newly fine-tuned model. Forgetting is controlled by constraining updates to low-rank and sparse subspaces (intended to produce approximate orthogonality) combined with a confidence-guided arbitration mechanism that aggregates knowledge from both teachers without access to historical data. Experiments on malware detection under temporal drift report retention-score gains from 20.2% to 38.6% and exceed the oracle on new data; on binary summarization across decompilation levels the method claims more than 2x BLEU improvement over prior transfer-learning baselines and better cross-representation generalization.

Significance. If the central mechanism is shown to be sound, the work would offer a practical route to continual adaptation in data-sensitive security settings where replay is prohibited. The reported numerical gains are concrete and the two-task evaluation (temporal malware drift and multi-level decompilation) is relevant to the domain.

major comments (3)

[§3.2] §3.2 (Parameter Merging and Subspace Constraints): The claim that restricting updates to low-rank and sparse subspaces produces 'approximate orthogonality' between legacy and new parameters is not supported by the given construction. Low-rank or sparse deltas can still have non-zero inner products with prior parameter directions; no explicit projection (e.g., Gram-Schmidt or orthogonal complement projection) is described. This step is load-bearing for the forgetting-control argument and must be supplied or the claim revised.
[§4.1] §4.1 and Table 1 (Malware Detection Results): The retention-score improvement from 20.2% to 38.6% and the claim of exceeding the oracle upper bound on new data are presented without error bars, number of runs, or statistical tests. Because the central claim is that RETROFIT 'consistently mitigates forgetting,' these omissions make it impossible to judge whether the gains are reliable or merely point estimates.
[§4.2] §4.2 (Binary Summarization): The assertion that RETROFIT surpasses all baselines in cross-representation generalization relies on BLEU scores that are more than double those of transfer learning. The paper must clarify whether the same low-rank/sparse constraint and arbitration are applied uniformly across decompilation levels and whether any representation-specific hyper-parameters were tuned; otherwise the generalization claim rests on an under-specified experimental protocol.
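The first major comment can be checked on a toy example. The sketch below assumes that "orthogonality" is measured by the Frobenius inner product between the update and the legacy weight matrix, which is only one possible reading. It shows a rank-1, sparse update that still overlaps with the legacy weights, and a single Gram-Schmidt step that removes the overlap. All values and function names are hypothetical.

```python
def inner(a, b):
    """Frobenius inner product of two matrices given as lists of rows."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def project_out(delta, w_old):
    """Remove from delta its component along w_old (one Gram-Schmidt step)."""
    coef = inner(delta, w_old) / inner(w_old, w_old)
    return [[d - coef * w for d, w in zip(rd, rw)] for rd, rw in zip(delta, w_old)]

w_old = [[1.0, 2.0], [0.0, 1.0]]
# Rank-1 and sparse (a single non-zero row), yet not orthogonal to w_old.
delta = [[0.5, 1.0], [0.0, 0.0]]

overlap_before = inner(delta, w_old)         # 0.5*1 + 1.0*2 = 2.5
delta_perp = project_out(delta, w_old)
overlap_after = inner(delta_perp, w_old)     # zero up to rounding error
```

In this example the projected update gains a non-zero entry in the second row, so an explicit projection can undo the sparsity it was combined with. A revised §3.2 would need to say which property is enforced and which is only encouraged.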

minor comments (2)

[Abstract] The abstract states that RETROFIT 'exceeds the oracle upper bound on new data'; this counter-intuitive result should be explained in the main text with a precise definition of the oracle.
[§3] Notation for the two teachers (legacy vs. emergent) and the arbitration weights should be introduced once and used consistently; current usage mixes 'teacher' and 'model' terminology.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological clarity and experimental reporting. We address each major comment below and have made revisions to strengthen the manuscript.


Referee: [§3.2] §3.2 (Parameter Merging and Subspace Constraints): The claim that restricting updates to low-rank and sparse subspaces produces 'approximate orthogonality' between legacy and new parameters is not supported by the given construction. Low-rank or sparse deltas can still have non-zero inner products with prior parameter directions; no explicit projection (e.g., Gram-Schmidt or orthogonal complement projection) is described. This step is load-bearing for the forgetting-control argument and must be supplied or the claim revised.

Authors: We agree that the original phrasing overstated the guarantee provided by the construction. The low-rank and sparse constraints limit update capacity and thereby reduce the potential for parameter interference, but they do not enforce orthogonality without an explicit projection step. In the revised manuscript we have removed the term 'approximate orthogonality' from §3.2, replaced it with a description of dimensionality reduction and empirical interference control, and added a short discussion of the observed forgetting mitigation. revision: yes
Referee: [§4.1] §4.1 and Table 1 (Malware Detection Results): The retention-score improvement from 20.2% to 38.6% and the claim of exceeding the oracle upper bound on new data are presented without error bars, number of runs, or statistical tests. Because the central claim is that RETROFIT 'consistently mitigates forgetting,' these omissions make it impossible to judge whether the gains are reliable or merely point estimates.

Authors: We accept that statistical reporting was insufficient. The revised version reports results averaged over five independent runs with standard-deviation error bars in Table 1 and includes a statistical significance analysis (paired t-tests, p < 0.05) confirming the retention gains. We have also clarified that the oracle comparison on new data reflects the arbitration mechanism's focus on current-task performance rather than a violation of the upper bound. revision: yes
Referee: [§4.2] §4.2 (Binary Summarization): The assertion that RETROFIT surpasses all baselines in cross-representation generalization relies on BLEU scores that are more than double those of transfer learning. The paper must clarify whether the same low-rank/sparse constraint and arbitration are applied uniformly across decompilation levels and whether any representation-specific hyper-parameters were tuned; otherwise the generalization claim rests on an under-specified experimental protocol.

Authors: The low-rank/sparse constraints and confidence-guided arbitration are applied uniformly across all decompilation levels. Hyper-parameters (rank, sparsity, arbitration thresholds) were selected once via validation-set search for the overall task and held fixed; no per-representation tuning was performed. The revised §4.2 and a new appendix table now document the exact hyper-parameter values and confirm the uniform protocol. revision: yes
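The statistical reporting promised in the response to the §4.1 comment (five runs, standard deviations, paired t-tests) can be sketched with the standard library alone. The scores below are placeholders invented for illustration and are not results from the paper; `paired_t` is a hypothetical helper.

```python
import math
import statistics

def paired_t(a, b):
    """Return (mean difference, t statistic) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)           # sample standard deviation (n - 1)
    t_stat = mean_d / (sd_d / math.sqrt(len(diffs)))
    return mean_d, t_stat

# Placeholder retention scores for five seeds; NOT results from the paper.
method_runs = [0.52, 0.49, 0.51, 0.53, 0.50]
baseline_runs = [0.31, 0.30, 0.28, 0.33, 0.30]

mean_d, t_stat = paired_t(method_runs, baseline_runs)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing by seed is appropriate only if both methods are run on the same splits and seeds. With five runs the test has four degrees of freedom, so reporting the per-run values alongside the p-value helps a reader judge the result.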

Circularity Check

0 steps flagged

No significant circularity; derivation relies on proposed mechanisms with independent empirical validation.


The paper introduces RETROFIT for continual learning in binary security tasks via retrospective-free parameter merging, low-rank/sparse subspace constraints for approximate orthogonality, and confidence-guided arbitration. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or claims. The reported gains (e.g., retention score improvement from 20.2% to 38.6%, >2x BLEU) are presented as outcomes of the method applied to malware detection and binary summarization, remaining self-contained against external benchmarks without reducing to input definitions or prior author results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to rely on standard assumptions from continual learning and model merging literature without detailing new postulates.

pith-pipeline@v0.9.0 · 5564 in / 1224 out tokens · 37509 ms · 2026-05-17T21:45:00.671667+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "constraining parameter changes to low-rank and sparse subspaces for approximate orthogonality... confidence-guided arbitration mechanism"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "RETROFIT... bounded forgetting... retrospective-free parameter merging"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 3 internal anchors

[1]

Lamd: Context- driven android malware detection and classification with llms,

X. Qian, X. Zheng, Y . He, S. Yang, and L. Cavallaro, “Lamd: Context- driven android malware detection and classification with llms,” in 2025 IEEE Security and Privacy Workshops (SPW). IEEE, 2025, pp. 126–136

work page 2025
[2]

Msdroid: Identifying malicious snippets for android malware detection,

Y . He, Y . Liu, L. Wu, Z. Yang, K. Ren, and Z. Qin, “Msdroid: Identifying malicious snippets for android malware detection,”IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 3, pp. 2025–2039, 2023

work page 2025
[3]

On distribution shift in learning-based bug detectors,

J. He, L. Beurer-Kellner, and M. Vechev, “On distribution shift in learning-based bug detectors,” inInternational conference on machine learning. PMLR, 2022, pp. 8559–8580

work page 2022
[4]

Exploring{ChatGPT’s}capabilities on vulner- ability management,

P. Liu, J. Liu, L. Fu, K. Lu, Y . Xia, X. Zhang, W. Chen, H. Weng, S. Ji, and W. Wang, “Exploring{ChatGPT’s}capabilities on vulner- ability management,” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 811–828

work page 2024
[5]

Revisit- ing non-separable binary classification and its applications in anomaly detection,

M. Lau, I. SECK, A. P. Meliopoulos, W. Lee, and E. Ndiaye, “Revisit- ing non-separable binary classification and its applications in anomaly detection,”Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=zOJ846BXhl

work page 2024
[6]

Source code foundation models are transferable binary analysis knowledge bases,

Z. Su, X. Xu, Z. Huang, K. Zhang, and X. Zhang, “Source code foundation models are transferable binary analysis knowledge bases,” Advances in Neural Information Processing Systems, vol. 37, pp. 112 624–112 655, 2024

work page 2024
[7]

Bodmas: An open dataset for learning based temporal analysis of pe malware,

L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in2021 IEEE Security and Privacy Workshops (SPW). IEEE, 2021, pp. 78–84

work page 2021
[8]

Evaluating the effec- tiveness of decompilers,

Y . Cao, R. Zhang, R. Liang, and K. Chen, “Evaluating the effec- tiveness of decompilers,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 491–502

work page 2024
[9]

2025 threat intelligence index,

IBM, “2025 threat intelligence index,” IBM Insti- tute for Business Value, Tech. Rep., Oct 2025. [Online]. Available: https://www.ibm.com/thought-leadership/ institute-business-value/en-us/report/2025-threat-intelligence-index

work page 2025
[10]

A comprehensive survey of continual learning: Theory, method and application,

L. Wang, X. Zhang, H. Su, and J. Zhu, “A comprehensive survey of continual learning: Theory, method and application,”IEEE transac- tions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5362–5383, 2024

work page 2024
[11]

Continuous learning for android malware detection,

Y . Chen, Z. Ding, and D. Wagner, “Continuous learning for android malware detection,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 1127–1144

work page 2023
[12]

Madar: Efficient continual learning for malware analysis with distribution-aware re- play,

M. S. Rahman, S. Coull, Q. Yu, and M. Wright, “Madar: Efficient continual learning for malware analysis with distribution-aware re- play,” inProceedings of the Conference on Applied Machine Learning in Information Security (CAMLIS), 2025

work page 2025
[13]

Transfer learning via learning to transfer,

W. Ying, Y . Zhang, J. Huang, and Q. Yang, “Transfer learning via learning to transfer,” inInternational conference on machine learning. PMLR, 2018, pp. 5085–5094

work page 2018
[14]

Enabling efficient privacy-assured outlier detection over encrypted incremental data sets,

S. Lai, X. Yuan, A. Sakzad, M. Salehi, J. K. Liu, and D. Liu, “Enabling efficient privacy-assured outlier detection over encrypted incremental data sets,”IEEE Internet of Things Journal, vol. 7, no. 4, pp. 2651–2662, 2019

work page 2019
[15]

On the conflict between robustness and learning in collaborative machine learning,

M. Raynal and C. Troncoso, “On the conflict between robustness and learning in collaborative machine learning,” in2025 IEEE Symposium on Security and Privacy (SP). IEEE, 2025, pp. 2171–2189

work page 2025
[16]

Remind your neural network to prevent catastrophic forgetting,

T. L. Hayes, K. Kafle, R. Shrestha, M. Acharya, and C. Kanan, “Remind your neural network to prevent catastrophic forgetting,” in European conference on computer vision. Springer, 2020, pp. 466– 483

work page 2020
[17]

Extending source code pre-trained language models to summarise decompiled binaries,

A. Al-Kaswan, T. Ahmed, M. Izadi, A. A. Sawant, P. Devanbu, and A. van Deursen, “Extending source code pre-trained language models to summarise decompiled binaries,” in2023 IEEE International Con- ference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2023, pp. 260–271

work page 2023
[18]

Transcend: Detecting concept drift in malware classification models,

R. Jordaney, K. Sharad, S. K. Dash, Z. Wang, D. Papini, I. Nouretdi- nov, and L. Cavallaro, “Transcend: Detecting concept drift in malware classification models,” in26th USENIX security symposium (USENIX security 17), 2017, pp. 625–642

work page 2017
[19]

Catastrophic interference in con- nectionist networks: The sequential learning problem,

M. McCloskey and N. J. Cohen, “Catastrophic interference in con- nectionist networks: The sequential learning problem,” inPsychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165

work page 1989
[20]

Learning multiple visual domains with residual adapters,

S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[21]

Learning without forgetting,

Z. Li and D. Hoiem, “Learning without forgetting,”IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2935–2947, 2017

work page 2017
[22]

Podnet: Pooled outputs distillation for small-tasks incremental learning,

A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” in European Conference on Computer Vision. Springer, 2020, pp. 86– 102

work page 2020
[23]

Co2l: Contrastive continual learning,

H. Cha, J. Lee, and J. Shin, “Co2l: Contrastive continual learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 9516–9525

work page 2021
[24]

Is multi-task learning an upper bound for continual learning?

Z. Wu, H. Tran, H. Pirsiavash, and S. Kolouri, “Is multi-task learning an upper bound for continual learning?” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

work page 2023
[25]

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

E. Yang, L. Shen, G. Guo, X. Wang, X. Cao, J. Zhang, and D. Tao, “Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities,”arXiv preprint arXiv:2408.07666, 2024

work page internal anchor Pith review arXiv 2024
[26]

A continual learning survey: Defy- ing forgetting in classification tasks,

M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defy- ing forgetting in classification tasks,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 7, pp. 3366–3385, 2021

work page 2021
[27]

Persistent backdoor attacks in continual learning,

Z. Guo, A. Kumar, and R. Tourani, “Persistent backdoor attacks in continual learning,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 6379–6397

work page 2025
[28]

Expe- rience replay for continual learning,

D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Expe- rience replay for continual learning,”Advances in neural information processing systems, vol. 32, 2019

work page 2019
[29]

icarl: Incremental classifier and representation learning,

S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010

work page 2017
[30]

Overcoming catastrophic forgetting in neural networks,

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,”Pro- ceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017

work page 2017
[31]

Continual learning through synaptic intelligence,

F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” inInternational conference on machine learn- ing. PMLR, 2017, pp. 3987–3995

work page 2017
[32]

Memory aware synapses: Learning what (not) to forget,

R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuyte- laars, “Memory aware synapses: Learning what (not) to forget,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 139–154

work page 2018
[33]

Packnet: Adding multiple tasks to a single network by iterative pruning,

A. Mallya and S. Lazebnik, “Packnet: Adding multiple tasks to a single network by iterative pruning,” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7765–7773

work page 2018
[34]

Overcoming catas- trophic forgetting with hard attention to the task,

J. Serra, D. Suris, M. Miron, and A. Karatzoglou, “Overcoming catas- trophic forgetting with hard attention to the task,” inInternational conference on machine learning. PMLR, 2018, pp. 4548–4557

work page 2018
[35]

Lifelong learning with dynamically expandable networks,

J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,” in6th International Conference on Learning Representations, ICLR 2018, 2018

work page 2018
[36]

Convnext v2: Co-designing and scaling convnets with masked au- toencoders,

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked au- toencoders,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 16 133–16 142

work page 2023
[37]

Efficientnetv2: Smaller models and faster train- ing,

M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster train- ing,” inInternational conference on machine learning. PMLR, 2021, pp. 10 096–10 106

work page 2021
[38]

Tesseract: Eliminating experimental bias in malware classifi- cation across space and time,

F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Caval- laro, “Tesseract: Eliminating experimental bias in malware classifi- cation across space and time,” in28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 729–746

work page 2019
[39]

Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,

S. H. Ding, B. C. Fung, and P. Charland, “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in2019 ieee symposium on security and privacy (sp). IEEE, 2019, pp. 472–489

work page 2019
[40]

Transcend- ing transcend: Revisiting malware classification in the presence of concept drift,

F. Barbero, F. Pendlebury, F. Pierazzi, and L. Cavallaro, “Transcend- ing transcend: Revisiting malware classification in the presence of concept drift,” in2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 805–823

work page 2022
[41]

Demistify: Identifying on-device machine learning models stealing and reuse vulnerabilities in mobile apps,

P. Ren, C. Zuo, X. Liu, W. Diao, Q. Zhao, and S. Guo, “Demistify: Identifying on-device machine learning models stealing and reuse vulnerabilities in mobile apps,” in2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer So- ciety, 2023, pp. 468–480

work page 2023
[42]

Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution,

S. Shao, Y . Li, H. Yao, Y . He, Z. Qin, and K. Ren, “Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution,” inNetwork and Distributed System Security Symposium (NDSS), 2025

work page 2025
[43]

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,

M. Wortsman, G. Ilharco, S. Y . Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y . Carmon, S. Kornblith et al., “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” inInterna- tional conference on machine learning. PMLR, 2022, pp. 23 965– 23 998

work page 2022
[44]

Lora: Low-rank adaptation of large language mod- els

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language mod- els.”ICLR, vol. 1, no. 2, p. 3, 2022

work page 2022
[45]

Evaluating model calibration in classification,

J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. Sch¨on, “Evaluating model calibration in classification,” inThe 22nd international conference on artificial intelligence and statistics. PMLR, 2019, pp. 3459–3467

work page 2019
[46]

Calibration of large language models on code summarization,

Y . Virk, P. Devanbu, and T. Ahmed, “Calibration of large language models on code summarization,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 2944–2964, 2025

work page 2025
[47]

Comparing kullback-leibler divergence and mean squared error loss in knowl- edge distillation,

T. Kim, J. Oh, N. Y . Kim, S. Cho, and S.-Y . Yun, “Comparing kullback-leibler divergence and mean squared error loss in knowl- edge distillation,” in30th International Joint Conference on Artificial Intelligence (IJCAI-21). IJCAI, 2021, pp. 2628–2635

work page 2021
[48]

Drebin: Effective and explainable detection of android malware in your pocket

D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C. Siemens, “Drebin: Effective and explainable detection of android malware in your pocket.” inNdss, vol. 14, no. 1, 2014, pp. 23–26

work page 2014
[49]

K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, “Adversarial examples for malware detection,” in European symposium on research in computer security. Springer, 2017, pp. 62–79.

[50]

Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8696–8708.

[51]

G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” arXiv preprint arXiv:2212.04089, 2022.

[52]

P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal, “Ties-merging: Resolving interference when merging models,” Advances in Neural Information Processing Systems, vol. 36, pp. 7093–7115, 2023.

[53]

E. Yang, Z. Wang, L. Shen, S. Liu, G. Guo, X. Wang, and D. Tao, “Adamerging: Adaptive model merging for multi-task learning,” arXiv preprint arXiv:2310.02575, 2023.

[54]

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

[55]

S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.

[56]

T. Sun, K. Allix, K. Kim, X. Zhou, D. Kim, D. Lo, T. F. Bissyandé, and J. Klein, “Dexbert: Effective, task-agnostic and fine-grained representation learning of android bytecode,” IEEE Transactions on Software Engineering, vol. 49, no. 10, pp. 4691–4706, 2023.

[57]

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.

[58]

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.

[59]

K. Ahuja, K. Shanmugam, K. Varshney, and A. Dhurandhar, “Invariant risk minimization games,” in International Conference on Machine Learning. PMLR, 2020, pp. 145–155.

[60]

X. Zheng, S. Yang, E. C. Ngai, S. Jana, and L. Cavallaro, “Learning temporal invariance in android malware detectors,” arXiv preprint arXiv:2502.05098, 2025.

[61]

X. Zhang, Y. Zhang, M. Zhong, D. Ding, Y. Cao, Y. Zhang, M. Zhang, and M. Yang, “Enhancing state-of-the-art classifiers with api semantics to detect evolved android malware,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 757–770.

[62]

K. Pei, W. Li, Q. Jin, S. Liu, S. Geng, L. Cavallaro, J. Yang, and S. Jana, “Exploiting code symmetries for learning program semantics,” in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 40092–40113.

[63]

Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE transactions on knowledge and data engineering, vol. 34, no. 12, pp. 5586–5609, 2021.

[64]

I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch networks for multi-task learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3994–4003.

[65]

C. Wu, T. Wang, Y. Ge, Z. Lu, R. Zhou, Y. Shan, and P. Luo, “π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation,” in International Conference on Machine Learning. PMLR, 2023, pp. 37713–37727.

[66]

J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE transactions on knowledge and data engineering, vol. 31, no. 12, pp. 2346–2363, 2018.

[67]

Y. He, J. Lei, Z. Qin, K. Ren, and C. Chen, “Combating concept drift with explanatory detection and adaptation for android malware classification,” arXiv preprint arXiv:2405.04095, 2024.

[68]

L. Yang, W. Guo, Q. Hao, A. Ciptadi, A. Ahmadzadeh, X. Xing, and G. Wang, “CADE: Detecting and explaining concept drift samples for security applications,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2327–2344.

[69]

X. Zheng, X. Qian, Y. He, S. Yang, and L. Cavallaro, “Beyond classification: Evaluating llms for fine-grained automatic malware behavior auditing,” arXiv preprint arXiv:2509.14335, 2025.

[70]

C. Fang, N. Miao, S. Srivastav, J. Liu, R. Zhang, R. Fang, Asmita, R. Tsang, N. Nazari, H. Wang, and H. Homayoun, “Large language models for code analysis: Do LLMs really do their job?” in 33rd USENIX Security Symposium (USENIX Security 24). Philadelphia, PA: USENIX Association, Aug. 2024, pp. 829–846.

[71]

X. Jin, J. Larson, W. Yang, and Z. Lin, “Binary code summarization: Benchmarking chatgpt/gpt-4 and other large language models,” arXiv preprint arXiv:2312.09601, 2023.

[72]

Y. He, H. She, X. Qian, X. Zheng, Z. Chen, Z. Qin, and L. Cavallaro, “On benchmarking code llms for android malware analysis,” in Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2025, pp. 153–160.

Appendix A. Model-level Bound and Interference

While the main text expresses the update rule using a single wei...