arxiv: 2601.02438 · v3 · submitted 2026-01-05 · 💻 cs.SE · cs.AI · cs.CR

Focus on What Matters: Fisher-Guided Adaptive Multimodal Fusion for Vulnerability Detection

Yun Bian, Yi Chen, HaiQuan Wang, Shihao Li, Zhe Cui

Pith reviewed 2026-05-16 18:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR

keywords vulnerability detection · multimodal fusion · Fisher information · code property graph · pretrained language models · software security

The pith

Fisher information selects only relevant signals when fusing code sequences with graph structures for vulnerability detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that pretrained language models already capture most structural cues in code, so simply adding graph-based representations often adds noise rather than new information and can weaken the model's ability to spot security flaws. The authors replace full fusion with a selective approach that uses Fisher information to quantify how relevant each part of the feature space is to the detection task. This turns the fusion step into a targeted subspace operation that, under an isotropic perturbation assumption, tightens the upper bound on output error. The resulting TaCCS-DFA system reports F1 gains of up to 6.3 points on standard benchmarks with a 3.4% increase in inference latency.

Core claim

Task-conditioned complementary fusion guided by Fisher information converts cross-modal interaction from full-spectrum matching into selective fusion inside a task-sensitive subspace; under the isotropic perturbation assumption this step tightens the upper bound on output error and yields more accurate binary classification of vulnerable code snippets.
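
As a hedged illustration of why an isotropic assumption favors a low-rank subspace, consider the generic bound below. The symbols (perturbation $\varepsilon$, feature dimension $d$, subspace rank $k$, orthonormal basis $U$, and Lipschitz constant $L$ of a downstream classifier $f$) are assumptions introduced here for illustration. This is not the paper's Theorem 3.1, whose statement is not reproduced on this page.

```latex
\begin{align}
\varepsilon &\sim \mathcal{N}(0, \sigma^2 I_d), \qquad
U \in \mathbb{R}^{d \times k},\; U^\top U = I_k, \qquad P = U U^\top \\
\mathbb{E}\,\|\varepsilon\|^2 &= \sigma^2 d, \qquad
\mathbb{E}\,\|P\varepsilon\|^2 = \sigma^2 \operatorname{tr}(P) = \sigma^2 k \\
\mathbb{E}\,\|f(z + P\varepsilon) - f(z)\|^2 &\le L^2 \sigma^2 k
\quad \text{versus} \quad
\mathbb{E}\,\|f(z + \varepsilon) - f(z)\|^2 \le L^2 \sigma^2 d
\end{align}
```

Under these assumptions the bound shrinks by the factor $k/d$ for any choice of $k$-dimensional subspace. The choice of subspace then matters only for how much task-relevant signal is retained.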

What carries the argument

TaCCS-DFA framework that performs online low-rank Fisher subspace estimation combined with an adaptive gating mechanism to enable task-oriented fusion of natural code sequence and code property graph representations.
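
The two ingredients named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an Oja-style update (the reference list cites Oja's rule) applied to per-sample gradients of the log-likelihood with respect to the fused feature, and a scalar sigmoid gate. The function names `oja_update` and `gated_fusion`, the gate parameterisation, and the synthetic gradients are all hypothetical.

```python
import numpy as np

def oja_update(U, g, lr=0.05):
    """One Oja-style step: pull the orthonormal basis U (d x k) toward the
    dominant directions of the gradient outer product g g^T."""
    U = U + lr * np.outer(g, g @ U)   # rank-one update (g g^T) U
    Q, _ = np.linalg.qr(U)            # re-orthonormalise the columns
    return Q

def gated_fusion(z_ncs, z_cpg, U, w_gate, b_gate=0.0):
    """Add the CPG feature to the NCS feature only inside span(U),
    scaled by a scalar sigmoid gate (hypothetical parameterisation)."""
    p_ncs, p_cpg = U.T @ z_ncs, U.T @ z_cpg
    logit = w_gate @ np.concatenate([p_ncs, p_cpg]) + b_gate
    gate = 1.0 / (1.0 + np.exp(-logit))
    return z_ncs + gate * (U @ p_cpg)

# Synthetic demo: gradients concentrate on the first k coordinates,
# standing in for the task-sensitive directions of an empirical Fisher matrix.
rng = np.random.default_rng(0)
d, k = 8, 2
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
for _ in range(500):
    g = np.zeros(d)
    g[:k] = rng.normal(size=k)        # strong task-sensitive component
    g += 0.01 * rng.normal(size=d)    # weak background component
    U = oja_update(U, g)

z_ncs, z_cpg = rng.normal(size=d), rng.normal(size=d)
fused = gated_fusion(z_ncs, z_cpg, U, w_gate=np.zeros(2 * k))
```

The update costs O(dk) per sample plus a thin QR factorisation, which is one plausible reason an online low-rank estimate could keep overhead small. The fused vector differs from `z_ncs` only inside the estimated subspace, so CPG components outside it never reach the classifier.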

If this is right

Detection F1 rises by as much as 6.3 points on BigVul, Devign and ReVeal.
Inference latency grows by only 3.4 percent.
Calibration error stays low across the evaluated datasets.
Naive multimodal fusion can dilute useful signals through noise propagation.
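
The page does not say which calibration metric the paper reports. Assuming it is the expected calibration error (ECE) in the sense of Guo et al., which appears in the reference list, a minimal binary-classification sketch looks like this. The function name and the choice of ten equal-width bins are assumptions.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier. `probs` are predicted probabilities of
    the positive (vulnerable) class; confidence is max(p, 1 - p)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # weight each bin's |accuracy - confidence| gap by its share of samples
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# A model that says 0.9 and is right 9 times out of 10 is well calibrated.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
# A model that says 1.0 but is right only half the time is not.
overconfident = expected_calibration_error([1.0] * 4, [1, 0, 1, 0])
```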

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Fisher-selection idea could be applied to other code-analysis tasks where sequence and graph features overlap.
Developers might be able to rely more on pretrained models alone and skip expensive graph encoders in many settings.
The method could be tested on larger code models or on languages beyond those in the current benchmarks.

Load-bearing premise

Pretrained language models already contain most of the structural information that graph encoders would supply, so the two modalities overlap heavily and selective Fisher-based fusion is needed to avoid noise.
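
One common way to quantify this kind of overlap is linear centered kernel alignment (CKA); the paper's Figure 2 caption reports a CKA similarity of 0.68 between NCS and CPG representations. The sketch below implements the standard linear CKA formula from Kornblith et al. (cited in the reference list). Whether the paper uses the linear or a kernel variant is not stated on this page, so treat the choice as an assumption.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with the same number of rows
    (samples). X is (n, d1), Y is (n, d2). Returns a value in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)   # centre each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))                 # stand-in for NCS features
Q, _ = np.linalg.qr(rng.normal(size=(6, 6))) # random orthogonal matrix
same_up_to_rotation = linear_cka(X, 3.0 * X @ Q)
unrelated = linear_cka(X, rng.normal(size=(50, 4)))
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling, so `same_up_to_rotation` equals 1 up to rounding, while independent random features score much lower.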

What would settle it

An ablation study on the same benchmarks in which full, non-selective fusion matches or exceeds the Fisher-guided version on F1 while showing equal or lower calibration error.

Figures

Figures reproduced from arXiv: 2601.02438 by HaiQuan Wang, Shihao Li, Yi Chen, Yun Bian, Zhe Cui.

**Figure 2.** Feature space analysis. (a) The CKA similarity between NCS and CPG representations reaches 0.68; …

**Figure 3.** Overview of the TaCCS-DFA framework.

For the CPG modality $\mathcal{G}_{cpg} = (\mathcal{V}, \mathcal{E})$, we employ a Relational Graph Convolutional Network (RGCN) [29] to model heterogeneous program graphs with multiple edge types. RGCN extends GCNs by learning relation-specific weight matrices to aggregate neighborhood information. The node update rule is:

$$h_i^{(l+1)} = \sigma\left( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right) \qquad (2)$$

Afte…

**Figure 4.** Metric profile of the main results on three datasets. Each curve corresponds to one model/method on …

**Figure 5.** Comparison of line-level attention distributions. The top two plots visualize the same Use-After-Free …

**Figure 6.** Noise-sensitivity experiment. The red curve corresponds to noise injected into the orthogonal comple…

**Figure 7.** Efficiency comparison between TaCCS-DFA and mainstream fusion methods. From left to right, we …
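
The RGCN node update rule quoted in the Figure 3 caption above can be sketched in a few lines of NumPy. This is an illustrative dense-matrix version, not the authors' code: it assumes the normalisation constant $c_{i,r}$ is the number of relation-$r$ neighbours of node $i$ and that $\sigma$ is ReLU, neither of which is confirmed on this page. The function name `rgcn_layer` is hypothetical.

```python
import numpy as np

def rgcn_layer(H, adj_by_rel, W_rel, W_self):
    """One RGCN layer.
    H          : (n, d_in) node features h_j
    adj_by_rel : list of (n, n) 0/1 arrays, A[i, j] = 1 if j is in N_r(i)
    W_rel      : list of (d_out, d_in) relation-specific weights W_r
    W_self     : (d_out, d_in) self-loop weight W_0
    """
    H = np.asarray(H, dtype=float)
    out = H @ W_self.T                               # W_0 h_i
    for A, W in zip(adj_by_rel, W_rel):
        A = np.asarray(A, dtype=float)
        deg = A.sum(axis=1, keepdims=True)           # c_{i,r} = |N_r(i)| (assumed)
        norm = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
        out += norm @ (H @ W.T)                      # sum_j (1 / c_{i,r}) W_r h_j
    return np.maximum(out, 0.0)                      # ReLU as sigma (assumed)

# Toy graph: node 0 has neighbours 1 and 2 under a single relation.
H = np.array([[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
A = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])
H_next = rgcn_layer(H, [A], [np.eye(2)], np.eye(2))
```

With identity weights, node 0 receives the mean of its neighbours plus its own feature, and isolated nodes keep their features unchanged.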

read the original abstract

Software vulnerability detection can be formulated as a binary classification problem that determines whether a given code snippet contains security defects. Existing multimodal methods typically fuse Natural Code Sequence (NCS) representations extracted by pretrained models with Code Property Graph (CPG) representations extracted by graph neural networks, under the implicit assumption that introducing an additional modality necessarily yields information gain. Through empirical analysis, we demonstrate the limitations of this assumption: pretrained models already encode substantial structural information implicitly, leading to strong overlap between the two modalities; moreover, graph encoders are generally less effective than pretrained language models in feature extraction. As a result, naive fusion not only struggles to obtain complementary signals but can also dilute effective discriminative cues due to noise propagation. To address these challenges, we propose a task-conditioned complementary fusion strategy that uses Fisher information to quantify task relevance, transforming cross-modal interaction from full-spectrum matching into selective fusion within a task-sensitive subspace. Our theoretical analysis shows that, under an isotropic perturbation assumption, this strategy significantly tightens the upper bound on the output error. Based on this insight, we design the TaCCS-DFA framework, which combines online low-rank Fisher subspace estimation with an adaptive gating mechanism to enable efficient task-oriented fusion. Experiments on the BigVul, Devign, and ReVeal benchmarks demonstrate that TaCCS-DFA delivers up to a 6.3-point gain in F1 score with only a 3.4% increase in inference latency, while maintaining low calibration error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical selective-fusion trick for code vulnerability models that beats naive multimodal baselines on three benchmarks, but the supporting theory hinges on an unverified isotropic assumption that code data likely breaks.

The main takeaway is that TaCCS-DFA uses Fisher information to select task-relevant subspaces when fusing pretrained NCS embeddings with CPG graph features. This avoids the noise that the authors attribute to modality overlap, and they report F1 gains of up to 6.3 points with only 3.4% extra latency on BigVul, Devign, and ReVeal while keeping calibration error low. The online low-rank estimation plus adaptive gating keeps the overhead small, which is a concrete engineering win for anyone running these models in practice. They also make a fair point that pretrained language models already pick up a lot of structural signal, so full-spectrum fusion can dilute rather than help. That framing is useful and matches what many practitioners see.

The soft spot is the theory. The claimed tightening of the output error bound rests on an isotropic perturbation assumption that the abstract and stress-test note flag as unverified. Code embeddings are likely anisotropic because of syntax paths, data flow, and control flow, so the bound may not actually tighten, and the method may amount to an empirical heuristic. The abstract also gives no derivation details, ablation tables, error bars, or split information, which leaves the central claims hard to assess from the summary alone.

This paper is for researchers and engineers working on multimodal code security detectors who need a drop-in fusion improvement. It is worth a serious referee because the empirical results are specific and the method is implementable, even if the theory needs more scrutiny.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes TaCCS-DFA, a Fisher-guided adaptive multimodal fusion framework for software vulnerability detection. It argues that pretrained language models already capture substantial structural information, leading to overlap with Code Property Graph representations, and that naive fusion can introduce noise. The method uses Fisher information to perform selective fusion in a task-sensitive subspace and claims that, under an isotropic perturbation assumption, this tightens the upper bound on output error. Experiments on BigVul, Devign, and ReVeal show an F1 improvement of up to 6.3 points with a 3.4% increase in inference latency.

Significance. If the empirical gains are robust and the theoretical bound holds after verification, the work could advance multimodal code analysis by demonstrating how task-conditioned selective fusion avoids noise dilution while preserving low inference overhead. The practical emphasis on calibration error and latency makes the result relevant for deployable vulnerability detectors.

major comments (2)

[Abstract and §4] The theoretical claim that Fisher-guided selective fusion tightens the output error upper bound is derived under an isotropic perturbation assumption on the joint NCS+CPG feature space. Structured code data induces anisotropic variance along syntax, data-flow, and control-flow directions, so the covariance is unlikely to be scalar; when the assumption fails the derived bound does not tighten independently of the fitted subspace and the explanatory mechanism reduces to an unverified heuristic.
[Abstract] The central empirical claim of a 6.3-point F1 gain (with low calibration error) is stated without reported data splits, ablation controls, or error bars. Because the soundness of the result rests on these unverified experimental steps, it is impossible to determine whether the observed improvement is attributable to the selective-fusion mechanism or to other factors.
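
The first comment's point about subspace dependence can be checked numerically. For a perturbation with covariance `Sigma` and an orthonormal basis `U`, the expected noise energy that survives projection is `trace(U^T Sigma U)`. The sketch below uses made-up covariance values purely for illustration; they are not taken from the paper.

```python
import numpy as np

def noise_energy_in_subspace(U, Sigma):
    """E||U U^T eps||^2 for eps ~ N(0, Sigma) and orthonormal U."""
    return float(np.trace(U.T @ Sigma @ U))

d = 6
I = np.eye(d)
U_a, U_b = I[:, :2], I[:, 4:]          # two different rank-2 subspaces

iso = 0.5 * np.eye(d)                   # isotropic: scalar covariance
aniso = np.diag([4.0, 2.0, 1.0, 0.5, 0.1, 0.1])  # hypothetical anisotropic case

iso_a, iso_b = noise_energy_in_subspace(U_a, iso), noise_energy_in_subspace(U_b, iso)
aniso_a, aniso_b = noise_energy_in_subspace(U_a, aniso), noise_energy_in_subspace(U_b, aniso)
```

Under the isotropic covariance both subspaces admit the same noise energy (`iso_a == iso_b == 1.0`), so the reduction depends only on rank. Under the anisotropic covariance the two subspaces admit 6.0 and 0.2 respectively, so any bound then depends on which subspace the Fisher estimate selects.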

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses to the major comments and indicate the revisions we will make to strengthen the manuscript.

Referee: [Abstract and §4] The theoretical claim that Fisher-guided selective fusion tightens the output error upper bound is derived under an isotropic perturbation assumption on the joint NCS+CPG feature space. Structured code data induces anisotropic variance along syntax, data-flow, and control-flow directions, so the covariance is unlikely to be scalar; when the assumption fails the derived bound does not tighten independently of the fitted subspace and the explanatory mechanism reduces to an unverified heuristic.

Authors: We appreciate the referee's observation on the isotropic perturbation assumption. This assumption simplifies the derivation of the error bound to highlight how selective fusion in the task-sensitive subspace can reduce the upper bound on output error. Although code data may exhibit anisotropic characteristics, the Fisher-guided approach still provides a practical mechanism for noise reduction, as evidenced by our consistent empirical gains. In the revised manuscript, we will include a more detailed discussion of the assumption's limitations and its implications for code-specific data, along with additional empirical validation of the bound's tightness.
revision: partial

Referee: [Abstract] The central empirical claim of a 6.3-point F1 gain (with low calibration error) is stated without reported data splits, ablation controls, or error bars. Because the soundness of the result rests on these unverified experimental steps, it is impossible to determine whether the observed improvement is attributable to the selective-fusion mechanism or to other factors.

Authors: The manuscript details the experimental setup in Sections 4 and 5, including the use of standard data splits from the respective benchmarks (BigVul, Devign, ReVeal), comprehensive ablation studies comparing TaCCS-DFA against naive fusion and other multimodal baselines, and error bars computed over multiple runs. The reported 6.3-point F1 improvement is the peak gain observed, with detailed per-dataset results and statistical significance provided in the tables. We will revise the abstract to explicitly reference these experimental controls and direct readers to the relevant sections for full details on splits, ablations, and variance.
revision: yes

Circularity Check

0 steps flagged

No circularity: theory is conditional on explicit assumption; empirical results independent

full rationale

The paper's derivation chain states the isotropic perturbation assumption upfront in the abstract and claims the error-bound tightening only under that assumption. No equations or steps are shown reducing the bound to a fitted Fisher subspace or data-dependent quantity by construction. No self-citations, self-definitional loops, or renamed known results appear in the provided text. The 6.3-point F1 gains are reported from benchmark experiments (BigVul, Devign, ReVeal) separate from the theory, making the central claims self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the isotropic perturbation assumption for the error-bound proof and on the empirical claim that pretrained models already capture most structural signals; no new physical entities are introduced.

axioms (1)

domain assumption: Isotropic perturbation assumption for tightening the output error bound.
Invoked in the theoretical analysis to show that selective fusion reduces error compared with full fusion.

pith-pipeline@v0.9.0 · 5574 in / 1138 out tokens · 67426 ms · 2026-05-16T18:21:12.477284+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Our theoretical analysis shows that, under an isotropic perturbation assumption, this strategy significantly tightens the upper bound on the output error... Theorem 3.1 (Tightness of the DFA Perturbation Bound)
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

Fisher Information Matrix (FIM) quantifies the sensitivity of classification decisions to feature perturbations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1] Shun-Ichi Amari. 1998. Natural gradient works efficiently in learning. Neural computation 10, 2 (1998), 251–276
[2] Shun-ichi Amari. 2019. Fisher Information and Natural Gradient Learning in Random Deep Networks. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (AISTATS 2019), Vol. 89. 1060–1068
[3] Suchetan Chakraborty, Weilin Chen, Yu Liu, Min Guo, Neeraj Suri, Da Da, Fabian Yamaguchi, and Xiaoyong Huo. Deep Learning Based Vulnerability Detection: Are We There Yet? arXiv preprint arXiv:2009.07235 (2020)
[4] Cyber Safety Review Board. 2022. Review of the December 2021 Log4j Event. Technical Report. U.S. Department of Homeland Security. https://www.cisa.gov/sites/default/files/publications/CSRB-Report-on-Log4-July-11-2022_508.pdf
[5] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624 (2024)
[6] Zakir Durumeric, James Kasten, David Adrian, J. Alex Halderman, Michael Bailey, Frank Li, Nicholas Weaver, Johanna Amann, Jethro Beekman, Mathias Payer, and Vern Paxson. 2014. The Matter of Heartbleed. In Proceedings of the 2014 Conference on Internet Measurement Conference (IMC). ACM
[7] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. A C/C++ code vulnerability dataset with code changes and CVE summaries. In Proceedings of the 17th international conference on mining software repositories. 508–512
[8] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1536–1547
[9] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. 2021. Regularisation of neural networks by enforcing Lipschitz continuity. Machine Learning 110 (2021), 393–416. doi:10.1007/s10994-020-05929-w
[10] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of ICML. 1321–1330
[11] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020)
[12] Donald Olding Hebb. 1949. The organization of behavior: A neuropsychological theory. Wiley, New York
[13] Matthias Hein and Maksym Andriushchenko. 2017. Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. In Advances in Neural Information Processing Systems, Vol. 30. 2266–2276
[14] Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. 2019. Universal statistics of fisher information in deep neural networks: Mean field approach. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 1032–1041
[15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114, 13 (2017), 3521–3526
[16] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of Neural Network Representations Revisited. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 3519–3529
[17] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2021. Sysevr: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing 19, 4 (2021), 2244–2258
[18] Xin Liang. 2023. On the optimality of the Oja’s algorithm for online PCA. Statistics and Computing 33, 3 (2023), 62
[19] Ruitong Liu, Yanbin Wang, Haitao Xu, Jianguo Sun, Fan Zhang, Peiyue Li, and Zhenhao Guo. 2025. Vul-LMGNNs: Fusing language models and online-distilled graph neural networks for code vulnerability detection. Information Fusion 115 (2025), 102748
[20] Gary McGraw. 2006. Software Security: Building Security in. Addison-Wesley Professional. 408 pages
[21] Charles T. Munger. 2005. Poor Charlie’s Almanack: The Wit and Wisdom of Charles T. Munger. Donning Company Publishers, Virginia Beach, VA
[22] NIST National Vulnerability Database. 2014. CVE-2014-0160 Detail (Heartbleed). https://nvd.nist.gov/vuln/detail/CVE-2014-0160
[23] NIST National Vulnerability Database. 2021. CVE-2021-44228 Detail (Log4Shell). https://nvd.nist.gov/vuln/detail/CVE-2021-44228
[24] Erkki Oja. 1982. A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology 15, 3 (1982), 267–273
[25] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
[26] Hippolyt Ritter, Aleksandar Botev, and David Barber. 2018. A scalable laplace approximation for neural networks. In 6th international conference on learning representations, ICLR 2018-conference track proceedings, Vol. 6. International Conference on Representation Learning
[27] Bonan Ruan, Zhiwei Lin, Jiahao Liu, Chuqi Zhang, Kaihang Ji, and Zhenkai Liang. 2025. Propagation-Based Vulnerability Impact Assessment for Software Supply Chains. arXiv:2506.01342 [cs.SE] https://arxiv.org/abs/2506.01342
[28] Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc McConley. 2018. Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, 757–762
[29] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European semantic web conference. Springer, 593–607
[30] Wenxin Tao, Xiaohong Su, Jiayuan Wan, Hongwei Wei, and Weining Zheng. 2023. Vulnerability detection through cross-modal feature enhancement and fusion. Computers & Security 132 (2023), 103341
[31] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. 2018. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. Advances in neural information processing systems 31 (2018)
[32] Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip Yu. 2019. Multi-modal attention network learning for semantic source code retrieval. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 13–25
[33] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th international conference on software engineering. 297–308
[34] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859 (2021)
[35] Fabian Yamaguchi, Niklas Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy. IEEE, 590–604
[36] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems 32 (2019)

Vol. 1, No. 1, Article . Publication date: January 2026.

A Proof of Theorem Proof sket...