arxiv: 2604.16309 · v1 · submitted 2026-01-29 · 💻 cs.SE · cs.CR

AgentGuard: A Multi-Agent Framework for Robust Package Confusion Detection via Hybrid Search and Metadata-Content Fusion

Yu Li, Wei Ma, Zhi Chen, Ye Liu, Lingxiao Jiang, Junyi Tao, Hao Liu, Yongqiang Lyu, Qiang Hu

Pith reviewed 2026-05-16 09:55 UTC · model grok-4.3

classification 💻 cs.SE cs.CR

keywords package confusion detection · software supply chain · multi-agent framework · hybrid similarity search · metadata content fusion · false positive reduction · adversarial evasion · open source security

The pith

AgentGuard detects confused packages by fusing metadata and content analysis after hybrid name search in a multi-agent setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentGuard as a multi-agent framework to catch package confusion attacks, where malicious code is published under names that closely resemble legitimate open-source packages. Existing single-signal methods that use only lexical or semantic name matching produce many false positives because, without looking at content, they cannot tell benign packages with similar names apart from malicious impersonations whose code differs substantially from the target. AgentGuard first locates candidate targets through hybrid similarity search on fine-tuned word embeddings, then applies a fused machine learning model that merges multi-dimensional metadata features with a new package content analysis group. Evaluation on the ConfuDB and NeupaneDB datasets shows higher precision and lower false-positive rates than the ConfuGuard and Typomind baselines while also identifying the package being imitated. A reader would care because the approach directly targets a practical supply-chain risk that current tools leave unresolved.

Core claim

AgentGuard is a multi-agent framework that first discovers potential confusion targets using fine-tuned word embedding models with hybrid similarity search and then evaluates risk via a fused machine learning model that combines a multi-dimensional metadata group with a novel package content analysis group, thereby reducing false positives and mitigating adversarial evasion.

What carries the argument

The fused machine learning model that integrates a multi-dimensional metadata group and a novel package content analysis group after hybrid similarity search.
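The first stage, hybrid similarity search over package names, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the paper uses fine-tuned word embeddings, whereas this sketch substitutes character trigram count vectors for the embedding side and Python's `difflib.SequenceMatcher` for the lexical side. The function names, the weight `alpha`, and the toy registry are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def char_ngrams(name: str, n: int = 3) -> Counter:
    """Bag of character n-grams, padded so short names still yield one gram."""
    padded = f"<{name.lower()}>"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_score(query: str, candidate: str, alpha: float = 0.5) -> float:
    """Weighted mix of a lexical score and a vector-space score.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    lexical = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    # Stand-in for the fine-tuned embedding similarity described in the paper.
    vector = cosine(char_ngrams(query), char_ngrams(candidate))
    return alpha * lexical + (1 - alpha) * vector


def find_targets(query: str, popular: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k popular names most similar to the query, excluding the query itself."""
    scored = [(name, hybrid_score(query, name)) for name in popular if name != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy registry; in practice this would be an index of popular package names.
registry = ["requests", "numpy", "pandas", "lodash", "express"]
candidates = find_targets("reqeusts", registry, k=2)
print(candidates)
```

Candidates returned by a search like this would then be passed to the second-stage classifier, which is where the metadata and content features come in.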

If this is right

Precision improves by 12% to 49% relative to ConfuGuard and Typomind on the evaluated datasets.
False-positive rate drops by 11% to 35% while still surfacing the confused package.
The hybrid search plus fused-model pipeline resists simple name-only evasion attempts.
The framework scales to real-world OSS repositories without relying on single-signal retrieval.
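The two headline metrics have standard definitions: precision is TP / (TP + FP) and the false-positive rate is FP / (FP + TN). The sketch below computes both from binary labels and predictions. The class and function names are hypothetical, and the sample labels are made up for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary detector (1 = flagged as confusion)."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def false_positive_rate(self) -> float:
        benign = self.fp + self.tn
        return self.fp / benign if benign else 0.0


def tally(y_true: list[int], y_pred: list[int]) -> Confusion:
    """Count outcomes from parallel lists of ground truth and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return Confusion(tp, fp, tn, fn)


# Made-up labels, purely illustrative.
counts = tally([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(counts.precision, counts.false_positive_rate)
```

Because FPR is normalized by the number of benign packages, a detector can look precise on a balanced benchmark and still generate heavy triage load on a registry where benign packages vastly outnumber malicious ones.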

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Package managers could embed the fused model to flag uploads in real time before they reach users.
The same metadata-plus-content fusion pattern may apply to detecting other supply-chain impersonations such as domain or library-name collisions.
Security teams could reduce manual triage volume by routing only high-risk fused-model scores for review.
Periodic retraining of the content-analysis group on newly published packages would keep the detector current.

Load-bearing premise

The fused model that merges metadata and content signals will reliably separate benign similar-named packages from malicious ones without hidden biases or the need for post-hoc tuning.

What would settle it

Run AgentGuard on a fresh set of adversarial packages engineered to match both names and surface-level content patterns from the ConfuDB dataset and measure whether the reported false-positive rate stays below the baseline levels.

Figures

Figures reproduced from arXiv: 2604.16309 by Hao Liu, Junyi Tao, Lingxiao Jiang, Qiang Hu, Wei Ma, Ye Liu, Yongqiang Lyu, Yu Li, Zhi Chen.

**Figure 1.** The overall architecture and workflow of the AgentGuard system. The Orchestrator Agent coordinates three specialized ...

**Figure 2.** Relationship of F1-score with threshold. The plot ...

**Figure 4.** Target Discovery Rate (TDR@k) comparison of ...

**Figure 5.** SHAP dot plot showing the contribution of all 18 ...
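Figure 2 relates F1-score to the decision threshold. A sweep of that kind can be sketched as follows, assuming the detector emits a risk score in [0, 1] and a package is flagged when its score meets the threshold. The scores, labels, and threshold grid are invented for illustration and do not reproduce the paper's curve.

```python
def f1_at_threshold(scores: list[float], labels: list[int], threshold: float) -> float:
    """F1-score when every sample with score >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores: list[float], labels: list[int], grid: list[float]) -> float:
    """Pick the grid value with the highest F1 (first one wins on ties)."""
    return max(grid, key=lambda t: f1_at_threshold(scores, labels, t))


# Invented scores and labels (1 = confusion attack), purely illustrative.
scores = [0.10, 0.30, 0.45, 0.55, 0.70, 0.90]
labels = [0, 0, 0, 1, 1, 1]
grid = [0.2, 0.5, 0.8]

for t in grid:
    print(t, round(f1_at_threshold(scores, labels, t), 3))
print("best:", best_threshold(scores, labels, grid))
```

Selecting the threshold on the same data used to report final metrics would inflate the results, so a sweep like this belongs on a held-out validation split.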

Original abstract

The proliferation of open-source software (OSS) has made software supply chains prime targets for attacks like Package Confusion, where adversaries publish malicious packages with names deceptively similar to legitimate ones. To protect against such attacks and safeguard the use of OSS, multiple confusion detection methods have been proposed. However, existing methods are limited to single-signal retrieval strategies (relying solely on lexical or semantic metrics), struggle with high false positive rates (FPR), and are vulnerable to adversarial evasion. Critically, as content-agnostic approaches, they fundamentally fail to distinguish benign packages with high naming similarity from malicious, code-dissimilar impersonations, leading to persistent high FPR. To address these limitations, we introduce AgentGuard, a novel multi-agents based framework for package confusion detection. Specifically, it first discovers potential confusion targets using fine-tuned word embedding models with hybrid similarity search. It subsequently evaluates risk via a fused machine learning model that uniquely combines: (1) a multi-dimensional metadata group and (2) a novel package content analysis group, to reduce the FPR and mitigate the impact of adversarial evasion. To assess the effectiveness of AgentGuard, we evaluate it on challenging ConfuDB and NeupaneDB datasets. Our results demonstrate that AgentGuard significantly outperforms state-of-the-art baselines, ConfuGuard and Typomind, improving precision by 12%-49% while simultaneously reducing the FPR by 11%-35%, and effectively discovers the confused package.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AgentGuard's multi-agent fusion of hybrid search with metadata-content ML targets a real OSS supply-chain gap and reports clear gains on public datasets, but the abstract leaves the fusion's contribution unproven.

The main point is that AgentGuard adds a two-stage multi-agent setup: fine-tuned embeddings plus hybrid lexical-semantic search to surface candidate confusions, followed by an ML model that merges multi-dimensional metadata with a new content-analysis group to cut false positives and resist evasion. That combination is the actual novelty over the single-signal baselines it cites. It evaluates on ConfuDB and NeupaneDB and states 12-49% precision lifts and 11-35% FPR drops versus ConfuGuard and Typomind, which is the kind of concrete, externally verifiable claim that matters for this domain. The practical framing around real package registries and supply-chain attacks is also useful; the problem is well-motivated and the datasets are public.

The soft spot is exactly what the stress-test flags: no ablation results appear in the abstract to show that the fused model, rather than the upstream search step alone, produces the reported gains. Training procedure, feature definitions, cross-validation, and error analysis are also missing, so it is impossible to judge whether the FPR reductions hold up or are dataset artifacts. If the full paper supplies those runs and details, the claims become much stronger; without them the numbers stay hard to trust.

This is work for people building or evaluating package scanners and supply-chain tools. A reader who needs a new architecture to test against existing detectors will get value from the high-level design and the dataset results. It deserves a serious referee because the idea is grounded in a documented weakness, the evaluation uses external data, and the performance deltas are large enough to be worth checking, even if the methods section will need expansion and ablations before acceptance.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgentGuard, a multi-agent framework for detecting package confusion in open-source software supply chains. It first identifies candidate confusing packages via fine-tuned word embeddings and hybrid similarity search, then applies a fused machine learning model that combines a multi-dimensional metadata group with a novel package content analysis group to assess risk, lower false positive rates, and resist adversarial evasion. Evaluation on the ConfuDB and NeupaneDB datasets reports that AgentGuard outperforms baselines ConfuGuard and Typomind, with precision gains of 12%-49% and FPR reductions of 11%-35%.

Significance. If the performance claims are substantiated, the work could meaningfully advance software supply chain security by moving beyond single-signal lexical or semantic methods to a hybrid metadata-content approach that directly targets the content-agnostic limitations of prior detectors. The multi-agent structure and explicit content analysis group offer a concrete path toward lower-FPR, more evasion-resistant detection, which would be valuable for package registries and dependency tools.

major comments (3)

[Evaluation] Evaluation section: the headline claim that the fused metadata-content model drives the 12%-49% precision lift and 11%-35% FPR drop is unsupported without ablation results (metadata-only, content-only, hybrid-search-only, and full-fusion runs). The reported gains could be attributable to the upstream fine-tuned embedding step alone.
[Methodology] Methodology section: no training details, feature definitions, hyper-parameter settings, or cross-validation procedure are supplied for the fused ML model or the fine-tuned word embeddings. Without these, the precision and FPR numbers cannot be reproduced or verified.
[§3] §3 (or equivalent): the multi-agent framework is described at a high level but lacks concrete specification of agent roles, communication protocol, and decision aggregation, which are load-bearing for the claimed robustness against evasion.
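The ablation requested in the first major comment can be organized as a small harness that evaluates the same candidates under each configuration. The sketch below is only an illustration of that structure: the feature names, the averaging "model", the threshold, and the four toy samples are all hypothetical, and the numbers it prints say nothing about AgentGuard's actual behavior.

```python
from itertools import chain

# Hypothetical feature names; higher values mean "more suspicious".
FEATURE_GROUPS = {
    "metadata": ["maintainer_mismatch", "recent_upload"],
    "content": ["code_dissimilarity"],
}

# The four runs listed in the referee comment.
ABLATIONS = {
    "hybrid-search-only": [],  # no classifier: every retrieved candidate is flagged
    "metadata-only": ["metadata"],
    "content-only": ["content"],
    "full-fusion": ["metadata", "content"],
}


def risk_score(sample: dict, groups: list[str]) -> float:
    """Average the selected features; a stand-in for the fused ML model."""
    names = list(chain.from_iterable(FEATURE_GROUPS[g] for g in groups))
    if not names:
        return 1.0
    return sum(sample[n] for n in names) / len(names)


def evaluate(samples: list[dict], groups: list[str], threshold: float = 0.5) -> dict:
    """Precision and FPR for one configuration."""
    tp = fp = tn = 0
    for s in samples:
        flagged = risk_score(s, groups) >= threshold
        if flagged and s["label"] == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif s["label"] == 0:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "fpr": fpr}


def run_ablation(samples: list[dict]) -> dict:
    return {name: evaluate(samples, groups) for name, groups in ABLATIONS.items()}


# Toy candidates returned by the search stage (label 1 = real confusion attack).
samples = [
    {"maintainer_mismatch": 0.9, "recent_upload": 0.8, "code_dissimilarity": 0.9, "label": 1},
    {"maintainer_mismatch": 0.4, "recent_upload": 0.4, "code_dissimilarity": 0.9, "label": 1},
    {"maintainer_mismatch": 0.7, "recent_upload": 0.6, "code_dissimilarity": 0.1, "label": 0},
    {"maintainer_mismatch": 0.1, "recent_upload": 0.2, "code_dissimilarity": 0.2, "label": 0},
]
results = run_ablation(samples)
for name, metrics in results.items():
    print(name, metrics)
```

Holding the candidate set fixed across all four runs is the point of the design: any difference in precision or FPR can then be attributed to the feature groups rather than to the retrieval step.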

minor comments (2)

[Abstract] The abstract and introduction use the term 'multi-agents based' inconsistently; standardize to 'multi-agent'.
[Evaluation] Dataset descriptions for ConfuDB and NeupaneDB should include size, label distribution, and any preprocessing steps applied before evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from additional ablation studies, expanded methodological details, and more concrete specifications of the multi-agent components. We will incorporate these revisions to strengthen the paper's reproducibility and clarity.

Referee: [Evaluation] Evaluation section: the headline claim that the fused metadata-content model drives the 12%-49% precision lift and 11%-35% FPR drop is unsupported without ablation results (metadata-only, content-only, hybrid-search-only, and full-fusion runs). The reported gains could be attributable to the upstream fine-tuned embedding step alone.

Authors: We agree that ablation studies are required to isolate the contribution of the fused metadata-content model. In the revised manuscript we will add a dedicated ablation subsection reporting precision and FPR for (1) metadata-only, (2) content-only, (3) hybrid-search-only, and (4) the full fusion configuration on both ConfuDB and NeupaneDB. These results will demonstrate that the reported gains are not solely attributable to the embedding step.

revision: yes

Referee: [Methodology] Methodology section: no training details, feature definitions, hyper-parameter settings, or cross-validation procedure are supplied for the fused ML model or the fine-tuned word embeddings. Without these, the precision and FPR numbers cannot be reproduced or verified.

Authors: We acknowledge the omission of implementation details. The revised manuscript will include a new subsection that fully specifies: the training corpus and procedure for the fine-tuned word embeddings, the exact feature definitions for the metadata and content groups, all hyper-parameter values and selection method, the loss function, optimizer, and the cross-validation protocol (including fold count and stratification). This will enable full reproducibility of the reported metrics.

revision: yes

Referee: [§3] §3 (or equivalent): the multi-agent framework is described at a high level but lacks concrete specification of agent roles, communication protocol, and decision aggregation, which are load-bearing for the claimed robustness against evasion.

Authors: We will expand the description of the multi-agent framework in §3 (and add an accompanying figure and pseudocode listing). The revision will explicitly define each agent's role, the message format and protocol used for inter-agent communication, and the decision-aggregation rule (including how metadata and content signals are fused). These additions will clarify how the architecture contributes to evasion resistance.

revision: yes
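The second response promises a cross-validation protocol with a stated fold count and stratification. As a point of reference for what that entails, here is a minimal stratified k-fold split in plain Python. The function names, fold count, seed, and class balance are assumptions for illustration; the paper's actual protocol is not given in the material above.

```python
import random
from collections import defaultdict


def stratified_folds(labels: list[int], k: int = 5, seed: int = 0) -> list[list[int]]:
    """Assign every sample index to one of k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels: list[int], k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Imbalanced toy labels: 10 attacks among 50 packages.
toy_labels = [1] * 10 + [0] * 40
folds = stratified_folds(toy_labels, k=5)
for train, test in train_test_splits(toy_labels, k=5):
    print(len(train), len(test), sum(toy_labels[i] for i in test))
```

Stratification matters here because confusion attacks are rare relative to benign look-alikes; without it, a fold could contain almost no positives, and per-fold precision would be unstable.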

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external datasets

The paper introduces an empirical multi-agent framework evaluated on the external ConfuDB and NeupaneDB datasets with comparisons to baselines ConfuGuard and Typomind. No equations, derivations, or first-principles predictions appear in the provided text. Performance improvements are reported as measured outcomes rather than quantities forced by construction from fitted parameters or self-referential definitions. No load-bearing self-citations to prior author work are invoked to justify uniqueness or forbid alternatives. The skeptic concern about missing ablations addresses experimental completeness, not circular reduction of any derivation chain to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the unverified effectiveness of the hybrid search step and the fusion model; no independent evidence or parameter-free derivation is supplied in the abstract.

free parameters (2)

fine-tuned word embedding parameters
Fine-tuning implies parameters learned from data to support the hybrid similarity search.
fused ML model parameters and weights
The machine learning model that combines metadata and content groups requires fitted parameters.

axioms (2)

[domain assumption] Hybrid similarity search on fine-tuned embeddings discovers relevant confusion targets
Invoked as the first stage of the framework.
[domain assumption] Fused metadata and content analysis reduces FPR and mitigates adversarial evasion
Central premise of the risk evaluation stage.
