SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

Amit Goyal; Sudheer Tubati

arXiv: 2605.20157 · v1 · pith:7OKKN5T2new · submitted 2026-05-19 · 💻 cs.LG · cs.CR · cs.IR

Pith reviewed 2026-05-20 06:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · cs.IR

keywords fraud detection · positive-unlabeled learning · negative harvesting · gating ensemble · stratified sampling · music streaming · Mahalanobis distance · k-NN density

The pith

SAGE harvests confident negatives from unlabeled data using stratified sampling and a voting ensemble of statistical gates for fraud detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SAGE tackles representation bias in positive-unlabeled learning for music streaming fraud detection, where legitimate behaviors like super-fan activity closely resemble fraud. It applies SimHash-based stratified sampling with a floor constraint to cover rare behavioral cohorts, then uses a modular ensemble of statistical gates such as Mahalanobis distance and k-NN density with configurable voting thresholds to select confident negatives. This setup enables training effective fraud models without full labels. Strong precision and recall on held-out data are reported, and the method works across customer-level and artist-level fraud without core changes.
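The sampling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a sign-of-random-projection SimHash over numeric feature vectors, and the function names `simhash_bucket` and `stratified_sample_with_floor`, as well as the parameters `n_bits`, `rate`, and `floor`, are hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def simhash_bucket(vec, hyperplanes):
    """Return an integer bucket id: one sign bit per random hyperplane."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def stratified_sample_with_floor(points, n_bits=4, rate=0.1, floor=5, seed=0):
    """Sample indices per SimHash bucket, taking at least `floor` from each.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    # Group point indices by their SimHash signature (the behavioral cohort).
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[simhash_bucket(point, hyperplanes)].append(idx)

    sampled = []
    for key in sorted(buckets):
        members = buckets[key]
        # Proportional quota, raised to the floor, capped at the bucket size.
        quota = min(len(members), max(floor, round(rate * len(members))))
        sampled.extend(rng.sample(members, quota))
    return sampled, dict(buckets)
```

The floor only matters for small buckets: a large cohort is sampled at roughly `rate`, while a rare cohort that would otherwise contribute zero or one point still contributes up to `floor` points.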

Core claim

The paper claims that integrating SimHash-based stratified sampling under floor constraints with a pluggable gating ensemble of Mahalanobis distance and k-NN density gates, controlled by voting thresholds, reliably identifies confident negative samples from unlabeled data, directly addressing representation bias and supporting high-performing fraud detectors that generalize across domains.

What carries the argument

Modular gating ensemble with pluggable statistical gates (Mahalanobis distance and k-NN density) plus voting thresholds, paired with floor-constrained SimHash stratified sampling for cohort coverage.
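A gating ensemble of this shape can be sketched as a list of callables that each cast a yes/no vote. The sketch below rests on assumptions the page does not confirm: each gate is fitted on known positives (fraud) and votes "confident negative" when a point lies far from them, the k-NN density gate is approximated by the distance to the k-th nearest positive, and all names (`mahalanobis_gate`, `knn_gate`, `harvest`), thresholds, and toy points are hypothetical.

```python
import math


def mean_and_cov(rows):
    """Sample mean vector and (n-1)-normalized covariance matrix."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mahalanobis_gate(positives, threshold):
    """Vote 'negative' when x is far from the positive (fraud) distribution."""
    mu, cov = mean_and_cov(positives)

    def gate(x):
        diff = [xi - mi for xi, mi in zip(x, mu)]
        z = solve(cov, diff)  # z = cov^{-1} diff without forming the inverse
        return math.sqrt(sum(d * zi for d, zi in zip(diff, z))) > threshold

    return gate


def knn_gate(positives, k, threshold):
    """Vote 'negative' when the k-th nearest positive is far (low density)."""

    def gate(x):
        return sorted(math.dist(x, p) for p in positives)[k - 1] > threshold

    return gate


def harvest(unlabeled, gates, min_votes):
    """Keep points that at least `min_votes` gates call confident negatives."""
    return [x for x in unlabeled if sum(g(x) for g in gates) >= min_votes]


# Toy 2-D data, invented for illustration only.
positives = [(5.0, 5.0), (5.5, 5.2), (4.6, 5.3), (5.2, 4.5), (4.8, 4.9), (5.3, 5.4)]
unlabeled = [(0.0, 0.0), (5.1, 5.0), (6.5, 5.05)]
gates = [mahalanobis_gate(positives, threshold=6.0), knn_gate(positives, k=2, threshold=1.0)]
strict = harvest(unlabeled, gates, min_votes=2)   # both gates must agree
lenient = harvest(unlabeled, gates, min_votes=1)  # one gate is enough
```

Because gates are plain callables, adding another statistic means appending one more function to `gates`; the voting rule itself does not change.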

If this is right

Strong precision and recall are achieved on held-out fraud detection data.
The method generalizes to both customer-level and artist-level fraud without changes to the core approach.
Voting thresholds enable flexible precision-recall trade-offs as needed for different applications.
Floor-constrained sampling ensures coverage of rare behavioral cohorts and reduces representation bias in PU learning.
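The trade-off controlled by the voting threshold can be made concrete by counting how many candidates survive at each setting. The helper below is a hypothetical sketch (the name `sweep_vote_thresholds` and the toy vote rows are assumptions, not values from the paper): raising the minimum number of agreeing gates can only shrink the harvested set, trading volume for confidence.

```python
def sweep_vote_thresholds(votes):
    """Count harvested candidates for each minimum-vote setting.

    `votes` is a list of rows; each row holds one 0/1 vote per gate.
    """
    n_gates = len(votes[0])
    return {
        min_votes: sum(1 for row in votes if sum(row) >= min_votes)
        for min_votes in range(1, n_gates + 1)
    }


# Toy votes from two hypothetical gates for five unlabeled candidates.
toy_votes = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
counts = sweep_vote_thresholds(toy_votes)
```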

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gating and sampling technique could transfer to other positive-unlabeled settings such as anomaly detection in security or finance.
Expanding the set of pluggable gates with domain-specific statistics might improve handling of new edge cases.
The emphasis on cohort coverage may lead to more robust models that perform consistently across varying data distributions.

Load-bearing premise

The statistical gates using Mahalanobis distance and k-NN density combined with voting thresholds can separate confident negatives from the unlabeled pool even when legitimate edge cases closely mimic fraud patterns.

What would settle it

A held-out test comparing fraud detection precision and recall of a model trained on SAGE-selected negatives versus one trained on random unlabeled samples or alternative selection methods, with ground-truth labels available.
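The comparison described above reduces to scoring each candidate model against the same ground-truth labels. The following is a minimal sketch of that bookkeeping, assuming binary labels with fraud encoded as 1; the function names `precision_recall` and `compare_models` are hypothetical, and no numbers from the paper are implied.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def compare_models(y_true, predictions_by_model):
    """Score several models' predictions against one held-out label set."""
    return {
        name: precision_recall(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
```

In use, `predictions_by_model` would hold one prediction list per training regime, for example a model trained on gate-selected negatives and one trained on randomly drawn unlabeled samples, both evaluated on the same labeled hold-out.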

Figures

Figures reproduced from arXiv: 2605.20157 by Amit Goyal, Sudheer Tubati.

Figure 1: SAGE - SimHash stratification with floor

Original abstract

Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAGE is a practical assembly of SimHash sampling and statistical gates for negative harvesting in music streaming fraud, but the evaluation details are too thin to judge how well it handles edge cases.

The main point is that SAGE layers floor-constrained SimHash sampling with a voting ensemble of Mahalanobis distance and k-NN density gates to pull confident negatives from unlabeled data. It targets the real issue in positive-unlabeled fraud detection where legitimate patterns like super-fans or sleep sessions overlap with fraud signals, and the floor constraint aims to keep rare cohorts represented in the sampled pool. The modular setup with adjustable voting thresholds lets users tune for precision or recall as needed, and the same core method reportedly works for both customer-level and artist-level fraud without changes. That generalization is a practical strength for deployment.

Most components are established techniques, so the contribution is in the specific integration for this domain rather than a new theoretical primitive. The floor constraint on sampling is a straightforward way to reduce representation bias, and the pluggable gates give flexibility that could be useful in production pipelines.

The soft spot is the evaluation. The description claims strong precision and recall on held-out data, yet supplies no actual numbers, baselines, error bars, or details on how the test set was built or labeled. Without those, it is difficult to assess whether the gates reliably separate mimics or whether the results hold under different conditions. The assumption that distance-based filters will catch confident negatives even in close edge cases is reasonable but untested in the visible text.

This paper is aimed at engineers and applied researchers working on fraud systems in streaming platforms or similar high-volume unlabeled settings. A reader building production PU learning pipelines might pick up the sampling and gating template. I would bring it to a reading group for the operational focus. It deserves peer review once the results and any ablation on the gates are added with concrete metrics.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SAGE, a counterfactual-aware negative harvesting method for fraud detection in music streaming. It combines SimHash-based stratified sampling with floor constraints to ensure coverage of rare behavioral cohorts and address representation bias in Positive-Unlabeled learning, together with a modular ensemble of pluggable statistical gates (Mahalanobis distance and k-NN density) controlled by configurable voting thresholds. The approach is presented as generalizing across customer-level and artist-level fraud without core changes, with evaluation claimed to show strong precision and recall on held-out data.

Significance. If the performance claims are substantiated, the work could contribute a practical, scalable architecture for confident negative selection in PU-learning settings for fraud detection, with the modular gating and floor-constrained sampling offering adaptability to different domains. The absence of quantitative results, baselines, and experimental details in the current text, however, prevents a clear assessment of its advance over existing methods.

major comments (2)

[Abstract] Abstract: the claim that the method achieves 'strong precision and recall on held-out data' is unsupported by any numerical values, baselines, error bars, or description of how the held-out set was constructed; this directly undermines verification of the central effectiveness claim.
[Evaluation] Evaluation section: no quantitative metrics, statistical significance tests, or comparisons to standard PU-learning negative-sampling baselines are reported, leaving the generalization claim across fraud domains without empirical grounding.

minor comments (1)

[Methodology] The description of floor constraints and voting thresholds would benefit from explicit ranges or default values used in the experiments to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to better substantiate the empirical claims in our work. We address the major comments point by point below.

Referee: [Abstract] Abstract: the claim that the method achieves 'strong precision and recall on held-out data' is unsupported by any numerical values, baselines, error bars, or description of how the held-out set was constructed; this directly undermines verification of the central effectiveness claim.

Authors: We agree that the abstract's qualitative statement requires concrete support. In the revision we will replace the general claim with specific precision and recall values (including error bars where applicable), a concise description of the held-out set construction, and reference to the baselines against which these figures were obtained. This will allow readers to directly assess the effectiveness claim.

Revision: yes

Referee: [Evaluation] Evaluation section: no quantitative metrics, statistical significance tests, or comparisons to standard PU-learning negative-sampling baselines are reported, leaving the generalization claim across fraud domains without empirical grounding.

Authors: This observation is correct for the current text. We will expand the Evaluation section to report full quantitative metrics, include statistical significance tests, and add explicit comparisons to standard PU-learning negative-sampling baselines. These additions will also provide the empirical grounding for the generalization statement across customer-level and artist-level fraud settings.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents SAGE as a novel architecture combining SimHash-based stratified sampling with a pluggable ensemble of statistical gates (Mahalanobis and k-NN) and voting thresholds for harvesting confident negatives in positive-unlabeled fraud detection. The central claims rest on the proposed methodology for addressing representation bias via floor-constrained sampling and adaptive precision-recall trade-offs, with evaluation on held-out data. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the approach is described as generalizable across domains without modification, and the derivation is self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is inferred from described components. The approach assumes unlabeled data contains sufficient confident negatives and that the chosen statistical gates are appropriate for the fraud domain.

free parameters (2)

voting thresholds: Configurable thresholds for the ensemble gates that control the precision-recall trade-off; values not specified.
floor constraints in sampling: Minimum sampling rates per behavioral cohort to ensure coverage; exact floors not given.
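Since neither parameter has a published value, any implementation has to expose them explicitly. One way to do that is a small validated configuration object, sketched below; the class name `HarvestConfig` and its fields are hypothetical, and the sketch assumes the floor is expressed as a per-cohort sample count rather than a rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarvestConfig:
    """Hypothetical container for the two free parameters listed above."""

    n_gates: int            # number of statistical gates in the ensemble
    min_votes: int          # voting threshold: gates that must agree
    floor_per_cohort: int   # minimum samples drawn from each cohort

    def __post_init__(self):
        # Reject settings that could never be satisfied or make no sense.
        if not 1 <= self.min_votes <= self.n_gates:
            raise ValueError("min_votes must lie in [1, n_gates]")
        if self.floor_per_cohort < 0:
            raise ValueError("floor_per_cohort must be non-negative")
```

Recording the chosen values in one place like this would also address the reproducibility concern about unreported thresholds and floors.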

axioms (1)

domain assumption: Unlabeled data contains a sufficient number of confident negative examples that can be identified by statistical distance and density measures. Central to the negative harvesting claim; invoked when describing confident negative identification.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SAGE combines SimHash-based stratified sampling with a modular gating ensemble (Mahalanobis distance and k-NN density) with configurable voting thresholds for confident negative harvesting from unlabeled data.

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: floor-constrained sampling ensures minimum representation per behavioral stratum

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1] Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Lies Hadjadj, Emilie Devijver, and Yury Maximov. 2024. Self-Training: A Survey. arXiv:2202.12040 [cs.LG] https://arxiv.org/abs/2202.12040
[2] Jessa Bekker and Jesse Davis. 2020. Learning from positive and unlabeled data: A survey. Machine Learning 109, 4 (2020), 719–760
[3] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 93–104
[4] Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. ACM, New York, NY, USA, 380–388
[5–6] Guangxin Chen, Fangqing Ye, Zuoyong Tian, Xuemin Zhu, and Qingming Huang. 2021. Positive-Unlabeled Learning from Imbalanced Data. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21). IJCAI, Montreal, Canada, 2995–3001. doi:10.24963/ijcai.2021/412
[7] CNM. 2023. Streaming fraud accounts for at least 1-3% of plays on services like Spotify and Deezer in France, shows investigation. SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection WSDM Companion ’26, February 22–26, 2026, Boise, ID, USA https://www.musicbusinessworldwide.com/streaming-fraud-accounts-for-at-lea...
[8] Andrea Dal Pozzolo, Olivier Caelen, Reid A Johnson, and Gianluca Bontempi. 2014. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications 41, 10 (2014), 4915–4928
[9] Thomas G Dietterich. 2000. Ensemble methods in machine learning. Multiple Classifier Systems 1857 (2000), 1–15
[10] Charles Elkan and Keith Noto. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 213–220
[11] Soheil Esmaeilzadeh, Negin Salajegheh, Amir Ziai, and Jeff Boote. 2022. Abuse and Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning. (2022). arXiv:2203.02124 [cs.LG]
[12] Jonas Herskind Sejr, Thorbjørn Christiansen, Nicolai Dvinge, Dan Hougesen, Peter Schneider-Kamp, and Arthur Zimek. 2021. Outlier Detection with Explanations on Music Streaming Data: A Case Study with Danmark Music Group Ltd. Applied Sciences 11, 5 (2021), 2270. doi:10.3390/app11052270
[13] IFPI. 2025. Global Music Report 2025: Amidst Highly Competitive Market, Global Recorded Music Revenues Grew 4.8% in 2024. https://www.ifpi.org/ifpi-amidst-highly-competitive-market-global-recorded-music-revenues-grew-4-8-in-2024/. Accessed: 2025
[14] Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. ACM, New York, NY, USA, 604–613
[15] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., Red Hook, NY, USA, 3146–3154
[16] Diederik P Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML]
[17] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. 2017. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., Red Hook, NY, USA, 1675–1685
[18] Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365–411
[19] Bing Liu, Wee Sun Lee, Philip S Yu, and Xiaoli Li. 2002. Partially supervised classification of text documents. In Proceedings of the 19th International Conference on Machine Learning (ICML). Morgan Kaufmann, San Francisco, CA, USA, 387–394
[20] Prasanta Chandra Mahalanobis. 1936. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2, 1 (1936), 49–55
[21] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web. ACM, New York, NY, USA, 141–150
[22] Anand Muralidhar, Sharad Chitlangia, Rajat Agarwal, and Muneeb Ahmed. 2023. Real-time detection of robotic traffic in online advertising. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, Washington, DC, USA. doi:10.1609/aaai.v37i13.26844
[23] Music Business Worldwide. 2024. Streaming fraud costs the global music industry $2bn a year, according to Beatdapp. https://www.musicbusinessworldwide.com/streaming-fraud-costs-the-global-music-industry-2bn-a-year-according-to-beatdapp-now-its-partnering-with-beatport-to-combat-the-trend/. Accessed: 2024
[24] Music In Africa. 2024. MLC and Beatdapp join forces to combat streaming fraud. https://www.musicinafrica.net/magazine/mlc-and-beatdapp-join-forces-combat-streaming-fraud. Accessed: 2024
[25] Eric WT Ngai, Yong Hu, Yiu Hing Wong, Yijun Chen, and Xin Sun. 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 50, 3 (2011), 559–569
[26] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 427–438
[27] RIAA. 2024. 2023 Year-End Revenue Statistics. https://www.riaa.com/wp-content/uploads/2024/03/2023-Year-End-Revenue-Statistics.pdf. Accessed: 2024
[28] Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009)
[29] U.S. Department of Justice. 2024. North Carolina Musician Charged in Music Streaming Fraud Aided by Artificial Intelligence. https://www.justice.gov/usao-sdny/pr/north-carolina-musician-charged-music-streaming-fraud-aided-artificial-intelligence. Accessed: 2024
[30] David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Cambridge, Massachusetts, USA, 189–196. doi:10.3115/981658.981684
[31] Show-Jane Yen and Yue-Shi Lee. 2009. Cluster-based under-sampling approaches for imbalanced data distributions. In Expert Systems with Applications, Vol. 36. Elsevier, Amsterdam, Netherlands, 5718–5727
