pith. machine review for the scientific record.

arxiv: 2605.07130 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.DS

Recognition: 2 theorem links

· Lean Theorem

Simple KNN-Based Outlier Detection Achieves Robust Clustering

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3

classification 💻 cs.LG cs.DS
keywords robust k-means · outlier detection · KNN · approximation algorithms · clustering · heuristic · machine learning
0 comments

The pith

Removing points with large K-nearest-neighbor distances reduces robust k-means to standard k-means with a constant-factor approximation guarantee.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a straightforward outlier-removal step based on K-nearest-neighbor distances can solve the robust k-means problem effectively. This matters because robust clustering must cope with outliers without complicating the algorithm or sacrificing accuracy. Under an assumption on the optimal cluster sizes, the method achieves approximation guarantees comparable to existing approaches while remaining simpler: it avoids the common practice of opening extra centers or removing more points than the outlier budget allows. Experiments confirm it performs well on real datasets in both clustering cost and speed.

Core claim

Under a practical assumption on the optimal cluster sizes, simply removing the z points with the largest K-nearest-neighbor distances allows one to solve robust k-means by applying a standard k-means algorithm to the remaining points, yielding a constant-factor approximation without introducing additional centers or discarding extra outliers.
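
To fix ideas, the quantities this claim refers to can be written as follows; the notation is a gloss for this review, not necessarily the paper's exact symbols.

```latex
% Robust k-means (k-means with outliers): discard a set O of z points and pay
% the standard k-means cost of the remaining points against the best k centers.
\[
  \mathrm{OPT}_{k,z}(P) \;=\;
  \min_{\substack{C \subset \mathbb{R}^d,\ |C| = k \\ O \subset P,\ |O| = z}}
  \;\sum_{x \in P \setminus O} \min_{c \in C} \lVert x - c \rVert^{2}
\]
% KNN score used for removal: d_K(x) is the distance from x to its K-th nearest
% neighbor in P. The heuristic deletes the z points with the largest d_K(x) and
% hands the remaining points to any standard k-means algorithm.
```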

What carries the argument

KNN-based outlier detection via large K-nearest-neighbor distances, which performs the reduction from robust k-means to standard k-means.
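
As a reading aid, here is a minimal sketch of that reduction written with scikit-learn; K, k, and z follow the paper's notation, but the code and the default value of K are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the reduction as described: score each point by its K-th nearest-neighbor
# distance, drop the z highest-scoring points, then run plain k-means on the rest.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def knn_removal_then_kmeans(X, k, z, K=10, seed=0):
    # Distance from each point to its K-th nearest neighbor
    # (column 0 of the returned distances is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    knn_dist = dists[:, -1]

    # Treat the z points with the largest KNN distances as outliers and drop them.
    outliers = np.argsort(knn_dist)[-z:]
    keep = np.setdiff1d(np.arange(len(X)), outliers)

    # Standard k-means on the cleaned data; per the paper, this inherits a
    # constant-factor guarantee under the cluster-size assumption.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[keep])
    return km.cluster_centers_, outliers
```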

If this is right

  • Delivers a constant-factor approximation guarantee for robust k-means using only standard k-means on cleaned data.
  • Requires no additional centers or removal of more outliers than specified.
  • Matches or exceeds the clustering cost of more sophisticated algorithms on real-world data.
  • Offers improved runtime efficiency compared to complex robust clustering methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar simple heuristics might work for other robust optimization problems in machine learning.
  • Testing on synthetic data with varying cluster size imbalances could reveal the assumption's necessity.
  • May inspire analysis of why KNN distances correlate with outlier status in clustered data.

Load-bearing premise

The approach depends on the optimal clusters having sizes that meet a practical condition, without which the KNN distances may not accurately identify the outliers.

What would settle it

A counterexample dataset with optimal clusters violating the size assumption where the KNN removal method produces a clustering cost significantly worse than the optimal robust k-means solution.
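
One concrete way to hunt for such a counterexample, sketched below with an entirely synthetic construction of our own (not taken from the paper): plant a genuine but tiny, isolated cluster whose points have larger K-nearest-neighbor distances than the planted outliers, so the removal step deletes the real cluster rather than the outliers.

```python
# Hypothetical stress test (our construction, not the paper's): a tiny, isolated
# but genuine cluster whose points score higher on KNN distance than the outliers.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
big   = rng.normal(0.0, 0.5, size=(1000, 2))                       # one large, dense cluster
small = np.array([[50.0, 0.0], [50.0, 4.0], [50.0, -4.0]])          # tiny, widely spaced cluster
noise = rng.normal(0.0, 0.3, size=(20, 2)) + np.array([6.0, 0.0])   # z = 20 planted outliers in a clump

X = np.vstack([big, small, noise])
K, z = 5, 20

dists, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
removed = set(np.argsort(dists[:, -1])[-z:])

# Indices 1000..1002 are the small cluster; when the size assumption fails, these
# genuine points are scored as outliers while most of the planted clump survives.
print(sorted(i for i in removed if 1000 <= i < 1003))
```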

Figures

Figures reproduced from arXiv: 2605.07130 by Tianle Jiang, Yufa Zhou.

Figure 1
Figure 1: Parameter Sensitivity Cost Comparison. view at source ↗
Figure 2
Figure 2: Parameter Sensitivity Recall Comparison (panels: SKIN-5, SKIN-10, SHUTTLE; x-axis: c parameter). view at source ↗
Figure 3
Figure 3: Parameter Sensitivity Time Comparison. view at source ↗
read the original abstract

Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that, under a practical assumption on the sizes of the optimal clusters, simply removing the z points with the largest K-nearest-neighbor distances yields a constant-factor approximation for the robust k-means problem by reducing it directly to standard k-means (without extra centers or additional outlier discards). It supports the claim with a theoretical argument and shows empirically that the method matches or outperforms several more sophisticated robust clustering algorithms on real-world datasets in terms of clustering cost and runtime.

Significance. If the reduction holds with a reasonable assumption, the result is significant: it shows that a classic, parameter-light outlier-detection heuristic can match the approximation guarantees of prior robust k-means work while simplifying the pipeline. This bridges outlier detection and clustering theory and offers a practical, easy-to-implement baseline. The empirical competitiveness further suggests the approach may be useful even when the assumption is only approximately satisfied.

major comments (2)
  1. [main theoretical result / Assumption on cluster sizes] The central constant-factor reduction (abstract and main theorem) rests on an unstated or only informally described assumption on optimal cluster sizes (likely that each cluster is at least some multiple of z/k in size so that inlier KNN distances are strictly smaller than outlier distances). The manuscript must state this assumption precisely (with explicit constants), derive the resulting approximation factor, and show that the assumption is satisfied in the regimes where the guarantee is claimed; without these details the reduction cannot be verified and may fail for small or unbalanced clusters.
  2. [proof of the reduction] § on the KNN removal argument: it is unclear whether the procedure removes exactly the z outliers or may discard inliers from the smallest optimal cluster or retain some outliers when cluster sizes vary near the threshold. A concrete example or lemma showing that the z-th largest inlier KNN distance is always smaller than the smallest outlier KNN distance under the assumption is needed.
minor comments (2)
  1. [Experiments] The value of K used for the KNN distances should be stated explicitly (or shown to be insensitive) in the experimental section, along with any preprocessing steps applied to the real-world datasets.
  2. [Introduction / Related Work] Notation for the robust k-means objective and the KNN distance should be introduced once and used consistently; a short table comparing the new method to prior approaches (number of centers, extra discards, assumption strength) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. The suggestions for greater precision in the theoretical development are helpful, and we will revise the paper to incorporate them.

read point-by-point responses
  1. Referee: [main theoretical result / Assumption on cluster sizes] The central constant-factor reduction (abstract and main theorem) rests on an unstated or only informally described assumption on optimal cluster sizes (likely that each cluster is at least some multiple of z/k in size so that inlier KNN distances are strictly smaller than outlier distances). The manuscript must state this assumption precisely (with explicit constants), derive the resulting approximation factor, and show that the assumption is satisfied in the regimes where the guarantee is claimed; without these details the reduction cannot be verified and may fail for small or unbalanced clusters.

    Authors: We agree that the assumption on optimal cluster sizes should be stated with full precision, including explicit constants, to allow verification of the constant-factor reduction. In the revised manuscript we will formally define the assumption (each optimal cluster size at least a fixed multiple of z/k sufficient to separate inlier and outlier KNN distances), derive the resulting approximation factor explicitly, and discuss the regimes in which the assumption holds, including the behavior for balanced versus unbalanced clusters. revision: yes

  2. Referee: [proof of the reduction] § on the KNN removal argument: it is unclear whether the procedure removes exactly the z outliers or may discard inliers from the smallest optimal cluster or retain some outliers when cluster sizes vary near the threshold. A concrete example or lemma showing that the z-th largest inlier KNN distance is always smaller than the smallest outlier KNN distance under the assumption is needed.

    Authors: We will add a dedicated lemma establishing that, under the cluster-size assumption, the z-th largest inlier KNN distance is strictly smaller than the smallest outlier KNN distance, thereby proving that the procedure removes exactly the z outliers. The revised section will also contain a concrete numerical example that illustrates the separation, including the case of cluster sizes near the threshold. revision: yes

Circularity Check

0 steps flagged

No circularity: proof under explicit assumption is self-contained

full rationale

The paper states a mathematical proof that, under a stated practical assumption on optimal cluster sizes, KNN-distance outlier removal yields a constant-factor reduction from robust k-Means to standard k-Means without extra centers or discards. No equations, parameters, or steps in the abstract reduce the guarantee to a fitted input, self-definition, or self-citation chain; the assumption is introduced as a precondition rather than derived from the method. The central claim therefore stands as an independent derivation rather than a renaming or tautological mapping of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The load-bearing element is the unstated 'practical assumption on the optimal cluster sizes' required for the constant-factor guarantee; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption practical assumption on the optimal cluster sizes
    Invoked to obtain the constant-factor reduction; its precise statement is not given in the abstract.

pith-pipeline@v0.9.0 · 5501 in / 1115 out tokens · 24584 ms · 2026-05-11T02:27:16.040378+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  1. [1]

    Clustering what matters: Optimal approximation for clustering with outliers

    Akanksha Agrawal, Tanmay Inamdar, Saket Saurabh, and Jie Xue. Clustering what matters: Optimal approximation for clustering with outliers. J. Artif. Intell. Res., 78:143–166, 2023

  2. [2]

    Better guarantees for k-means and euclidean k-median by primal-dual algorithms

    Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. In FOCS, pages 61–72. IEEE Computer Society, 2017

  3. [3]

    k-clustering with fair outliers

    Matteo Almanza, Alessandro Epasto, Alessandro Panconesi, and Giuseppe Re. k-clustering with fair outliers. In WSDM, pages 5–15. ACM, 2022

  4. [4]

    Approximation schemes for euclidean k-medians and related problems

    Sanjeev Arora, Prabhakar Raghavan, and Satish Rao. Approximation schemes for euclidean k-medians and related problems. In STOC, pages 106–113. ACM, 1998

  5. [5]

    k-means++: the advantages of careful seeding

    David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007

  6. [6]

    Greedy sampling for approximate clustering in the presence of outliers

    Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019

  7. [7]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Ori...

  8. [8]

    Anomaly detection: A survey

    Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, 2009

  9. [9]

    A constant-factor approximation algorithm for the k-median problem (extended abstract)

    Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In STOC, pages 1–10. ACM, 1999

  10. [10]

    Algorithms for facility location problems with outliers

    Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001

  11. [11]

    An improved greedy approximation for (metric) k-means

    Moses Charikar, Vincent Cohen-Addad, Ruiquan Gao, Fabrizio Grandoni, Euiwoong Lee, and Ernest van Wijland. An improved greedy approximation for (metric) k-means. In FOCS, pages 233–240. IEEE, 2025

  12. [12]

    k-means-: A unified approach to clustering and outlier detection

    Sanjay Chawla and Aristides Gionis. k-means-: A unified approach to clustering and outlier detection. In SDM, pages 189–197. SIAM, 2013

  13. [13]

    A constant factor approximation algorithm for k-median clustering with outliers

    Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826–835. SIAM, 2008

  14. [14]

    k-median/means with outliers revisited: A simple fpt approximation

    Xianrun Chen, Lu Han, Dachuan Xu, Yicheng Xu, and Yong Zhang. k-median/means with outliers revisited: A simple fpt approximation. In COCOON (2), volume 14423 of Lecture Notes in Computer Science, pages 295–302. Springer, 2023

  15. [15]

    Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets

    Vincent Cohen-Addad, Hossein Esfandiari, Vahab S. Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. In STOC, pages 1621–1628. ACM, 2022

  16. [16]

    A (2+ ϵ)-approximation algorithm for metric k-median

    Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn, and Ola Svensson. A (2+ ϵ)-approximation algorithm for metric k-median. In STOC, pages 615–624. ACM, 2025

  17. [17]

    Robust k-means++

    Amit Deshpande, Praneeth Kacham, and Rameshwar Pratap. Robust k-means++. In UAI, volume 124 of Proceedings of Machine Learning Research, pages 799–808. AUAI Press, 2020

  18. [18]

    The faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. IEEE Transactions on Big Data, 2025

  19. [19]

    A geometric framework for unsupervised anomaly detection

    Eleazar Eskin, Andrew Arnold, Michael J. Prerau, Leonid Portnoy, and Salvatore J. Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, Advances in Information Security, pages 77–101. Springer, 2002

  20. [20]

    Improved algorithms for clustering with outliers

    Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Improved algorithms for clustering with outliers. In ISAAC, volume 149 of LIPIcs, pages 61:1–61:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019

  21. [21]

    Approximation schemes for clustering with outliers

    Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019

  22. [22]

    Fast mining of distance-based outliers in high-dimensional datasets

    Amol Ghoting, Srinivasan Parthasarathy, and Matthew Eric Otey. Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov., 16(3):349–364, 2008

  23. [23]

    Improved bi-point rounding algorithms and a golden barrier for k-median

    Kishen N. Gowda, Thomas W. Pensyl, Aravind Srinivasan, and Khoa Trinh. Improved bi-point rounding algorithms and a golden barrier for k-median. In SODA, pages 987–1011. SIAM, 2023

  24. [24]

    Adapting k-means algorithms for outliers

    Christoph Grunau and Václav Rozhon. Adapting k-means algorithms for outliers. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 7845–7886. PMLR, 2022

  25. [25]

    Accelerating large-scale inference with anisotropic vector quantization

    Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 3887–3896. PMLR, 2020

  26. [26]

    Structural iterative rounding for generalized k-median problems

    Anupam Gupta, Benjamin Moseley, and Rudy Zhou. Structural iterative rounding for generalized k-median problems. In ICALP, volume 198 of LIPIcs, pages 77:1–77:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021

  27. [27]

    Local search methods for k-means with outliers

    Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757–768, 2017

  28. [28]

    Array programming with NumPy

    Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Ke...

  29. [29]

    Near-linear time approximation algorithms for k-means with outliers

    Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Near-linear time approximation algorithms for k-means with outliers. In Proceedings of the 41st International Conference on Machine Learning, pages 19723–19756, 2024

  30. [30]

    Fast noise removal for k-means clustering

    Sungjin Im, Mahshid Montazer Qaem, Benjamin Moseley, Xiaorui Sun, and Rudy Zhou. Fast noise removal for k-means clustering. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pages 456–466. PMLR, 2020

  31. [31]

    Clustering what matters in constrained settings: Improved outlier to outlier-free reductions

    Ragesh Jaiswal and Amit Kumar. Clustering what matters in constrained settings: Improved outlier to outlier-free reductions. In ISAAC, volume 283 of LIPIcs, pages 41:1–41:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023

  32. [32]

    Diskann: Fast accurate billion-point nearest neighbor search on a single node

    Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, and Rohan Kadekodi. Diskann: Fast accurate billion-point nearest neighbor search on a single node. Advances in Neural Information Processing Systems, 32, 2019

  33. [33]

    Billion-scale similarity search with gpus

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Trans. Big Data, 7(3):535–547, 2021

  34. [34]

    A local search approximation algorithm for k-means clustering

    Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004

  35. [35]

    Algorithms for mining distance-based outliers in large datasets

    Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392–403. Morgan Kaufmann, 1998

  36. [36]

    Spatial weighted outlier detection

    Yufeng Kou, Chang-Tien Lu, and Dechang Chen. Spatial weighted outlier detection. In SDM, pages 614–618. SIAM, 2006

  37. [37]

    Constant approximation for k-median and k-means with outliers via iterative rounding

    Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In STOC, pages 646–659. ACM, 2018

  38. [38]

    The diskann library: Graph-based indices for fast, fresh and filtered vector search

    Ravishankar Krishnaswamy, Magdalen Dobson Manohar, and Harsha Vardhan Simhadri. The diskann library: Graph-based indices for fast, fresh and filtered vector search. IEEE Data Eng. Bull., 48(3):20–42, 2024

  39. [39]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020

  40. [40]

    Use of k-nearest neighbor classifier for intrusion detection

    Yihua Liao and V. Rao Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Comput. Secur., 21(5):439–448, 2002

  41. [41]

    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

    Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, 2020

  42. [42]

    On approximate geometric k-clustering

    Jiří Matoušek. On approximate geometric k-clustering. Discret. Comput. Geom., 24(1):61–84, 2000

  43. [43]

    Fast distributed outlier detection in mixed-attribute data sets

    Matthew Eric Otey, Amol Ghoting, and Srinivasan Parthasarathy. Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov., 12(2-3):203–228, 2006

  44. [44]

    Scikit-learn: Machine learning in Python

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  45. [45]

    Efficient algorithms for mining outliers from large data sets

    Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD Conference, pages 427–438. ACM, 2000

  46. [46]

    Mining distance-based outliers from large databases in any metric space

    Yufei Tao, Xiaokui Xiao, and Shuigeng Zhou. Mining distance-based outliers from large databases in any metric space. In KDD, pages 394–403. ACM, 2006

  47. [47]

    A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search

    Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow., 14(11):1964–1978, 2021

  48. [48]

    HOT: hypergraph-based outlier test for categorical data

    Li Wei, Weining Qian, Aoying Zhou, Wen Jin, and Jeffrey Xu Yu. HOT: hypergraph-based outlier test for categorical data. In PAKDD, volume 2637 of Lecture Notes in Computer Science , pages 399–410. Springer, 2003

  49. [49]

    Outlier detection by sampling with accuracy guarantees

    Mingxi Wu and Chris Jermaine. Outlier detection by sampling with accuracy guarantees. In KDD, pages 767–772. ACM, 2006

  50. [50]

    Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance

    Ji Zhang and Hai H. Wang. Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl. Inf. Syst., 10(3):333–355, 2006

  51. [51]

    Outlier detection techniques for wireless sensor networks: A survey

    Yang Zhang, Nirvana Meratnia, and Paul J. M. Havinga. Outlier detection techniques for wireless sensor networks: A survey. IEEE Commun. Surv. Tutorials, 12(2):159–170, 2010
