Recognition: 2 theorem links
· Lean Theorem · Simple KNN-Based Outlier Detection Achieves Robust Clustering
Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3
The pith
Removing points with large K-nearest-neighbor distances reduces robust k-means to standard k-means with a constant-factor approximation guarantee.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a practical assumption on the optimal cluster sizes, simply removing the z points with the largest K-nearest-neighbor distances allows one to solve robust k-means by applying a standard k-means algorithm to the remaining points, yielding a constant-factor approximation without introducing additional centers or discarding extra outliers.
What carries the argument
KNN-based outlier detection via large K-nearest-neighbor distances, which performs the reduction from robust k-means to standard k-means.
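As a concrete reading of that machinery, here is a minimal sketch of the pipeline as the abstract describes it: score each point by its distance to its K-th nearest neighbor, drop the z highest scorers, then run ordinary k-means on the rest. This is our illustration, not the authors' code; the function name, the sklearn calls, and the default K are assumptions.

```python
# Minimal sketch of the reviewed pipeline (ours, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def knn_then_kmeans(X, k, z, K=5):
    """Drop the z points with the largest K-th-neighbor distance, then k-means."""
    # Ask for K+1 neighbors because each point is returned as its own neighbor.
    dists, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    knn_dist = dists[:, -1]                    # distance to the K-th true neighbor
    keep = np.argsort(knn_dist)[: len(X) - z]  # all but the z largest scores
    km = KMeans(n_clusters=k, n_init=10).fit(X[keep])
    return km.cluster_centers_, keep
```

Under the paper's cluster-size assumption, any β-approximate k-means solver in the last step is claimed to yield a constant-factor guarantee for the robust objective.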
If this is right
- Delivers a constant-factor approximation guarantee for robust k-means using only standard k-means on cleaned data.
- Requires no additional centers or removal of more outliers than specified.
- Matches or improves on the clustering cost of more sophisticated algorithms on real-world data.
- Offers improved runtime efficiency compared to complex robust clustering methods.
Where Pith is reading between the lines
- Similar simple heuristics might work for other robust optimization problems in machine learning.
- Testing on synthetic data with varying cluster size imbalances could reveal the assumption's necessity.
- May inspire analysis of why KNN distances correlate with outlier status in clustered data.
Load-bearing premise
The approach depends on the optimal clusters having sizes that meet a practical condition, without which the KNN distances may not accurately identify the outliers.
What would settle it
A counterexample dataset with optimal clusters violating the size assumption where the KNN removal method produces a clustering cost significantly worse than the optimal robust k-means solution.
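One way to hunt for such a counterexample is to plant an optimal cluster with fewer than K+1 points, so that its members' K-th neighbors lie in a distant cluster and their KNN scores inflate. The probe below is our hedged construction, not an experiment from the paper; all constants are arbitrary.

```python
# Hedged probe (ours): a planted cluster with |cluster| <= K, so its points'
# K-th neighbor is far away and KNN removal may sacrifice them as "outliers".
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
big  = rng.normal((0.0, 0.0),   1.0, size=(500, 2))
tiny = rng.normal((20.0, 20.0), 1.0, size=(4, 2))     # only 4 points < K + 1
outs = rng.uniform(-50.0, 50.0, size=(15, 2))         # z = 15 planted outliers
X = np.vstack([big, tiny, outs])
group = np.array([0] * 500 + [1] * 4 + [2] * 15)      # 2 marks true outliers

K, z = 5, 15
d, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
removed = np.argsort(d[:, -1])[-z:]
print("removed, by true group:", np.bincount(group[removed], minlength=3))
# A nonzero count in slot 1 means inliers of the small cluster were discarded.
```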
Original abstract
Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, under a practical assumption on the sizes of the optimal clusters, simply removing the z points with the largest K-nearest-neighbor distances yields a constant-factor approximation for the robust k-means problem by reducing it directly to standard k-means (without extra centers or additional outlier discards). It supports the claim with a theoretical argument and shows empirically that the method matches or outperforms several more sophisticated robust clustering algorithms on real-world datasets in terms of clustering cost and runtime.
Significance. If the reduction holds with a reasonable assumption, the result is significant: it shows that a classic, parameter-light outlier-detection heuristic can match the approximation guarantees of prior robust k-means work while simplifying the pipeline. This bridges outlier detection and clustering theory and offers a practical, easy-to-implement baseline. The empirical competitiveness further suggests the approach may be useful even when the assumption is only approximately satisfied.
major comments (2)
- [main theoretical result / Assumption on cluster sizes] The central constant-factor reduction (abstract and main theorem) rests on an unstated or only informally described assumption on optimal cluster sizes (likely that each cluster is at least some multiple of z/k in size so that inlier KNN distances are strictly smaller than outlier distances). The manuscript must state this assumption precisely (with explicit constants), derive the resulting approximation factor, and show that the assumption is satisfied in the regimes where the guarantee is claimed; without these details the reduction cannot be verified and may fail for small or unbalanced clusters.
- [proof of the reduction] In the section on the KNN removal argument, it is unclear whether the procedure removes exactly the z outliers, or whether it may discard inliers from the smallest optimal cluster or retain some outliers when cluster sizes vary near the threshold. A concrete example, or a lemma showing that under the assumption the largest inlier KNN distance is always smaller than the smallest outlier KNN distance, is needed.
minor comments (2)
- [Experiments] The value of K used for the KNN distances should be stated explicitly (or the results shown to be insensitive to it) in the experimental section, along with any preprocessing applied to the real-world datasets.
- [Introduction / Related Work] Notation for the robust k-means objective and the KNN distance should be introduced once and used consistently; a short table comparing the new method to prior approaches (number of centers, extra discards, assumption strength) would improve readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. The suggestions for greater precision in the theoretical development are helpful, and we will revise the paper to incorporate them.
Point-by-point responses
Referee: [main theoretical result / Assumption on cluster sizes] The central constant-factor reduction (abstract and main theorem) rests on an unstated or only informally described assumption on optimal cluster sizes (likely that each cluster is at least some multiple of z/k in size so that inlier KNN distances are strictly smaller than outlier distances). The manuscript must state this assumption precisely (with explicit constants), derive the resulting approximation factor, and show that the assumption is satisfied in the regimes where the guarantee is claimed; without these details the reduction cannot be verified and may fail for small or unbalanced clusters.
Authors: We agree that the assumption on optimal cluster sizes should be stated with full precision, including explicit constants, to allow verification of the constant-factor reduction. In the revised manuscript we will formally define the assumption (each optimal cluster size at least a fixed multiple of z/k sufficient to separate inlier and outlier KNN distances), derive the resulting approximation factor explicitly, and discuss the regimes in which the assumption holds, including the behavior for balanced versus unbalanced clusters. revision: yes
Referee: [proof of the reduction] In the section on the KNN removal argument, it is unclear whether the procedure removes exactly the z outliers, or whether it may discard inliers from the smallest optimal cluster or retain some outliers when cluster sizes vary near the threshold. A concrete example, or a lemma showing that under the assumption the largest inlier KNN distance is always smaller than the smallest outlier KNN distance, is needed.
Authors: We will add a dedicated lemma establishing that, under the cluster-size assumption, the largest inlier KNN distance is strictly smaller than the smallest outlier KNN distance, which implies that the procedure removes exactly the z outliers. The revised section will also contain a concrete numerical example illustrating the separation, including the case of cluster sizes near the threshold. revision: yes
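A quick numeric illustration of the separation the promised lemma would formalize might look like the following sketch. It is ours, with arbitrary constants; it does not reproduce the lemma, only the inequality it is meant to certify.

```python
# Sketch of the separation check (ours): on balanced, well-separated clusters
# the largest inlier KNN distance should sit below the smallest outlier one.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
inliers = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in centers])
outliers = rng.uniform(-40.0, 60.0, size=(10, 2))
X = np.vstack([inliers, outliers])

K = 10
d, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
knn = d[:, -1]
print("max inlier KNN distance :", knn[: len(inliers)].max())
print("min outlier KNN distance:", knn[len(inliers):].min())
# Exact removal of the z outliers holds iff the first value is strictly smaller.
```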
Circularity Check
No circularity: proof under explicit assumption is self-contained
Full rationale
The paper states a mathematical proof that, under a stated practical assumption on optimal cluster sizes, KNN-distance outlier removal yields a constant-factor reduction from robust k-Means to standard k-Means without extra centers or discards. No equations, parameters, or steps in the abstract reduce the guarantee to a fitted input, self-definition, or self-citation chain; the assumption is introduced as a precondition rather than derived from the method. The central claim therefore stands as an independent derivation rather than a renaming or tautological mapping of its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: practical assumption on the optimal cluster sizes
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: Theorem 4.1 (Restatement of Theorem 1.1). ... if each cluster induced by $C^*$ has size at least $cz$, then ... $f_z(X, C) \le \Phi(c)\,\beta \cdot f_z(X, C^*)$
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: Lemma 4.2 ... $f_z(X, C^*) \ge |Y|\,T^2 \cdot \max\{\lambda^2, (1-\lambda)^2\}$
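For readers decoding the two passages above, the notation is presumably the standard robust k-means objective. The reconstruction below is ours, assembled from the abstract's definitions rather than quoted from the paper's own display.

```latex
% Robust k-means cost: delete the z most favorable points, pay squared
% distances from the rest to their nearest of the k centers in C.
f_z(X, C) \;=\; \min_{\substack{Y \subseteq X \\ |Y| = z}}
  \; \sum_{x \in X \setminus Y} \; \min_{c \in C} \, \lVert x - c \rVert^{2}
% Theorem 4.1 then reads: if every optimal cluster has size at least c z,
% a \beta-approximate k-means run on the KNN-cleaned data returns C with
% f_z(X, C) \le \Phi(c)\,\beta \cdot f_z(X, C^{*}).
```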
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Akanksha Agrawal, Tanmay Inamdar, Saket Saurabh, and Jie Xue. Clustering what matters: Optimal approximation for clustering with outliers. J. Artif. Intell. Res., 78:143–166, 2023.
- [2] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. In FOCS, pages 61–72. IEEE Computer Society, 2017.
- [3] Matteo Almanza, Alessandro Epasto, Alessandro Panconesi, and Giuseppe Re. k-clustering with fair outliers. In WSDM, pages 5–15. ACM, 2022.
- [4] Sanjeev Arora, Prabhakar Raghavan, and Satish Rao. Approximation schemes for euclidean k-medians and related problems. In STOC, pages 106–113. ACM, 1998.
- [5] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.
- [6] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019.
- [7] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, et al. Improving language models by retrieving from trillions of tokens. In ICML, 2022.
- [8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, 2009.
- [9] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In STOC, pages 1–10. ACM, 1999.
- [10] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001.
- [11] Moses Charikar, Vincent Cohen-Addad, Ruiquan Gao, Fabrizio Grandoni, Euiwoong Lee, and Ernest van Wijland. An improved greedy approximation for (metric) k-means. In FOCS, pages 233–240. IEEE, 2025.
- [12] Sanjay Chawla and Aristides Gionis. k-means--: A unified approach to clustering and outlier detection. In SDM, pages 189–197. SIAM, 2013.
- [13] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826–835. SIAM, 2008.
- [14] Xianrun Chen, Lu Han, Dachuan Xu, Yicheng Xu, and Yong Zhang. k-median/means with outliers revisited: A simple FPT approximation. In COCOON (2), volume 14423 of Lecture Notes in Computer Science, pages 295–302. Springer, 2023.
- [15] Vincent Cohen-Addad, Hossein Esfandiari, Vahab S. Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. In STOC, pages 1621–1628. ACM, 2022.
- [16] Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn, and Ola Svensson. A (2+ϵ)-approximation algorithm for metric k-median. In STOC, pages 615–624. ACM, 2025.
- [17] Amit Deshpande, Praneeth Kacham, and Rameshwar Pratap. Robust k-means++. In UAI, volume 124 of Proceedings of Machine Learning Research, pages 799–808. AUAI Press, 2020.
- [18] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. IEEE Transactions on Big Data, 2025.
- [19] Eleazar Eskin, Andrew Arnold, Michael J. Prerau, Leonid Portnoy, and Salvatore J. Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, Advances in Information Security, pages 77–101. Springer, 2002.
- [20] Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Improved algorithms for clustering with outliers. In ISAAC, volume 149 of LIPIcs, pages 61:1–61:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
- [21] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019.
- [22] Amol Ghoting, Srinivasan Parthasarathy, and Matthew Eric Otey. Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov., 16(3):349–364, 2008.
- [23] Kishen N. Gowda, Thomas W. Pensyl, Aravind Srinivasan, and Khoa Trinh. Improved bi-point rounding algorithms and a golden barrier for k-median. In SODA, pages 987–1011. SIAM, 2023.
- [24] Christoph Grunau and Václav Rozhoň. Adapting k-means algorithms for outliers. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 7845–7886. PMLR, 2022.
- [25] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 3887–3896. PMLR, 2020.
- [26] Anupam Gupta, Benjamin Moseley, and Rudy Zhou. Structural iterative rounding for generalized k-median problems. In ICALP, volume 198 of LIPIcs, pages 77:1–77:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- [27] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757–768, 2017.
- [28] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, et al. Array programming with NumPy. Nature, 585:357–362, 2020.
- [29] Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Near-linear time approximation algorithms for k-means with outliers. In Proceedings of the 41st International Conference on Machine Learning, pages 19723–19756, 2024.
- [30] Sungjin Im, Mahshid Montazer Qaem, Benjamin Moseley, Xiaorui Sun, and Rudy Zhou. Fast noise removal for k-means clustering. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pages 456–466. PMLR, 2020.
- [31] Ragesh Jaiswal and Amit Kumar. Clustering what matters in constrained settings: Improved outlier to outlier-free reductions. In ISAAC, volume 283 of LIPIcs, pages 41:1–41:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
- [32] Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, and Rohan Kadekodi. DiskANN: Fast accurate billion-point nearest neighbor search on a single node. In Advances in Neural Information Processing Systems, 32, 2019.
- [33] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Trans. Big Data, 7(3):535–547, 2021.
- [34] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
- [35] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392–403. Morgan Kaufmann, 1998.
- [36] Yufeng Kou, Chang-Tien Lu, and Dechang Chen. Spatial weighted outlier detection. In SDM, pages 614–618. SIAM, 2006.
- [37] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In STOC, pages 646–659. ACM, 2018.
- [38] Ravishankar Krishnaswamy, Magdalen Dobson Manohar, and Harsha Vardhan Simhadri. The DiskANN library: Graph-based indices for fast, fresh and filtered vector search. IEEE Data Eng. Bull., 48(3):20–42, 2024.
- [39] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020.
- [40] Yihua Liao and V. Rao Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Comput. Secur., 21(5):439–448, 2002.
- [41] Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, 2020.
- [42] Jiří Matoušek. On approximate geometric k-clustering. Discret. Comput. Geom., 24(1):61–84, 2000.
- [43] Matthew Eric Otey, Amol Ghoting, and Srinivasan Parthasarathy. Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov., 12(2-3):203–228, 2006.
- [44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [45] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD Conference, pages 427–438. ACM, 2000.
- [46] Yufei Tao, Xiaokui Xiao, and Shuigeng Zhou. Mining distance-based outliers from large databases in any metric space. In KDD, pages 394–403. ACM, 2006.
- [47] Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow., 14(11):1964–1978, 2021.
- [48] Li Wei, Weining Qian, Aoying Zhou, Wen Jin, and Jeffrey Xu Yu. HOT: hypergraph-based outlier test for categorical data. In PAKDD, volume 2637 of Lecture Notes in Computer Science, pages 399–410. Springer, 2003.
- [49] Mingxi Wu and Chris Jermaine. Outlier detection by sampling with accuracy guarantees. In KDD, pages 767–772. ACM, 2006.
- [50] Ji Zhang and Hai H. Wang. Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl. Inf. Syst., 10(3):333–355, 2006.
- [51] Yang Zhang, Nirvana Meratnia, and Paul J. M. Havinga. Outlier detection techniques for wireless sensor networks: A survey. IEEE Commun. Surv. Tutorials, 12(2):159–170, 2010.
Appendix A: Future Work
The paper's findings suggest that outlier detection and robust clustering should be studied in a unified framework, leading to several immediate directions for future work:
- Tighten the analysis of KNN. The current analysis of the KNN heuristic is not tight; establishing sharper bounds or identifying matching lower bounds remains an open problem.
- Outlier detection for robust clustering. It would be interesting to analyze other widely used outlier detection heuristics, particularly variants of the KNN heuristic, and investigate whether they admit strong theoretical guarantees or empirical performance for the robust k-Means objective.
- Robust clustering for outlier detection. Another promising direction is to adapt advanced techniques from the robust clustering literature to design more effective outlier detection algorithms.
- Beyond worst-case analysis. The results focus on worst-case approximation guarantees, where both cluster structure and outliers are adversarial; exploring refined analyses under additional, practically motivated assumptions on the data generation process may yield significantly stronger guarantees.