Recognition: 2 theorem links
· Lean Theorem · Simple KNN-Based Outlier Detection Achieves Robust Clustering
Pith reviewed 2026-05-11 02:27 UTC · model grok-4.3
The pith
Removing points with large K-nearest-neighbor distances reduces robust k-means to standard k-means with a constant-factor approximation guarantee.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a practical assumption on the optimal cluster sizes, simply removing the z points with the largest K-nearest-neighbor distances allows one to solve robust k-means by applying a standard k-means algorithm to the remaining points, yielding a constant-factor approximation without introducing additional centers or discarding extra outliers.
What carries the argument
KNN-based outlier detection via large K-nearest-neighbor distances, which performs the reduction from robust k-means to standard k-means.
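As a concrete reading of that machinery, here is a minimal sketch of the pipeline as the abstract describes it: score each point by its distance to its K-th nearest neighbor, drop the z highest scorers, then run ordinary k-means on the rest. This is our illustration, not the authors' code; the function name, the sklearn calls, and the default K are assumptions.

```python
# Minimal sketch of the reviewed pipeline (ours, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def knn_then_kmeans(X, k, z, K=5):
    """Drop the z points with the largest K-th-neighbor distance, then k-means."""
    # Ask for K+1 neighbors because each point is returned as its own neighbor.
    dists, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    knn_dist = dists[:, -1]                    # distance to the K-th true neighbor
    keep = np.argsort(knn_dist)[: len(X) - z]  # all but the z largest scores
    km = KMeans(n_clusters=k, n_init=10).fit(X[keep])
    return km.cluster_centers_, keep
```

Under the paper's cluster-size assumption, any β-approximate k-means solver in the last step is claimed to yield a constant-factor guarantee for the robust objective.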
If this is right
- Delivers a constant-factor approximation guarantee for robust k-means using only standard k-means on cleaned data.
- Requires no additional centers or removal of more outliers than specified.
- Matches or improves on the clustering cost of more sophisticated algorithms on real-world data.
- Offers improved runtime efficiency compared to complex robust clustering methods.
Where Pith is reading between the lines
- Similar simple heuristics might work for other robust optimization problems in machine learning.
- Testing on synthetic data with varying cluster size imbalances could reveal the assumption's necessity.
- May inspire analysis of why KNN distances correlate with outlier status in clustered data.
Load-bearing premise
The approach depends on the optimal clusters having sizes that meet a practical condition, without which the KNN distances may not accurately identify the outliers.
What would settle it
A counterexample dataset with optimal clusters violating the size assumption where the KNN removal method produces a clustering cost significantly worse than the optimal robust k-means solution.
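One way to hunt for such a counterexample is to plant an optimal cluster with fewer than K+1 points, so that its members' K-th neighbors lie in a distant cluster and their KNN scores inflate. The probe below is our hedged construction, not an experiment from the paper; all constants are arbitrary.

```python
# Hedged probe (ours): a planted cluster with |cluster| <= K, so its points'
# K-th neighbor is far away and KNN removal may sacrifice them as "outliers".
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
big  = rng.normal((0.0, 0.0),   1.0, size=(500, 2))
tiny = rng.normal((20.0, 20.0), 1.0, size=(4, 2))     # only 4 points < K + 1
outs = rng.uniform(-50.0, 50.0, size=(15, 2))         # z = 15 planted outliers
X = np.vstack([big, tiny, outs])
group = np.array([0] * 500 + [1] * 4 + [2] * 15)      # 2 marks true outliers

K, z = 5, 15
d, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
removed = np.argsort(d[:, -1])[-z:]
print("removed, by true group:", np.bincount(group[removed], minlength=3))
# A nonzero count in slot 1 means inliers of the small cluster were discarded.
```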
Original abstract
Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that, under a practical assumption on the sizes of the optimal clusters, simply removing the z points with the largest K-nearest-neighbor distances yields a constant-factor approximation for the robust k-means problem by reducing it directly to standard k-means (without extra centers or additional outlier discards). It supports the claim with a theoretical argument and shows empirically that the method matches or outperforms several more sophisticated robust clustering algorithms on real-world datasets in terms of clustering cost and runtime.
Significance. If the reduction holds with a reasonable assumption, the result is significant: it shows that a classic, parameter-light outlier-detection heuristic can match the approximation guarantees of prior robust k-means work while simplifying the pipeline. This bridges outlier detection and clustering theory and offers a practical, easy-to-implement baseline. The empirical competitiveness further suggests the approach may be useful even when the assumption is only approximately satisfied.
major comments (2)
- [main theoretical result / Assumption on cluster sizes] The central constant-factor reduction (abstract and main theorem) rests on an unstated or only informally described assumption on optimal cluster sizes (likely that each cluster is at least some multiple of z/k in size so that inlier KNN distances are strictly smaller than outlier distances). The manuscript must state this assumption precisely (with explicit constants), derive the resulting approximation factor, and show that the assumption is satisfied in the regimes where the guarantee is claimed; without these details the reduction cannot be verified and may fail for small or unbalanced clusters.
- [proof of the reduction] In the section on the KNN removal argument, it is unclear whether the procedure removes exactly the z outliers, or whether it may discard inliers from the smallest optimal cluster or retain some outliers when cluster sizes vary near the threshold. A concrete example, or a lemma showing that under the assumption the largest inlier KNN distance is always smaller than the smallest outlier KNN distance, is needed.
minor comments (2)
- [Experiments] The value of K used for the KNN distances should be stated explicitly (or the results shown to be insensitive to it) in the experimental section, along with any preprocessing applied to the real-world datasets.
- [Introduction / Related Work] Notation for the robust k-means objective and the KNN distance should be introduced once and used consistently; a short table comparing the new method to prior approaches (number of centers, extra discards, assumption strength) would improve readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. The suggestions for greater precision in the theoretical development are helpful, and we will revise the paper to incorporate them.
Point-by-point responses
Referee: [main theoretical result / Assumption on cluster sizes] The central constant-factor reduction (abstract and main theorem) rests on an unstated or only informally described assumption on optimal cluster sizes (likely that each cluster is at least some multiple of z/k in size so that inlier KNN distances are strictly smaller than outlier distances). The manuscript must state this assumption precisely (with explicit constants), derive the resulting approximation factor, and show that the assumption is satisfied in the regimes where the guarantee is claimed; without these details the reduction cannot be verified and may fail for small or unbalanced clusters.
Authors: We agree that the assumption on optimal cluster sizes should be stated with full precision, including explicit constants, to allow verification of the constant-factor reduction. In the revised manuscript we will formally define the assumption (each optimal cluster size at least a fixed multiple of z/k sufficient to separate inlier and outlier KNN distances), derive the resulting approximation factor explicitly, and discuss the regimes in which the assumption holds, including the behavior for balanced versus unbalanced clusters. revision: yes
Referee: [proof of the reduction] In the section on the KNN removal argument, it is unclear whether the procedure removes exactly the z outliers, or whether it may discard inliers from the smallest optimal cluster or retain some outliers when cluster sizes vary near the threshold. A concrete example, or a lemma showing that under the assumption the largest inlier KNN distance is always smaller than the smallest outlier KNN distance, is needed.
Authors: We will add a dedicated lemma establishing that, under the cluster-size assumption, the largest inlier KNN distance is strictly smaller than the smallest outlier KNN distance, which implies that the procedure removes exactly the z outliers. The revised section will also contain a concrete numerical example illustrating the separation, including the case of cluster sizes near the threshold. revision: yes
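A quick numeric illustration of the separation the promised lemma would formalize might look like the following sketch. It is ours, with arbitrary constants; it does not reproduce the lemma, only the inequality it is meant to certify.

```python
# Sketch of the separation check (ours): on balanced, well-separated clusters
# the largest inlier KNN distance should sit below the smallest outlier one.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
inliers = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in centers])
outliers = rng.uniform(-40.0, 60.0, size=(10, 2))
X = np.vstack([inliers, outliers])

K = 10
d, _ = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
knn = d[:, -1]
print("max inlier KNN distance :", knn[: len(inliers)].max())
print("min outlier KNN distance:", knn[len(inliers):].min())
# Exact removal of the z outliers holds iff the first value is strictly smaller.
```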
Circularity Check
No circularity: proof under explicit assumption is self-contained
Full rationale
The paper states a mathematical proof that, under a stated practical assumption on optimal cluster sizes, KNN-distance outlier removal yields a constant-factor reduction from robust k-Means to standard k-Means without extra centers or discards. No equations, parameters, or steps in the abstract reduce the guarantee to a fitted input, self-definition, or self-citation chain; the assumption is introduced as a precondition rather than derived from the method. The central claim therefore stands as an independent derivation rather than a renaming or tautological mapping of its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: practical assumption on the optimal cluster sizes
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: Theorem 4.1 (Restatement of Theorem 1.1). ... if each cluster induced by $C^*$ has size at least $cz$, then ... $f_z(X, C) \le \Phi(c)\,\beta \cdot f_z(X, C^*)$
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: Lemma 4.2 ... $f_z(X, C^*) \ge |Y|\,T^2 \cdot \max\{\lambda^2, (1-\lambda)^2\}$
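For readers decoding the two passages above, the notation is presumably the standard robust k-means objective. The reconstruction below is ours, assembled from the abstract's definitions rather than quoted from the paper's own display.

```latex
% Robust k-means cost: delete the z most favorable points, pay squared
% distances from the rest to their nearest of the k centers in C.
f_z(X, C) \;=\; \min_{\substack{Y \subseteq X \\ |Y| = z}}
  \; \sum_{x \in X \setminus Y} \; \min_{c \in C} \, \lVert x - c \rVert^{2}
% Theorem 4.1 then reads: if every optimal cluster has size at least c z,
% a \beta-approximate k-means run on the KNN-cleaned data returns C with
% f_z(X, C) \le \Phi(c)\,\beta \cdot f_z(X, C^{*}).
```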
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Akanksha Agrawal, Tanmay Inamdar, Saket Saurabh, and Jie Xue. Clustering what matters: Optimal approximation for clustering with outliers. J. Artif. Intell. Res., 78:143–166, 2023.
- [2] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. In FOCS, pages 61–72. IEEE Computer Society, 2017.
- [3] Matteo Almanza, Alessandro Epasto, Alessandro Panconesi, and Giuseppe Re. k-clustering with fair outliers. In WSDM, pages 5–15. ACM, 2022.
- [4] Sanjeev Arora, Prabhakar Raghavan, and Satish Rao. Approximation schemes for euclidean k-medians and related problems. In STOC, pages 106–113. ACM, 1998.
- [5] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.
- [6] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019.
- [7] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, et al. Improving language models by retrieving from trillions of tokens. In ICML, 2022.
- [8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, 2009.
- [9] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In STOC, pages 1–10. ACM, 1999.
- [10] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001.
- [11] Moses Charikar, Vincent Cohen-Addad, Ruiquan Gao, Fabrizio Grandoni, Euiwoong Lee, and Ernest van Wijland. An improved greedy approximation for (metric) k-means. In FOCS, pages 233–240. IEEE, 2025.
- [12] Sanjay Chawla and Aristides Gionis. k-means--: A unified approach to clustering and outlier detection. In SDM, pages 189–197. SIAM, 2013.
- [13] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826–835. SIAM, 2008.
- [14] Xianrun Chen, Lu Han, Dachuan Xu, Yicheng Xu, and Yong Zhang. k-median/means with outliers revisited: A simple FPT approximation. In COCOON (2), volume 14423 of Lecture Notes in Computer Science, pages 295–302. Springer, 2023.
- [15] Vincent Cohen-Addad, Hossein Esfandiari, Vahab S. Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. In STOC, pages 1621–1628. ACM, 2022.
- [16] Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn, and Ola Svensson. A (2+ϵ)-approximation algorithm for metric k-median. In STOC, pages 615–624. ACM, 2025.
- [17] Amit Deshpande, Praneeth Kacham, and Rameshwar Pratap. Robust k-means++. In UAI, volume 124 of Proceedings of Machine Learning Research, pages 799–808. AUAI Press, 2020.
- [18] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. IEEE Transactions on Big Data, 2025.
- [19] Eleazar Eskin, Andrew Arnold, Michael J. Prerau, Leonid Portnoy, and Salvatore J. Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, Advances in Information Security, pages 77–101. Springer, 2002.
- [20] Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Improved algorithms for clustering with outliers. In ISAAC, volume 149 of LIPIcs, pages 61:1–61:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
- [21] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019.
- [22] Amol Ghoting, Srinivasan Parthasarathy, and Matthew Eric Otey. Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov., 16(3):349–364, 2008.
- [23] Kishen N. Gowda, Thomas W. Pensyl, Aravind Srinivasan, and Khoa Trinh. Improved bi-point rounding algorithms and a golden barrier for k-median. In SODA, pages 987–1011. SIAM, 2023.
- [24] Christoph Grunau and Václav Rozhoň. Adapting k-means algorithms for outliers. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 7845–7886. PMLR, 2022.
- [25] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 3887–3896. PMLR, 2020.
- [26] Anupam Gupta, Benjamin Moseley, and Rudy Zhou. Structural iterative rounding for generalized k-median problems. In ICALP, volume 198 of LIPIcs, pages 77:1–77:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- [27] Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757–768, 2017.
- [28] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, et al. Array programming with NumPy. Nature, 585:357–362, 2020.
- [29] Junyu Huang, Qilong Feng, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Near-linear time approximation algorithms for k-means with outliers. In Proceedings of the 41st International Conference on Machine Learning, pages 19723–19756, 2024.
- [30] Sungjin Im, Mahshid Montazer Qaem, Benjamin Moseley, Xiaorui Sun, and Rudy Zhou. Fast noise removal for k-means clustering. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pages 456–466. PMLR, 2020.
- [31] Ragesh Jaiswal and Amit Kumar. Clustering what matters in constrained settings: Improved outlier to outlier-free reductions. In ISAAC, volume 283 of LIPIcs, pages 41:1–41:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
- [32] Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, and Rohan Kadekodi. DiskANN: Fast accurate billion-point nearest neighbor search on a single node. In Advances in Neural Information Processing Systems, 32, 2019.
- [33] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Trans. Big Data, 7(3):535–547, 2021.
- [34] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
- [35] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392–403. Morgan Kaufmann, 1998.
- [36] Yufeng Kou, Chang-Tien Lu, and Dechang Chen. Spatial weighted outlier detection. In SDM, pages 614–618. SIAM, 2006.
- [37] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In STOC, pages 646–659. ACM, 2018.
- [38] Ravishankar Krishnaswamy, Magdalen Dobson Manohar, and Harsha Vardhan Simhadri. The DiskANN library: Graph-based indices for fast, fresh and filtered vector search. IEEE Data Eng. Bull., 48(3):20–42, 2024.
- [39] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, 2020.
- [40] Yihua Liao and V. Rao Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Comput. Secur., 21(5):439–448, 2002.
- [41] Yury A. Malkov and Dmitry A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836, 2020.
- [42] Jiří Matoušek. On approximate geometric k-clustering. Discret. Comput. Geom., 24(1):61–84, 2000.
- [43] Matthew Eric Otey, Amol Ghoting, and Srinivasan Parthasarathy. Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov., 12(2-3):203–228, 2006.
- [44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [45] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD Conference, pages 427–438. ACM, 2000.
- [46] Yufei Tao, Xiaokui Xiao, and Shuigeng Zhou. Mining distance-based outliers from large databases in any metric space. In KDD, pages 394–403. ACM, 2006.
- [47] Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow., 14(11):1964–1978, 2021.
- [48] Li Wei, Weining Qian, Aoying Zhou, Wen Jin, and Jeffrey Xu Yu. HOT: hypergraph-based outlier test for categorical data. In PAKDD, volume 2637 of Lecture Notes in Computer Science, pages 399–410. Springer, 2003.
- [49] Mingxi Wu and Chris Jermaine. Outlier detection by sampling with accuracy guarantees. In KDD, pages 767–772. ACM, 2006.
- [50] Ji Zhang and Hai H. Wang. Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl. Inf. Syst., 10(3):333–355, 2006.
- [51] Yang Zhang, Nirvana Meratnia, and Paul J. M. Havinga. Outlier detection techniques for wireless sensor networks: A survey. IEEE Commun. Surv. Tutorials, 12(2):159–170, 2010.
Appendix A: Future Work
The paper's findings suggest that outlier detection and robust clustering should be studied in a unified framework, leading to several immediate directions for future work:
- Tighten the analysis of KNN. The current analysis of the KNN heuristic is not tight; establishing sharper bounds or identifying matching lower bounds remains an open problem.
- Outlier detection for robust clustering. It would be interesting to analyze other widely used outlier detection heuristics, particularly variants of the KNN heuristic, and investigate whether they admit strong theoretical guarantees or empirical performance for the robust k-Means objective.
- Robust clustering for outlier detection. Another promising direction is to adapt advanced techniques from the robust clustering literature to design more effective outlier detection algorithms.
- Beyond worst-case analysis. The results focus on worst-case approximation guarantees, where both cluster structure and outliers are adversarial; exploring refined analyses under additional, practically motivated assumptions on the data generation process may yield significantly stronger guarantees.