DRtool: An Interactive Tool for Analyzing High-Dimensional Clusterings

Julia Fukuyama; Justin Lin

arXiv: 2509.04603 · v3 · submitted 2025-09-04 · stat.AP · cs.LG

Pith reviewed 2026-05-18 19:24 UTC · model grok-4.3

classification: stat.AP, cs.LG

keywords: high-dimensional clustering, dimension reduction, cluster validation, interactive tool, R package, data visualization, over-clustering, false structures

The pith

DRtool supplies interactive analytical plots to distinguish false clusters created by nonlinear dimension reduction from genuine ones in high-dimensional data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Nonlinear dimension reduction methods, commonly used to visualize high-dimensional data, can distort space and generate false cluster structures that are hard to detect. The paper introduces DRtool, an interactive R package, to counter this by offering multiple analytical plots that together give views of the data's global organization and the relationships between individual clusters. A sympathetic reader would care because these false structures often lead to incorrect conclusions about the underlying data patterns and representative samples. The tool aims to make cluster validation more accessible by focusing on visual and relational analysis rather than relying solely on the distorted views.

Core claim

DRtool is an interactive tool that empowers analysts to distinguish false clusters and better interpret their high-dimensional clustering results. The tool uses various analytical plots to provide a multi-faceted perspective on the data's global structure as well as local inter-cluster relationships.

What carries the argument

DRtool's collection of analytical plots that examine global data structure alongside local inter-cluster relationships to assess cluster legitimacy.

Load-bearing premise

The analytical plots provided are sufficient on their own for users to reliably tell false clusters apart from true ones, even without additional domain knowledge or external checks.

What would settle it

A controlled experiment using synthetic data with known true clusters and induced false ones, checking if tool users correctly flag the false clusters at higher rates than non-users.
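
The data side of such an experiment could be set up roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' protocol: the function names (`make_benchmark`, `detection_rate`), the Gaussian blobs, and the decision to induce a false cluster by splitting one true cluster along its first coordinate are all illustrative choices that do not come from the paper.

```python
import random


def make_benchmark(n_per_cluster=50, dim=10, n_true=3, sep=8.0, seed=0):
    """Build synthetic data with known true clusters and one induced false split.

    Returns (points, true_labels, reported_labels, false_pairs), where
    reported_labels over-clusters the data and false_pairs lists the pairs of
    reported labels that actually belong to the same true cluster.
    """
    rng = random.Random(seed)
    points, true_labels = [], []
    for c in range(n_true):
        # Assumed layout: true cluster c is centered at sep * c on every axis.
        for _ in range(n_per_cluster):
            points.append([rng.gauss(sep * c, 1.0) for _ in range(dim)])
            true_labels.append(c)

    # Induce a false cluster: split true cluster 0 at its center (0.0) along
    # axis 0 and give the upper half a new label.
    new_label = n_true
    reported = list(true_labels)
    for i, (p, t) in enumerate(zip(points, true_labels)):
        if t == 0 and p[0] > 0.0:
            reported[i] = new_label
    false_pairs = {(0, new_label)}
    return points, true_labels, reported, false_pairs


def detection_rate(flagged_pairs, false_pairs):
    """Fraction of truly false cluster pairs that a participant flagged."""
    flagged = {tuple(sorted(p)) for p in flagged_pairs}
    truth = {tuple(sorted(p)) for p in false_pairs}
    return len(flagged & truth) / len(truth)
```

Comparing the mean `detection_rate` of tool users with that of non-users, alongside a false-alarm rate computed on pairs of genuinely distinct clusters, would give the comparison described above.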

Figures

Figures reproduced from arXiv: 2509.04603 by Julia Fukuyama, Justin Lin.

**Figure 2.** Power analysis of the MST test. The estimated size was well below 5%. At each number of dimensions, the size experiment was simulated 100 times. Under the null hypothesis, the test never returned significant at 5, 10, and 20 dimensions. At 50 and 100 dimensions, the test returned significant only once each time. The test is conservative because the size must not exceed 5% for any member of the composite …

**Figure 3.** UMAP embedding of the MNIST data set colored according to k-means cluster

**Figure 4.** Path projection plot of classes 1 and 2.

**Figure 5.** Path projection plot of class 4 clusters.

**Figure 6.** Path projection plot of class 9 and remainder of cluster.

**Figure 7.** UMAP embedding of the Wong data set colored according to k-means clustering.

**Figure 8.** Path projection plot of class 4 clusters.

**Figure 9.** Path projection plot of class 8 clusters.
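
The Figure 2 caption describes estimating a test's size by simulating under the null hypothesis and counting rejections. This page does not specify the paper's MST test, so the Python sketch below substitutes a classical stand-in: a Friedman and Rafsky (1979) style count of minimum-spanning-tree edges that join the two samples, with a permutation p-value. The sample sizes, the Gaussian null, the number of permutations, and all function names are assumptions made for illustration.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def cross_edge_pvalue(x, y, n_perm=199, rng=random):
    """Permutation p-value for the number of MST edges joining x to y.

    Few cross edges suggest the two samples come from different distributions.
    """
    labels = [0] * len(x) + [1] * len(y)
    edges = mst_edges(x + y)  # the tree depends only on the pooled points

    def count(lab):
        return sum(lab[a] != lab[b] for a, b in edges)

    observed = count(labels)
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        hits += count(perm) <= observed
    return (1 + hits) / (1 + n_perm)


def estimate_size(dim, n=20, n_sim=100, alpha=0.05, seed=0):
    """Fraction of null simulations (same distribution) rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
        y = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
        rejections += cross_edge_pvalue(x, y, rng=rng) <= alpha
    return rejections / n_sim
```

Calling `estimate_size(dim)` for each dimension of interest (for example 5, 10, 20, 50, and 100) mirrors the structure of the experiment in the caption, though the numbers it produces describe this stand-in test rather than the paper's.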

Original abstract

When faced with new data, we often conduct a cluster analysis to obtain a better understanding of the data's structure and the archetypical samples present in the data. This process often includes visualization of the data, as a way either to discover or to verify clusters. However, the increase in data complexity and dimensionality has made this step very tricky. To visualize data, nonlinear dimension reduction methods are the de facto standard for their ability to non-uniformly stretch and shrink space in order to preserve local clusters. Because this process requires a drastic manipulation of space, however, nonlinear dimension reduction methods are known to produce false structures, especially when mishandled. A common consequence that often goes undetected by the untrained eye is over-clustering of the data. In an effort to deal with this phenomenon, we developed an interactive tool that empowers analysts to distinguish false clusters and better interpret their high-dimensional clustering results. The tool uses various analytical plots to provide a multi-faceted perspective on the data's global structure as well as local inter-cluster relationships, helping users determine the legitimacy of their high-dimensional clustering results. The tool is available via an R package named DRtool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces DRtool, an R package offering an interactive tool with analytical plots that examine global data structure and local inter-cluster relationships in high-dimensional clusterings, with the aim of helping users detect false clusters that can arise from nonlinear dimension reduction methods applied prior to clustering.

Significance. If the described plots prove effective, the tool could address a genuine practical need for analysts interpreting clustering results in high-dimensional settings where visualization artifacts are common. The open R package format supports reproducibility and community use, which is a clear strength for a software contribution.

major comments (1)

[Abstract] Abstract and overall tool description: the central claim that the supplied plots provide a multi-faceted perspective sufficient to determine the legitimacy of clustering results and distinguish false clusters is unsupported, as the manuscript contains no benchmark datasets with known ground-truth clusters, no quantitative detection metrics, no controlled user studies, and no comparison against standard validation approaches.

minor comments (1)

[Usage/Implementation] Clarify the exact set of plot types and their intended diagnostic roles with a concise table or enumerated list in the usage section to improve readability for practitioners.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying this key issue regarding the strength of the claims in the abstract and tool description. We address the major comment below.

Point-by-point responses

Referee: [Abstract] Abstract and overall tool description: the central claim that the supplied plots provide a multi-faceted perspective sufficient to determine the legitimacy of clustering results and distinguish false clusters is unsupported, as the manuscript contains no benchmark datasets with known ground-truth clusters, no quantitative detection metrics, no controlled user studies, and no comparison against standard validation approaches.

Authors: We agree that the manuscript does not contain benchmark datasets with ground-truth labels, quantitative detection metrics, controlled user studies, or systematic comparisons to existing validation indices. DRtool is presented as an interactive software tool for exploratory analysis rather than a new validation methodology, and the manuscript focuses on describing the plots and their intended use in complementing standard practices. The multi-faceted perspective arises from combining global structure diagnostics with local inter-cluster relationship plots, drawing on known properties of nonlinear dimension reduction artifacts. To address the referee's concern, we will revise the abstract and relevant sections to clarify that the tool aids analysts in assessing potential false clusters rather than claiming the plots are sufficient by themselves to determine legitimacy. We will also add one or two detailed illustrative examples using synthetic data with known structure to demonstrate practical application. This constitutes a partial revision, as a full empirical validation study lies outside the scope of this software contribution.

Revision: partial

Circularity Check

0 steps flagged

No circularity: software tool description with no derivations or self-referential claims

The manuscript describes an R package (DRtool) and its analytical plots for inspecting high-dimensional clusterings after nonlinear dimension reduction. It contains no equations, fitted parameters, predictions, or derivation steps that could reduce to inputs by construction. The central claim is a descriptive statement about the tool's intended use rather than a mathematical result derived from prior results or self-citations. No load-bearing self-citations, ansatzes, or uniqueness theorems appear. The work is self-contained as a software tool paper; absence of controlled evaluations is a separate issue of empirical support, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the contribution is a descriptive software package for an existing analysis workflow.

pith-pipeline@v0.9.0 · 5724 in / 906 out tokens · 26903 ms · 2026-05-18T19:24:36.973088+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Becht, E., L. McInnes, J. Healy, C.-A. Dutertre, I. Kwok, L. Ng, F. Ginhoux, and E. Newell (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology 37, 38–44.

[2] Bhattacharya, B. (2019). A general asymptotic framework for distribution-free graph-based two-sample tests. Journal of the Royal Statistical Society Series B: Statistical Methodology 81(3), 575–602.

[3] Chen, H., X. Chen, and Y. Su (2018). A weighted edge-count two-sample test for multivariate and object data. Journal of the American Statistical Association 113(523), 1146–1155.

[4] Chen, H. and J. Friedman (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association 112(517), 397–409.

[5] Coenen, A. and A. Pearce (2024). Understanding UMAP. https://pair-code.github.io/understanding-umap/

[6] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29(6).

[7] Diaconis, P. and D. Freedman (1984). Asymptotics of graphical projection pursuit. The Annals of Statistics 12(3), 783–815.

[8] Friedman, J. and L. Rafsky (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Annals of Statistics 7(4), 697–717.

[9] Gower, J. and G. Ross (1969). Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society Series C: Applied Statistics 18(1), 54–64.

[10] King, B. and B. Tidor (2009). MIST: Maximum information spanning trees for dimension reduction of biological data sets. Bioinformatics 25(9), 1156–1172.

[11] McInnes, L., J. Healy, N. Saul, and L. Großberger (2018). UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software 3(3).

[12] Probst, D. and J.-L. Reymond (2020). Visualizing very large high-dimensional data sets as minimum spanning trees. Journal of Cheminformatics 12(12).

[13] Robinson, D. and L. Foulds (1981). Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147. Rozál, G. and J. Hartigan (1994). The MAP test for multimodality. Journal of Classification 11, 5–36.

[14] Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20, 25–47.

[15] Tenenbaum, J., V. de Silva, and J. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290(2319).

[16] Tuzhilina, E., L. Tozzi, and T. Hastie (2023). Canonical correlation analysis in high dimensions with structured regularization. Statistical Modelling 23(3), 203–227.

[17] Wattenberg, M., F. Viégas, and I. Johnson (2016). How to use t-SNE effectively. https://distill.pub/2016/misread-tsne/

[18] Wong, M., D. Ong, F. Lim, K. Teng, N. McGovern, S. Narayanan, W. Ho, D. Cerny, H. Tan, R. Anicete, B. Tan, T. Lim, C. Chan, P. Cheow, S. Lee, A. Takano, E.-H. Tan, J. Tam, E. Tan, J. Chan, and E. Newell (2016). A high-dimensional atlas of human T cell diversity reveals tissue-specific trafficking and cytokine signatures. Immunity 45(2), 442–456.
