UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes; John Healy; James Melville

arXiv: 1802.03426 · v3 · submitted 2018-02-09 · stat.ML · cs.CG · cs.LG

Pith reviewed 2026-05-10 17:06 UTC · model grok-4.3

keywords: dimension reduction · manifold learning · data visualization · UMAP · t-SNE · machine learning · topological methods

The pith

UMAP is competitive with t-SNE in visualization quality, runs faster, and arguably preserves more of the global structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UMAP as a new manifold learning technique for dimension reduction. It derives the method from a framework in Riemannian geometry and algebraic topology to produce a practical and scalable algorithm for real data. A sympathetic reader would care because the approach promises effective visualization of complex datasets along with use in general machine learning tasks where output dimension is not restricted.

Core claim

UMAP is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.

What carries the argument

The UMAP algorithm, which constructs a topological model of the data manifold from local geometric information for projection into lower dimensions.

If this is right

It can replace t-SNE for visualization tasks on large datasets while running faster.
It supports dimension reduction to any embedding dimension, with no computational restriction on the target dimension.
It serves as a general preprocessing step in machine learning pipelines for high-dimensional data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Fields handling very large datasets such as single-cell biology could gain new exploratory capabilities.
The method might combine with supervised learning models to improve feature extraction.
Tests on streaming data could show whether the approach extends beyond static datasets.

Load-bearing premise

The theoretical framework based in Riemannian geometry and algebraic topology can be translated into a practical scalable algorithm that achieves the claimed performance advantages over existing methods like t-SNE.

What would settle it

Benchmark runs on standard high-dimensional datasets where UMAP produces visualizations with less cluster separation than t-SNE or requires more computation time.

Original abstract

UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UMAP gives a clean, usable algorithm for dimension reduction that is faster than t-SNE and keeps more global structure, derived from a Riemannian-plus-fuzzy-simplicial-set construction.

UMAP is worth your time if you need a dimension reduction step that scales better than t-SNE and does not collapse global layout as often. The paper turns local nearest-neighbor distances into a fuzzy simplicial set using a Riemannian metric approximation, then minimizes cross-entropy to a low-dimensional embedding. That construction is spelled out in section 2 and turned into concrete steps in section 3, which is the part that actually matters for use.
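
As a rough illustration of the objective named in the letter, the fuzzy-set cross-entropy between input-space and embedding-space membership strengths can be sketched in plain Python. The function name `fuzzy_cross_entropy`, the dict-of-edges representation, and the `eps` clamp are assumptions made for this sketch, not the authors' implementation.

```python
import math


def fuzzy_cross_entropy(high, low, eps=1e-12):
    """Cross-entropy between two fuzzy edge sets defined on the same edges.

    high, low: dicts mapping an edge (i, j) to a membership strength in [0, 1].
    `high` plays the role of the input-space graph, `low` the embedding.
    """
    total = 0.0
    for edge, w_h in high.items():
        # Clamp both strengths away from 0 and 1 so the logarithms stay finite.
        w_h = min(max(w_h, eps), 1.0 - eps)
        w_l = min(max(low.get(edge, 0.0), eps), 1.0 - eps)
        # Attractive term: large when a strong input edge is weak in the embedding.
        total += w_h * math.log(w_h / w_l)
        # Repulsive term: large when a weak input edge is strong in the embedding.
        total += (1.0 - w_h) * math.log((1.0 - w_h) / (1.0 - w_l))
    return total


# Hypothetical toy graph: one strong edge and one weak edge.
high = {(0, 1): 0.9, (0, 2): 0.1}
faithful = {(0, 1): 0.9, (0, 2): 0.1}   # embedding agrees with the input
swapped = {(0, 1): 0.1, (0, 2): 0.9}    # embedding inverts the strengths

print(fuzzy_cross_entropy(high, faithful))
print(fuzzy_cross_entropy(high, swapped))
```

The faithful layout incurs no penalty, while the swapped one is penalized by both terms. A practical optimizer would minimize this quantity over embedding coordinates rather than evaluate it on fixed strengths.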

Referee Report

0 major / 3 minor

Summary. The manuscript introduces UMAP, a dimension-reduction algorithm derived from a Riemannian-geometry and algebraic-topology framework. Local manifold structure is approximated by k-nearest-neighbor graphs that are converted into fuzzy simplicial sets; a cross-entropy objective is then minimized to obtain a low-dimensional embedding. The authors claim that the resulting method matches t-SNE visualization quality, preserves global structure more faithfully, runs faster, and admits arbitrary embedding dimensions, thereby serving as a general-purpose ML preprocessing tool.
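
One step the summary compresses is how directed k-nearest-neighbor memberships become a single undirected fuzzy graph. The paper describes combining them with a fuzzy union; a minimal sketch using the probabilistic t-conorm a + b - ab follows. The name `fuzzy_union` and the edge-dict layout are hypothetical choices for illustration.

```python
def fuzzy_union(directed):
    """Symmetrize directed membership strengths.

    directed: dict mapping (i, j) to the strength of the edge from i to j.
    Returns a dict keyed by (min(i, j), max(i, j)) holding a + b - a * b,
    where a and b are the strengths in the two directions (0.0 if absent).
    """
    symmetric = {}
    for (i, j), a in directed.items():
        key = (min(i, j), max(i, j))
        if key in symmetric:
            continue  # already combined when the reverse edge was visited
        b = directed.get((j, i), 0.0)
        symmetric[key] = a + b - a * b
    return symmetric


# Hypothetical directed strengths: 0 -> 1 is certain, 1 -> 0 is weaker,
# and 1 -> 2 has no reverse edge.
directed = {(0, 1): 1.0, (1, 0): 0.5, (1, 2): 0.4}
undirected = fuzzy_union(directed)
print(undirected)
```

An edge that is certain in either direction stays certain after the union, and a one-sided edge keeps its original strength.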

Significance. If the performance claims are substantiated, UMAP supplies a theoretically grounded, scalable alternative to t-SNE that is immediately useful for visualization of large data sets and for dimension reduction prior to downstream learning tasks. The explicit construction of the fuzzy simplicial set and the provision of both the derivation (Section 2) and the implementable algorithm (Section 3) constitute a clear strength.

minor comments (3)

[Section 4.1] The quantitative comparison tables would benefit from reporting both mean and standard deviation over multiple random seeds rather than single-run results.
[Figure 3] Caption: the precise values of the UMAP hyperparameters (n_neighbors, min_dist, etc.) used for each panel should be stated explicitly.
[Section 2.2] The notation for the fuzzy simplicial set membership strengths could be introduced with a short reminder of the exponential kernel definition to aid readers unfamiliar with the topological construction.
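
For the reminder requested in the Section 2.2 comment, the membership strength of the edge from a point to its j-th nearest neighbor is commonly written as exp(-max(0, d_j - rho) / sigma), where rho is the distance to the nearest neighbor and sigma is chosen so that the k strengths sum to log2(k). The sketch below illustrates that reading in plain Python; the function name, the bisection bounds, and the iteration count are assumptions, not the reference implementation.

```python
import math


def membership_strengths(dists, n_iter=64):
    """Exponential-kernel strengths for one point.

    dists: ascending distances from the point to its k nearest neighbors.
    Returns (rho, sigma, weights) with sum(weights) close to log2(k).
    """
    k = len(dists)
    rho = dists[0]           # distance to the nearest neighbor
    target = math.log2(k)    # normalization target for the strengths

    def weights(sigma):
        return [math.exp(-max(0.0, d - rho) / sigma) for d in dists]

    # Bracket sigma: the sum of strengths grows monotonically with sigma.
    lo, hi = 1e-6, 1.0
    while sum(weights(hi)) < target:
        hi *= 2.0
    # Bisection on sigma until the bracket is negligibly small.
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(weights(mid)) < target:
            lo = mid
        else:
            hi = mid
    sigma = (lo + hi) / 2.0
    return rho, sigma, weights(sigma)


# Hypothetical neighbor distances for a single point, k = 5.
rho, sigma, w = membership_strengths([0.5, 1.0, 1.5, 2.0, 3.0])
print(rho, sigma, w)
```

Subtracting rho gives the nearest neighbor strength 1, which ties each point to at least one other point; the per-point sigma adapts the kernel width to local density.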

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, assessment of significance, and recommendation to accept the manuscript.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The UMAP construction begins from an explicit Riemannian manifold approximation via local k-NN distance estimates converted to fuzzy simplicial sets (Section 2), followed by a cross-entropy minimization objective in the target embedding space (Section 3). These steps are derived from algebraic topology and geometry without reducing to fitted parameters renamed as predictions or to self-citations that carry the central claim. Empirical comparisons in Section 4 are presented as validation rather than as the source of the algorithm itself. No load-bearing step equates the output to the input by construction, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; full details of any parameters or assumptions not available.

axioms (1)

domain assumption: A theoretical framework based in Riemannian geometry and algebraic topology can be used to construct a practical dimension reduction algorithm.
Directly stated in the abstract as the construction basis for UMAP.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: UMAP uses local manifold approximations and patches together their local fuzzy simplicial set representations to construct a topological representation of the high dimensional data... optimize the layout... to minimize the cross-entropy

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: We seek to address the issue of uniform data distributions on manifolds through a combination of Riemannian geometry and the work of David Spivak in category theoretic approaches to geometric realization of fuzzy simplicial sets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment
cs.CV 2026-04 unverdicted novelty 8.0

The first public dataset of 10,217 GPT-Image-2 generated images sourced from Twitter in the week after release, with CLIP taxonomy, OCR, face detection, clustering analyses, and a finding that C2PA provenance data is ...
On the continuum limit of t-SNE for data visualization
stat.ML 2026-04 unverdicted novelty 8.0

t-SNE converges in the large-data limit to a non-convex variational energy with attraction and repulsion terms that admits a unique smooth minimizer but infinitely many discontinuous ones in one dimension.
Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
cs.CV 2026-04 unverdicted novelty 8.0

Adversarial smuggling attacks encode harmful content into human-readable visuals that evade MLLM detection, achieving over 90% attack success rates on models like GPT-5 and Qwen3-VL via the new SmuggleBench benchmark.
Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks
cs.CR 2026-01 unverdicted novelty 8.0

FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Discovering Language Model Behaviors with Model-Written Evaluations
cs.CL 2022-12 unverdicted novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
#PraCegoVer: A Large Dataset for Image Captioning in Portuguese
cs.CV 2021-03 unverdicted novelty 8.0

The paper introduces #PraCegoVer, the first large-scale image captioning dataset in Portuguese sourced from Instagram posts with single user-generated captions per image.
Contrastive self-supervised convolutional autoencoder for core-collapse supernova gravitational-wave detection
gr-qc 2026-05 unverdicted novelty 7.0

A contrastive self-supervised convolutional autoencoder detects core-collapse supernova gravitational waves with performance comparable to supervised CNNs, better generalization to unseen waveforms, and ~120 kpc sensi...
MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space
cs.CV 2026-05 unverdicted novelty 7.0

MAPS provides 2618 validated 3D meshes and a controllable rendering pipeline to attribute vision model recognition failures to specific scene parameters, finding camera distance and elevation as the dominant failure f...
Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches
cs.CL 2026-05 unverdicted novelty 7.0

A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political co...
Continual Learning of Domain-Invariant Representations
cs.LG 2026-05 unverdicted novelty 7.0

Introduces replay-based continual learning with sequential invariance alignment to learn domain-invariant representations, outperforming baselines on generalization to unseen domains across six datasets in vision, med...
Determining star formation histories and age-metallicity relations with convolutional neural networks
astro-ph.GA 2026-05 unverdicted novelty 7.0

A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spec...
PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users
cs.CL 2026-05 unverdicted novelty 7.0

Preference fine-tuning outperforms prompting for personalisation but amplifies sycophancy and relationship-seeking, while simulated users recover aggregate rankings yet show far lower self-consistency and different to...
Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Spectral Gradient Surgery disentangles class-discriminative and domain-specific signals in distribution-matching distilled datasets by analyzing gradient agreement in the spectral domain, yielding better out-of-distri...
scShapeBench: Discovering geometry from high dimensional scRNAseq data
cs.LG 2026-05 unverdicted novelty 7.0

scShapeBench supplies synthetic and real annotated single-cell datasets across four shape categories, with scReebTower outperforming PAGA and Mapper on topology-aware metrics.
Much of Geospatial Web Search Is Beyond Traditional GIS
cs.IR 2026-05 unverdicted novelty 7.0

Analysis of 1.01 million unfiltered Bing queries identifies 18% as geospatial, dominated by transactional categories like costs (15.3%) that exceed traditional GIS scope.
Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction
astro-ph.IM 2026-05 unverdicted novelty 7.0

LLMs prompted with increasing levels of text on TNO spectral reconstruction from photometry reveal an entropy floor where implementation variance persists, showing text alone cannot capture all tacit expert knowledge ...
An Experimental Method to Study Opinion Diffusion in Human-AI Hybrid Societies
cs.SI 2026-05 unverdicted novelty 7.0

Hybrid human-AI networks in 5x5 grids reached lower final polarization than human-only networks after eight rounds of opinion revision on polarizing topics.
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
cs.CV 2026-05 unverdicted novelty 7.0

A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts
cs.CV 2026-05 unverdicted novelty 7.0

eX2L improves robustness to distribution shifts by penalizing similarity between Grad-CAM maps of a label classifier and a confounder classifier, reaching new SOTA average and worst-group accuracy on the Spawrious benchmark.
Knowing when to trust machine-learned interatomic potentials
cs.LG 2026-05 unverdicted novelty 7.0

PROBE recasts MLIP uncertainty quantification as selective classification by training a compact discriminative classifier on frozen per-atom backbone embeddings, yielding a reliability probability that tracks actual e...
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
cs.CV 2026-04 unverdicted novelty 7.0

Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support
cs.CL 2026-04 unverdicted novelty 7.0

A cross-cultural survey finds LLM emotional support adoption ranges from 20% to 59% by country, with positive perceptions strongest among higher-SES, religious, married adults aged 25-44 and in English-speaking nations.
From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support
cs.CL 2026-04 unverdicted novelty 7.0

Cross-cultural survey of 4,641 participants shows LLM emotional support adoption varies widely by country and demographics, with socioeconomic status as strongest predictor of trust and use, and English-speaking natio...
GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment
cs.CV 2026-04 accept novelty 7.0

The first public dataset of 10,217 GPT-image-2 AI-generated images from Twitter, with CLIP taxonomy, OCR, face detection, and clustering analyses, plus the finding that C2PA credentials are stripped by the platform.
The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook
cs.CY 2026-04 unverdicted novelty 7.0

Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.
Participatory provenance as representational auditing for AI-mediated public consultation
cs.AI 2026-04 unverdicted novelty 7.0

Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment
cs.CV 2026-04 unverdicted novelty 7.0

RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.
Neighbor Embedding for High-Dimensional Sparse Poisson Data
stat.ML 2026-04 unverdicted novelty 7.0

p-SNE embeds sparse Poisson count data into low dimensions by using KL divergence between Poisson distributions to measure pairwise dissimilarity and Hellinger distance to optimize the layout.
Physics-informed, Generative Adversarial Design of Funicular Shells
cs.CE 2026-04 unverdicted novelty 7.0

A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
cs.CL 2026-04 unverdicted novelty 7.0

MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
cs.CL 2026-04 unverdicted novelty 7.0

Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
L-fuzzy simplicial homology
math.AT 2026-04 unverdicted novelty 7.0

L-fuzzy simplicial homology generalizes simplicial homology to L-fuzzy subcomplexes by assigning values from a completely distributive lattice L to simplices and deriving associated homology modules.
Emotion Concepts and their Function in a Large Language Model
cs.AI 2026-04 unverdicted novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Dynamic Context Evolution for Scalable Synthetic Data Generation
cs.CL 2026-04 conditional novelty 7.0

Dynamic Context Evolution prevents cross-batch mode collapse in LLMs by combining model self-assessment for idea filtering, embedding-based deduplication, and evolving prompts, yielding zero collapse and consistently ...
Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification
cs.CV 2026-04 unverdicted novelty 7.0

A new diagnostic framework using inpainted context ratios and laterality checks on a Pantanal jaguar benchmark reveals whether re-ID models depend on coat patterns or spurious background evidence.
Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
cs.CR 2026-03 unverdicted novelty 7.0

SABLE shows that semantics-aware natural triggers enable effective backdoor attacks in federated learning against multiple aggregation rules while preserving benign accuracy.
A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing Data
q-bio.GN 2026-03 unverdicted novelty 7.0

A large benchmark finds traditional imputation methods for scRNA-seq data generally outperform deep learning ones, but numerical recovery does not reliably improve biological downstream analyses and no method wins acr...
Large Language Lovers: Lived Experiences of Negotiating Agency and Platform Control in AI Companionship
cs.HC 2026-01 accept novelty 7.0

Users form AI companion relationships by negotiating perceived companion agency against platform constraints and use steering tactics like custom instructions or platform switching to cope with model updates that disr...
D-MODD: A Diffusion Model of Opinion Dynamics Derived from Online Data
physics.soc-ph 2026-01 unverdicted novelty 7.0

D-MODD is a data-derived Langevin stochastic differential equation whose transition kernel reproduces the one-step opinion change probabilities observed in social media data on a polarized climate topic.
Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives
cs.CL 2025-12 conditional novelty 7.0

Gemini Pro LLMs outperformed mental health professionals overall (65.48% vs 43.57%) on BPD and NPD diagnosis from personal stories but severely underdiagnosed NPD (F1 6.7 vs 50.0) due to reluctance toward the term narcissism.
Language-Conditioned Safe Trajectory Generation for Spacecraft Rendezvous
cs.RO 2025-12 unverdicted novelty 7.0

SAGES translates natural-language commands into constraint-respecting spacecraft trajectories, achieving over 90% semantic-behavioral consistency in proximity operations and robotic tests.
VIDEOP2R: Video Understanding from Perception to Reasoning
cs.CV 2025-11 conditional novelty 7.0

VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
Scaling Vision Transformers for Functional MRI with Flat Maps
cs.CV 2025-10 conditional novelty 7.0

CortexMAE adapts Vision Transformers to fMRI via cortical flat maps, shows power-law scaling on 2.1K hours of data, and outperforms priors on cognitive state decoding while failing to beat a simple functional connecti...
Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC 2025-09 conditional novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI 2024-06 conditional novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
Scaling and evaluating sparse autoencoders
cs.LG 2024-06 unverdicted novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
Geodesic Learning via Unsupervised Decision Forests
stat.ML 2019-07 unverdicted novelty 7.0

URerF uses unsupervised decision forests on sparse linear feature combinations to estimate geodesic distances robustly under high-dimensional noise, outperforming Isomap, UMAP, and FLANN on simulated and connectome data.
When One Point Is Not Enough: Addressing Ambiguous Instances in Dimensionality Reduction by Splitting
cs.LG 2026-05 unverdicted novelty 6.0

A graph-based technique splits ambiguous instances into multiple points in DR projections to reduce partial neighborhood embedding and reveal hidden memberships.
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
cs.CV 2026-05 unverdicted novelty 6.0

AnyMo uses physics-grounded IMU simulation over dense body placements, graph encoder pre-training, and LLM alignment to enable setup-agnostic motion modeling, reporting gains on zero-shot HAR, retrieval, and captionin...
The General Theory of Localization Methods
cs.LG 2026-05 unverdicted novelty 6.0

The localization method unifies kernel methods, local learning algorithms, MeanShift, Hopfield networks, and Transformers through local models, localization tricks, and hierarchical extensions.
Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites
cs.CV 2026-05 unverdicted novelty 6.0

p-ResNet-50 adds a prototype layer with anchor- and medoid-based regularizations to ResNet-50, achieving ROC-AUC 0.994 and accuracy 0.957 on ~12k XCT patches while supplying case-based explanations aligned to expert c...
Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South
cs.CY 2026-05 unverdicted novelty 6.0

A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
cs.CL 2026-05 conditional novelty 6.0

A benchmark of commercial ASR on four code-switching language pairs finds ElevenLabs Scribe v2 best at 13.2% WER overall, with BERTScore recommended over WER for Arabic and Persian due to transliteration issues.
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
cs.CL 2026-05 conditional novelty 6.0

New benchmark evaluates commercial ASR on four code-switching language pairs with public dataset, finding ElevenLabs Scribe v2 lowest WER at 13.2% overall and highest BERTScore at 0.936.
SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
cs.CV 2026-05 unverdicted novelty 6.0

SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47....
Topo-GS: Continuous Volumetric Embedding of High-Dimensional Data via Topological Gaussian Splatting
cs.GR 2026-05 unverdicted novelty 6.0

Topo-GS repurposes 3D Gaussian Splatting with local geometric constraints and topology-aware losses to produce continuous volumetric embeddings of high-dimensional data.
DiLA: Disentangled Latent Action World Models
cs.CV 2026-05 unverdicted novelty 6.0

DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.
Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection
cs.CV 2026-05 unverdicted novelty 6.0

Contrastive-SDXL augments daytime images into realistic night-time versions using SDXL-Turbo with LoRA and multi-level DINOv2 contrastive losses, yielding 6-7% lower miss rate on pedestrian detection versus daytime-on...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 200 Pith papers · 1 internal anchor

[1]

Pen-based recognition of handwrit- ten digits data set

E Alpaydin and Fevzi Alimoglu. Pen-based recognition of handwrit- ten digits data set. university of california, irvine. Machine Learning Repository. Irvine: University of California , 4(2), 1998

work page 1998
[2]

Bloodspot: a database of healthy and malignant haematopoiesis updated with pu- ri/f_ied and single cell mrna sequencing pro/f_iles.Nucleic Acids Research, 2018

Frederik Otzen Bagger, Savvas Kinalis, and Nicolas Rapin. Bloodspot: a database of healthy and malignant haematopoiesis updated with pu- ri/f_ied and single cell mrna sequencing pro/f_iles.Nucleic Acids Research, 2018

work page 2018
[3]

Fuzzy set theory and topos theory

Michael Barr. Fuzzy set theory and topos theory. Canad. Math. Bull , 29(4):501–508, 1986

work page 1986
[4]

Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell

Etienne Becht, Charles-Antoine Dutertre, Immanuel W.H. Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Evaluation of umap as an alternative to t-sne for single-cell data. bioRxiv, 2018

work page 2018
[5]

Dimensionality reduction for visualizing single-cell data using umap

Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38, 2019

work page 2019
[6]

Laplacian eigenmaps and spec- tral techniques for embedding and clustering

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spec- tral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002

work page 2002
[7]

Laplacian eigenmaps for dimen- sionality reduction and data representation

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimen- sionality reduction and data representation. Neural computation , 15(6):1373–1396, 2003

work page 2003
[8]

A Survey on Metric Learning for Feature Vectors and Structured Data

Aur ´elien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data.arXiv preprint arXiv:1306.6709, 2013

work page Pith review arXiv 2013
[9]

Omip-018: Chemokine receptor expression on human t helper cells

Tess Brodie, Elena Brenna, and Federica Sallusto. Omip-018: Chemokine receptor expression on human t helper cells. Cytometry Part A, 83(6):530–532, 2013

work page 2013
[10]

Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: ...

[11]

John N Campbell, Evan Z Macosko, Henning Fenselau, Tune H Pers, Anna Lyubetskaya, Danielle Tenen, Melissa Goldman, Anne MJ Verstegen, Jon M Resch, Steven A McCarroll, et al. A molecular census of arcuate hypothalamus and median eminence cell types. Nature neuroscience, 20(3):484, 2017.

[12]

Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M Ibrahim, Andrew J Hill, Fan Zhang, Stefan Mundlos, Lena Christiansen, Frank J Steemers, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature, page 1, 2019.

[13]

Gunnar Carlsson and Facundo Mémoli. Classifying clustering schemes. Foundations of Computational Mathematics, 13(2):221–252, 2013.

[14]

Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas. Distill, 2019. https://distill.pub/2019/activation-atlas

[15]

Brian Clark, Genevieve Stein-O’Brien, Fion Shiau, Gabrielle Cannon, Emily Davis, Thomas Sherman, Fatemeh Rajaii, Rebecca James-Esposito, Richard Gronostajski, Elana Fertig, et al. Comprehensive analysis of retinal development at single cell resolution identifies nfi factors as essential for mitotic exit and specification of late-born cells. bio...

[16]

Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.

[17]

Alex Diaz-Papkovich, Luke Anderson-Trocme, and Simon Gravel. Revealing multi-scale population structure in large cohorts. bioRxiv, page 423632, 2018.

[18]

Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 577–586, New York, NY, USA, 2011. ACM.

[19]

Carlos Escolano, Marta R Costa-jussà, and José AR Fonollosa. (self-attentive) autoencoder-based universal language representation for machine translation. arXiv preprint arXiv:1810.06351, 2018.

[20]

Mateus Espadoto, Nina ST Hirata, and Alexandru C Telea. Deep learning multidimensional projections. arXiv preprint arXiv:1902.07958, 2019.

[21]

Mateus Espadoto, Francisco Caio M Rodrigues, and Alexandru C Telea. Visual analytics of multidimensional projections for constructing classifier decision boundary maps.

[22]

Greg Friedman et al. Survey article: an elementary illustrated introduction to simplicial sets. Rocky Mountain Journal of Mathematics, 42(2):353–423, 2012.

[23]

Lukas Fuhrimann, Vahid Moosavi, Patrick Ole Ohlbrock, and Pierluigi Dacunto. Data-driven design: Exploring new structural forms using machine learning and graphic statics. arXiv preprint arXiv:1809.08660, 2018.

[24]

Benoit Gaujac, Ilya Feige, and David Barber. Gaussian mixture models with wasserstein distance. arXiv preprint arXiv:1806.04465, 2018.

[25]

Paul G Goerss and John F Jardine. Simplicial homotopy theory. Springer Science & Business Media, 2009.

[26]

Matthias Hein, Jean-Yves Audibert, and Ulrike von Luxburg. Graph laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.

[27]

Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.

[28]

Dmitry Kobak and Philipp Berens. The art of using t-sne for single-cell transcriptomics. Nature communications, 10(1):1–14, 2019.

[29]

Dmitry Kobak and George C Linderman. Umap does not preserve global structure any better than t-sne when using the same initialization. bioRxiv, 2019.

[30]

J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, Mar 1964.

[31]

Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM ’15, pages 7:1–7:6, New York, NY, USA, 2015. ACM.

[32]

Yann Lecun and Corinna Cortes. The MNIST database of handwritten digits.

[33]

John A Lee and Michel Verleysen. Shift-invariant similarities circumvent distance concentration in stochastic neighbor embedding and variants. Procedia Computer Science, 4:538–547, 2011.

[34]

Xin Li, Ondrej E Dyck, Mark P Oxley, Andrew R Lupini, Leland McInnes, John Healy, Stephen Jesse, and Sergei V Kalinin. Manifold learning of four-dimensional scanning transmission electron microscopy. npj Computational Materials, 5(1):5, 2019.

[35]

M. Lichman. UCI machine learning repository, 2013

[36]

George Linderman. Fit-sne. https://github.com/KlugerLab/FIt-SNE, 2018.

[37]

George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Efficient algorithms for t-distributed stochastic neighborhood embedding. arXiv preprint arXiv:1712.09005, 2017.

[38]

George C Linderman and Stefan Steinerberger. Clustering with t-sne, provably. SIAM Journal on Mathematics of Data Science, 1(2):313–332, 2019.

[39]

Saunders Mac Lane. Categories for the working mathematician, volume 5. Springer Science & Business Media, 2013.

[40]

J Peter May. Simplicial objects in algebraic topology, volume 11. University of Chicago Press, 1992.

[41]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[42]

Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B Burkhardt, William S Chen, Kristina Yim, Antonia van den Elzen, Matthew J Hirn, Ronald R Coifman, et al. Visualizing structure and transitions in high-dimensional biological data. Nature biotechnology, 37(12):1482–1492, 2019.

[43]

Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object image library (coil-20). Technical report, 1996.

[44]

Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object image library (coil-100). Technical report, 1996.

[45]

Karolyn A Oetjen, Katherine E Lindblad, Meghali Goswami, Gege Gui, Pradeep K Dagur, Catherine Lai, Laura W Dillon, J Philip McCoy, and Christopher S Hourigan. Human bone marrow assessment by single cell rna sequencing, mass cytometry and flow cytometry. bioRxiv, 2018.

[46]

Jong-Eun Park, Krzysztof Polanski, Kerstin Meyer, and Sarah A Teichmann. Fast batch alignment of single cell transcriptomes unifies multiple mouse cell atlases into an integrated landscape. bioRxiv, page 397042, 2018.

[47]

Jose Daniel Gallego Posada. Simplicial autoencoders. 2018.

[48]

Emily Riehl. A leisurely introduction to simplicial sets. Unpublished expository article available online at http://www.math.harvard.edu/~eriehl, 2011.

[49]

Emily Riehl. Category theory in context. Courier Dover Publications, 2017.

[50]

John W Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5):401–409, 1969.

[51]

Josef Spidlen, Karin Breuer, Chad Rosenberg, Nikesh Kotecha, and Ryan R Brinkman. Flowrepository: A resource of annotated flow cytometry datasets associated with peer-reviewed publications. Cytometry Part A, 81(9):727–731, 2012.

[52]

David I Spivak. Metric realization of fuzzy simplicial sets. Self published notes, 2012.

[53]

Jian Tang. Largevis. https://github.com/lferry007/LargeVis, 2016.

[54]

Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, pages 287–297. International World Wide Web Conferences Steering Committee, 2016.

[55]

Joshua B. Tenenbaum. Mapping a manifold of perceptual observations. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 682–688. MIT Press, 1998.

[56]

Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[57]

Dmitry Ulyanov. Multicore-tsne. https://github.com/DmitryUlyanov/Multicore-TSNE, 2016.

[58]

Laurens van der Maaten. Accelerating t-sne using tree-based algorithms. Journal of machine learning research, 15(1):3221–3245, 2014.

[59]

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[60]

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[61]

John Williamson. What do numbers look like? https://johnhw.github.io/umap_primes/index.md.html, 2018.

[62]

Duoduo Wu, Joe Yeong, Grace Tan, Marion Chevrier, Josh Loh, Tony Lim, and Jinmiao Chen. Comparison between umap and t-sne for multiplex-immunofluorescence derived single-cell data from tissue sections. bioRxiv, page 549659, 2019.

[63]

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

[64]

Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Michigan State University, 2(2):4, 2006.

[65]

Lotfi A Zadeh. Fuzzy sets. Information and control, 8(3):338–353, 1965.
