Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Xin Liao; Yiqun Zhang; Yiu-ming Cheung; Zihua Yang

arxiv: 2601.01162 · v3 · pith:Z4XHBC6Mnew · submitted 2026-01-03 · 💻 cs.LG · cs.AI· cs.CL

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Zihua Yang , Xin Liao , Yiqun Zhang , Yiu-ming Cheung This is my paper

Pith reviewed 2026-05-16 18:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords categorical clusteringsemantic embeddingslarge language modelsrepresentation learningsimilarity measuresunsupervised learningdata augmentation

0 comments

The pith

LLM-generated semantic descriptions of categorical values, when fused into embeddings, measurably improve clustering quality over standard methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Categorical data lack natural distances between values, so most clustering methods treat every pair of distinct values as equally dissimilar. This creates a semantic gap that hides real structure when data samples are few. ARISE lets an LLM write short descriptions of each attribute value and turns those texts into embeddings that are combined with the original categorical representation. The combined space is then used for clustering. On eight standard benchmark datasets the approach raises accuracy 19 to 27 percent above seven existing categorical clustering algorithms.

Core claim

ARISE constructs attention-weighted representations by first prompting a large language model to produce natural-language descriptions of every categorical attribute value, converting those descriptions into vector embeddings, and then linearly combining the embeddings with the original one-hot or frequency-based vectors. The resulting hybrid representation supplies external semantic context that the clustering algorithm can exploit, producing partitions that more closely match latent domain structure than partitions obtained from the unaugmented categorical metric space alone.

What carries the argument

ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), the mechanism that converts LLM text descriptions of attribute values into auxiliary embeddings and merges them with the original categorical vectors before clustering.

If this is right

Clustering quality rises on every tested benchmark when external semantic knowledge is added, even when the original data set is small.
The method works without changing the downstream clustering algorithm; only the input representation is altered.
Performance gains are largest on data sets where co-occurrence statistics alone are sparse or noisy.
The same LLM-augmented representation can be fed to any similarity-based categorical clustering routine.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to mixed numerical-categorical data if the numerical features are first discretized and then described by the LLM.
If the LLM descriptions are generated once and cached, the added cost per new data set becomes only the cost of the embedding lookup and linear combination.
Domain-specific fine-tuned language models could further reduce the risk of generic or hallucinated descriptions in specialized fields such as medicine or finance.

Load-bearing premise

The text descriptions produced by the LLM for each categorical value are accurate, unbiased, and relevant enough to the clustering task that they reliably supplement the original data without adding noise or systematic errors.

What would settle it

Run the same eight datasets through ARISE but replace the LLM descriptions with random or deliberately mismatched sentences; if clustering quality then drops to or below the level of the seven baseline methods, the semantic contribution of the LLM step is refuted.

Figures

Figures reproduced from arXiv: 2601.01162 by Xin Liao, Yiqun Zhang, Yiu-ming Cheung, Zihua Yang.

**Figure 1.** Figure 1: The semantic gap in categorical clustering. (a) Non-semantic representation treats all values as equidistant (d = 1), producing clusters with noticeable overlap. (b) Semantic-aware representation captures latent proximity: “oval” and “round” are highly similar (d = 0.2), and “oval” is closer to “irregular” (d = 0.7) than “round” is (d = 1), yielding improved cluster separation. As an unsupervised approac… view at source ↗

**Figure 2.** Figure 2: Overview of ARISE. The framework integrates a semantic view (top) and an identity view (bottom). The semantic view enriches representations via structured prompting with an LLM followed by attention-weighted encoding. The identity view preserves categorical distinctions via identity encoding. Both views are fused through adaptive feature fusion, where the weight α ∗ is selected based on cluster quality, to… view at source ↗

**Figure 3.** Figure 3: Runtime analysis. Impact of (a) instance count N, (b) attribute count M, and (c) unique value count |V| on execution time (log scale). The runtime in (c) includes offline description generation. 4.3 Ablation Study To isolate component contributions, three variants are evaluated. The variant w/o LLM removes the semantic view and relies solely on categorical identities. The variant w/o Attn replaces attentio… view at source ↗

**Figure 4.** Figure 4: UMAP visualization of cluster structures on MU. Panels (a)–(e) show projections from counterparts; panel (f) displays the ARISE representation. 4.4 Scalability Analysis [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

read the original abstract

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ARISE uses LLM descriptions of categorical values to improve clustering over co-occurrence baselines, with 19-27% gains on eight datasets, but the integration steps and robustness details remain thin.

read the letter

The main thing to know is that ARISE pulls semantic descriptions from LLMs for each categorical attribute value, builds enhanced embeddings from them, and combines those with the original data to find better clusters. It reports consistent 19-27% lifts over seven baselines across eight benchmarks, and the code is public on GitHub, which makes the claims checkable in practice. That external-knowledge angle is the clearest departure from prior work that stayed inside the dataset's co-occurrence statistics. The experiments cover enough standard datasets and methods to give a reasonable sense of where the gains appear, and releasing the implementation lowers the barrier for others to test it on their own categorical data. The soft spots sit in the missing mechanics. The abstract does not lay out the precise attention-weighting formula, how LLM outputs are prompted or stabilized across runs, or any handling of output variability. No error bars or significance tests are mentioned, so the reported gains are hard to judge for stability. The core bet that LLM descriptions supply reliable, complementary signal without domain mismatch or hallucination effects is plausible but unexamined in the summary. This paper targets practitioners who cluster real categorical tables in healthcare, marketing, or similar areas and are willing to add an LLM step. A reader who wants a working implementation to try on sparse data would get immediate value from the code. It shows enough empirical grounding and a distinct angle to deserve peer review rather than a desk reject, though any review would likely ask for expanded method details and basic robustness checks.

Referee Report

2 major / 2 minor

Summary. The paper proposes ARISE, a method that uses LLMs to generate semantic descriptions of categorical attribute values, enhances representations via attention-weighted integration of these embeddings with the original data, and demonstrates 19-27% clustering improvements over seven baselines on eight benchmark datasets.

Significance. If the empirical results hold under rigorous controls for LLM variability, the approach offers a practical way to inject external semantic knowledge into categorical clustering where intra-dataset co-occurrence statistics are sparse, with direct applicability to domains like healthcare and bioinformatics. The public code release is a strength that enables direct verification of the reported gains.

major comments (2)

[§4 Experiments] §4 Experiments: performance gains of 19-27% are reported without error bars, standard deviations across multiple LLM runs, or statistical significance tests (e.g., paired t-tests on NMI/ARI scores), which is load-bearing because LLM outputs are stochastic and the central claim of consistent improvement cannot be evaluated without these controls.
[§3 Method] §3 Method: the integration of LLM-generated embeddings with the original categorical metric space is described only at a high level (attention-weighted combination) without explicit equations or pseudocode for the fusion step, making it impossible to assess whether the semantic signal is truly complementary or simply reweighting existing features.

minor comments (2)

[Abstract] Abstract and §1: the acronym ARISE is expanded but the precise meaning of 'Attention-weighted Representation with Integrated Semantic Embeddings' is not tied to a specific equation or algorithm step, leaving readers to infer the weighting mechanism.
[§4 Experiments] §4: the seven baseline methods are listed but no reference is given for their exact implementations or hyperparameter settings used in the comparison, which affects reproducibility even with the released code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional statistical controls and methodological details.

read point-by-point responses

Referee: [§4 Experiments] §4 Experiments: performance gains of 19-27% are reported without error bars, standard deviations across multiple LLM runs, or statistical significance tests (e.g., paired t-tests on NMI/ARI scores), which is load-bearing because LLM outputs are stochastic and the central claim of consistent improvement cannot be evaluated without these controls.

Authors: We agree that the stochastic nature of LLM outputs requires explicit controls to substantiate the reported gains. In the revised manuscript we will rerun the embedding generation and clustering pipeline across multiple independent LLM calls (varying random seeds and sampling temperatures), report mean and standard deviation for NMI and ARI on all datasets, and include paired t-tests against each baseline to establish statistical significance of the improvements. revision: yes
Referee: [§3 Method] §3 Method: the integration of LLM-generated embeddings with the original categorical metric space is described only at a high level (attention-weighted combination) without explicit equations or pseudocode for the fusion step, making it impossible to assess whether the semantic signal is truly complementary or simply reweighting existing features.

Authors: We acknowledge that the fusion mechanism is currently described at a conceptual level. In the revision we will add the explicit mathematical equations for the attention-weighted integration of the LLM semantic embeddings with the original categorical feature representations, together with pseudocode for the complete ARISE procedure, so that readers can verify how the external semantic signal is combined with intra-dataset statistics. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation relies on external LLM-generated descriptions of categorical attribute values, which are combined with the original data metric space. No equation or step reduces by construction to a fitted parameter, self-citation chain, or renamed input. The central claim (improved clustering via semantic embeddings) is tested empirically on eight independent benchmarks against seven baselines, with the method remaining falsifiable via released code and external LLM calls. This is the standard non-circular case for an externally-augmented representation approach.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that external LLM knowledge supplies useful semantic context beyond dataset-internal patterns; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Large language models can generate accurate and useful semantic descriptions for categorical attribute values that improve clustering when combined with original data.
Invoked when stating that LLM is adopted to describe attribute values for representation enhancement.

pith-pipeline@v0.9.0 · 5509 in / 1056 out tokens · 39815 ms · 2026-05-16T18:02:03.643604+00:00 · methodology

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)