Pith Number

pith:Z4XHBC6M

pith:2026:Z4XHBC6MV64DVJTY3ARM5MU5UA
not attested · not anchored · not stored · refs pending

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Xin Liao, Yiqun Zhang, Yiu-ming Cheung, Zihua Yang

LLM-generated semantic descriptions of categorical values, when fused into embeddings, measurably improve clustering quality over standard methods.

arxiv:2601.01162 v3 · 2026-01-03 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{Z4XHBC6MV64DVJTY3ARM5MU5UA}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%.

C2 · weakest assumption

That LLM-generated descriptions of attribute values provide reliable, unbiased semantic knowledge that meaningfully complements the original categorical metric space without introducing hallucinations or domain mismatches.

C3 · one-line summary

ARISE integrates LLM-generated semantic embeddings with categorical data representations to bridge the similarity gap, achieving 19-27% gains over seven baselines on eight benchmark datasets.
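
To make the "similarity gap" in the summary concrete, here is a minimal toy sketch. It is not the ARISE method, whose details are not given on this page. It only illustrates the general idea of concatenating a categorical encoding with a semantic vector. The attribute values, the hand-picked two-dimensional `TOY_SEMANTIC` vectors, and the `weight` parameter are all hypothetical stand-ins for real LLM-derived description embeddings.

```python
import math

# Hypothetical values of a single categorical attribute.
VALUES = ["red", "crimson", "blue"]

# Hand-picked toy vectors standing in for embeddings of LLM-written
# descriptions. These numbers are illustrative, not real model output.
TOY_SEMANTIC = {
    "red": [1.0, 0.0],
    "crimson": [0.9, 0.1],
    "blue": [0.0, 1.0],
}


def one_hot(value):
    """Plain one-hot encoding: every pair of distinct values is equally far apart."""
    return [1.0 if value == v else 0.0 for v in VALUES]


def fused(value, weight=1.0):
    """Concatenate the one-hot encoding with the weighted semantic vector."""
    return one_hot(value) + [weight * x for x in TOY_SEMANTIC[value]]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(a, b)


# One-hot alone cannot tell that "crimson" is closer to "red" than "blue" is.
print(dist(one_hot("red"), one_hot("crimson")), dist(one_hot("red"), one_hot("blue")))

# After fusion, the semantically related pair is the closer one.
print(dist(fused("red"), fused("crimson")), dist(fused("red"), fused("blue")))
```

Any distance-based clustering algorithm run on the fused vectors would then see "red" and "crimson" as neighbours, which the one-hot space alone cannot express.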

Receipt and verification
First computed: 2026-05-29T01:05:01.000829Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

cf2e708bccafb83aa678d822ceb29da009baef00a24ca785070d7d0e0e3ce8ab

Aliases

arxiv: 2601.01162 · arxiv_version: 2601.01162v3 · doi: 10.48550/arxiv.2601.01162 · pith_short_12: Z4XHBC6MV64D · pith_short_16: Z4XHBC6MV64DVJTY · pith_short_8: Z4XHBC6M
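
The values on this page suggest how the identifiers relate. The full Pith Number matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. This is an observation drawn from this one record, not a documented rule, and the function name below is hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "cf2e708bccafb83aa678d822ceb29da009baef00a24ca785070d7d0e0e3ce8ab"


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of the digest and drop '=' padding.

    Assumption: inferred from this record's values, not from a published spec.
    """
    head = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(head).decode("ascii").rstrip("=")


full = pith_number_from_hash(CANONICAL_HASH)

# Short aliases are simple prefixes of the full identifier.
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)     # Z4XHBC6MV64DVJTY3ARM5MU5UA
print(aliases)
```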
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/Z4XHBC6MV64DVJTY3ARM5MU5UA \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cf2e708bccafb83aa678d822ceb29da009baef00a24ca785070d7d0e0e3ce8ab
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "49903dc327e67d5c8da8597bfd9a4a11ded5cd5e0024530f2bb8519115d29c73",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-01-03T11:37:46Z",
    "title_canon_sha256": "6f686b97d2f886e94fc24a8c91bc3d4f95a1849657e1dd435ab6b20bbc83a3d8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.01162",
    "kind": "arxiv",
    "version": 3
  }
}
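
For readers who prefer a script to the shell pipeline, the same canonicalization can be sketched offline in Python. It applies the serialization settings from the one-liner (`sort_keys=True`, compact separators, `ensure_ascii=False`) to the record as printed above. The function name `canonical_hash` is hypothetical. Whether the printed record reproduces the expected digest depends on it matching the API response exactly, so no output is claimed here; compare the result against the canonical hash yourself.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 over key-sorted, compact JSON, mirroring the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record exactly as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "49903dc327e67d5c8da8597bfd9a4a11ded5cd5e0024530f2bb8519115d29c73",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-01-03T11:37:46Z",
        "title_canon_sha256": "6f686b97d2f886e94fc24a8c91bc3d4f95a1849657e1dd435ab6b20bbc83a3d8",
    },
    "schema_version": "1.0",
    "source": {"id": "2601.01162", "kind": "arxiv", "version": 3},
}

digest = canonical_hash(record)
print(digest)  # compare with the canonical hash listed above
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields arrive, which is what lets a mirror recompute it independently.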