pith. machine review for the scientific record.

arxiv: 2603.28816 · v2 · submitted 2026-03-28 · 💻 cs.DL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering

Pith reviewed 2026-05-14 21:47 UTC · model grok-4.3

classification 💻 cs.DL cs.AI
keywords art-technology institutions, conceptual axes, text embeddings, unsupervised clustering, institutional mapping, cultural institutions, clustering analysis, latent topics

The pith

An eight-axis framework combined with text embeddings clusters 78 art-technology institutions into ten coherent groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ASTRA, a method for mapping the diverse world of art-technology institutions from qualitative descriptions written along eight conceptual axes. These descriptions are turned into numerical vectors through sentence embeddings and then grouped using dimensionality reduction and clustering algorithms. The resulting clusters show clear patterns, such as an art-science hub around ZKM and a cluster of academic computing conferences. A sympathetic reader would care because the method offers a data-driven way to navigate the connections in this growing field, which could help curators and policymakers see the bigger picture.

Core claim

The ASTRA methodology applies an eight-axis conceptual framework to characterize 78 art-technology institutions, encodes the qualitative descriptions using E5-large-v2 embeddings, reduces dimensions with UMAP, and clusters them with average-linkage agglomerative clustering at k=10. This produces a composite score of 0.825, a silhouette coefficient of 0.803, and a Calinski-Harabasz index of 11196, yielding coherent groupings including an art-science hub anchored by ZKM, an innovation cluster with Ars Electronica, an ACM academic cluster, and an electronic music cluster.
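To make the final clustering step concrete, here is a minimal pure-Python sketch of average-linkage agglomerative clustering. It is an illustration under stated assumptions, not the paper's code: the embedding and UMAP steps are not reproduced, `toy_points` is a hypothetical stand-in for reduced coordinates, and the function name `average_linkage` is invented for this example.

```python
import math
from itertools import combinations


def average_linkage(points, k):
    """Merge the two closest clusters until k remain; return one label per point."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between members of two clusters.
        total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Hypothetical 2D coordinates standing in for UMAP output (assumption).
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9),
              (10.0, 0.0), (10.2, 0.1)]
toy_labels = average_linkage(toy_points, k=3)
print(toy_labels)
```

The brute-force search over all cluster pairs is slow on large inputs but is adequate for a dataset of 78 institutions; a production pipeline would more likely call a library implementation.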

What carries the argument

The eight-axis conceptual framework (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, and Disciplinary Positioning) combined with E5-large-v2 sentence embeddings and UMAP-based agglomerative clustering.

If this is right

  • Curators and researchers can explore institutional similarities and cross-disciplinary connections using the interactive React-based tool.
  • Neighbor-cluster entropy identifies boundary institutions that bridge multiple thematic communities.
  • Non-negative matrix factorization extracts ten latent topics from the encoded descriptions.
  • The pipeline yields specific coherent clusters including art-science hubs anchored by ZKM and an ACM academic cluster.
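The neighbor-cluster entropy mentioned above can be illustrated with a short sketch. The paper's exact definition is not given on this page, so the version below is an assumption: for each point, take the Shannon entropy (in bits) of the cluster labels among its nearest neighbors. The function name, the `n_neighbors` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def neighbor_cluster_entropy(points, labels, n_neighbors=2):
    """Shannon entropy of cluster labels among each point's nearest neighbors."""
    entropies = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        counts = Counter(labels[j] for j in others[:n_neighbors])
        total = sum(counts.values())
        # Zero when all neighbors share one cluster; higher when they are mixed.
        h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical 1D layout: two groups with one point sitting between them.
toy_points = [(0,), (1,), (2,), (3,), (6,), (9,), (10,), (11,), (12,)]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_entropy = neighbor_cluster_entropy(toy_points, toy_labels, n_neighbors=2)
boundary = [i for i, h in enumerate(toy_entropy) if h > 0]
print(boundary)
```

In this toy layout only the point at position 6 has neighbors from both groups, so it is the only one flagged as a boundary case.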

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adapting the eight-axis framework to institutions in adjacent fields such as scientific research labs could generate comparable maps.
  • Tracking how new or evolving institutions enter or move between clusters over successive years would reveal shifts in the overall landscape.
  • The identified groupings could inform targeted collaboration or funding strategies by highlighting both similar peers and bridging organizations.

Load-bearing premise

The eight conceptual axes and the qualitative descriptions collected for each institution capture the multidimensional characteristics without significant omission or bias.

What would settle it

If re-running the embedding and clustering steps on the same qualitative descriptions produced substantially different cluster assignments, or markedly lower validation scores (for example, a silhouette coefficient below 0.7), that would indicate the groupings are not stable or coherent.
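A check of this kind needs a way to recompute the silhouette coefficient on a rerun. The sketch below implements the standard mean silhouette in pure Python on hypothetical toy coordinates; the 0.7 threshold is the one named above, while the data, the function name, and the convention of scoring singleton clusters as 0 are assumptions of this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    n = len(points)
    scores = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = math.dist(points[i], points[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single-cluster partition
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / n


# Hypothetical coordinates: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(toy_points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1, 0, 1])
print(good > 0.7, bad > 0.7)
```

A labeling that matches the geometry clears the threshold, while one that mixes the two groups does not.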

Figures

Figures reproduced from arXiv: 2603.28816 by Joonhyung Bae.

Figure 1: Overview of the ASTRA processing pipeline. Qualitative axis descriptions for 78 institutions are encoded through E5-large-v2 sentence embeddings, quantized via a word-level codebook, and clustered using UMAP and agglomerative clustering. NMF topic modeling and entropy-based boundary analysis are applied post-clustering, and results are served through an interactive web visualization.
… covers institutions fo…

Figure 2: 2D UMAP scatter plots comparing Agglomerative Average (𝑘=10, left) and DBSCAN (𝑘=2, right). Agglomerative clustering yields 10 interpretable groups, while DBSCAN produces only two effective clusters with 27.5% of institutions classified as noise (gray).
… dual-model configurations, a lightweight Sentence-BERT model, and traditional baselines (Word2Vec, TF-IDF); (2) a leave-one-axis-out analysis measuring eac…

Figure 3: Axis contribution analysis (leave-one-axis-out). Each bar represents the change in the respective metric when the named axis is removed.
… confirming the substantial advantage of modern learned embeddings over traditional representations. The codebook size 𝑘=7 is used in the main pipeline; E5-large-v2 with 𝑘=5 attains the sweep maximum (0.845), while 𝑘=7 with Average linkage yields the selected configuratio…

Figure 4: 2D UMAP scatter plot of 78 institutions colored by cluster membership (𝑘=10). (a) Full view showing the spatial separation of Cluster 4; (b) zoomed view of the nine remaining clusters with representative institution labels.
… peer-review, publication, and reproducibility ethos of the ACM community. Cluster 5: Electronic music and media. This cluster groups seven institutions centered on sound, music, and med…

Figure 5: Cluster–topic heatmap. Each cell shows the mean NMF weight of the topic (column) within the cluster (row). Darker cells indicate stronger thematic associations.
… cluster (8), reflecting its extensive programmatic scope as a museum, research center and production house. More broadly, high neighbor-cluster entropy admits two interpretations: (a) cross-pollinator institutions that intentionally span multiple d…

Figure 6: Screenshot of the APESuite Explorer web interface, showing the 2D scatter plot, selected institution detail panel with thematic profile, and similarity links.
… vation & industry cluster (1), while CTM Festival and MUTEK form a distinct electronic music cluster (5), reflecting shared curatorial philosophies rather than organizational format. A codebook-level examination of the transmediale–SXSW pairing shows…

The global landscape of art-technology institutions, including festivals, biennials, research labs, conferences, and hybrid organizations, has grown increasingly diverse, yet systematic frameworks for analyzing their multidimensional characteristics remain scarce. This paper proposes ASTRA (Art-technology Institution Spatial Taxonomy and Relational Analysis), a computational methodology combining an eight-axis conceptual framework (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, and Disciplinary Positioning) with a text-embedding and clustering pipeline to map 78 cultural-technology institutions into a unified analytical space. Each institution is characterized through qualitative descriptions along the eight axes, encoded via E5-large-v2 sentence embeddings and quantized through a word-level codebook into TF-IDF feature vectors. Dimensionality reduction using UMAP, followed by agglomerative clustering (Average linkage, k=10), yields a composite score of 0.825, a silhouette coefficient of 0.803, and a Calinski-Harabasz index of 11196. Non-negative matrix factorization extracts ten latent topics, and a neighbor-cluster entropy measure identifies boundary institutions bridging multiple thematic communities. An interactive React-based tool enables curators, researchers, and policymakers to explore institutional similarities and cross-disciplinary connections. Results reveal coherent groupings such as an art-science hub cluster anchored by ZKM and ArtScience Museum, an innovation and industry cluster including Ars Electronica, transmediale, and Sonar, an ACM academic cluster comprising TEI, DIS, and NIME, and an electronic music cluster including CTM Festival, MUTEK, and Sonic Acts. Code and data: https://github.com/joonhyungbae/astra
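The step from codebook to TF-IDF features can be sketched as follows. How the paper assigns words to codebook entries is not described on this page, so this example assumes each institution has already been reduced to a list of codeword IDs (the `docs` data is hypothetical). It uses a common smoothed IDF variant with L2 normalization, which may differ from the weighting the authors chose.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab_size):
    """TF-IDF over codeword IDs with smoothed IDF and L2-normalized rows."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each codeword once per doc
    idf = [math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in range(vocab_size)]
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in range(vocab_size)]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm else vec)
    return vectors


# Hypothetical codeword sequences for three institutions (assumption).
docs = [[0, 0, 1], [0, 2], [0, 3, 3]]
vectors = tfidf_vectors(docs, vocab_size=4)
for row in vectors:
    print([round(v, 3) for v in row])
```

Codeword 0 occurs in every toy document, so it receives the lowest IDF weight, while codewords unique to one document are weighted up.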

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ASTRA, a pipeline that codes 78 art-technology institutions along eight conceptual axes (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, Disciplinary Positioning), embeds the resulting descriptions with the E5-large-v2 model, applies UMAP dimensionality reduction, and performs average-linkage agglomerative clustering with k=10. It reports strong internal clustering metrics (composite score 0.825, silhouette 0.803, Calinski-Harabasz 11196), extracts topics via NMF, identifies boundary institutions, and provides an interactive React tool for exploration, revealing clusters such as an art-science hub around ZKM and an ACM academic cluster.

Significance. If the central claims hold, the work offers a reproducible computational approach to mapping the diverse landscape of art-technology institutions, facilitating analysis of cross-disciplinary connections for curators, researchers, and policymakers. The public release of code and data on GitHub strengthens the contribution by enabling independent verification and extension of the mappings.

major comments (2)
  1. Section 3 (Qualitative coding of institutions): The qualitative descriptions along the eight axes are generated by the authors without reported inter-rater reliability metrics, multiple independent coders, or validation against institutional self-descriptions or expert review. Because the TF-IDF vectors, UMAP embeddings, and clustering results (including the silhouette coefficient of 0.803) are derived directly from these descriptions, the observed cluster coherence may primarily reflect the consistency of the authors' framing rather than robust, intrinsic structures in the data. This assumption is load-bearing for the claim that the pipeline produces coherent and insightful groupings.
  2. Section 4 (Clustering and validation): No sensitivity analysis is presented for the choice of k=10 or for variations in the axis definitions and descriptions; the high Calinski-Harabasz index of 11196 is reported only for the selected configuration, limiting assessment of robustness to the unsupervised pipeline.
minor comments (2)
  1. Abstract and Section 4: The composite score of 0.825 is mentioned but not defined; please clarify its calculation from the individual metrics (silhouette, Calinski-Harabasz, etc.) in the main text.
  2. Results section and figures: Ensure cluster visualizations include clear legends, axis labels, and institution labels for interpretability; the GitHub link should be repeated in the main text for accessibility.
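For reference when reading the metrics discussed in these comments, the standard textbook definitions of the two named indices are given below. The notation is this note's own: $n$ points are partitioned into $k$ clusters $C_1,\dots,C_k$ with centroids $c_j$ and global centroid $c$, $a(i)$ is the mean distance from point $i$ to its own cluster, and $b(i)$ is the mean distance to the nearest other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{n}\sum_{i=1}^{n} s(i) \in [-1, 1]

\mathrm{CH} =
\frac{\sum_{j=1}^{k} |C_j| \,\lVert c_j - c \rVert^{2} \,/\, (k-1)}
     {\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^{2} \,/\, (n-k)}
```

The silhouette is bounded, whereas the Calinski-Harabasz index has no upper bound, so any composite of the two has to rescale them first. How the paper does this is not stated on this page.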

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments point by point below, outlining the revisions we intend to make to improve methodological transparency and robustness.

  1. Referee: Section 3 (Qualitative coding of institutions): The qualitative descriptions along the eight axes are generated by the authors without reported inter-rater reliability metrics, multiple independent coders, or validation against institutional self-descriptions or expert review. Because the TF-IDF vectors, UMAP embeddings, and clustering results (including the silhouette coefficient of 0.803) are derived directly from these descriptions, the observed cluster coherence may primarily reflect the consistency of the authors' framing rather than robust, intrinsic structures in the data. This assumption is load-bearing for the claim that the pipeline produces coherent and insightful groupings.

    Authors: We recognize the validity of this concern regarding the subjectivity of the qualitative coding. The conceptual axes were derived from an extensive literature review on art-technology institutions, and the descriptions aim to reflect publicly available information about each institution. To strengthen this aspect, we will revise Section 3 to provide greater transparency: including a table or appendix with sample codings for representative institutions across axes, and explicitly discussing the potential influence of author perspective. Additionally, we will conduct and report a sensitivity analysis by generating alternative descriptions for a subset of institutions and re-evaluating the clustering metrics. While we cannot retroactively introduce multiple independent coders for the original dataset, this will help demonstrate that the cluster structures are not overly sensitive to specific phrasings. We will also add a limitations section acknowledging this.

    revision: partial

  2. Referee: Section 4 (Clustering and validation): No sensitivity analysis is presented for the choice of k=10 or for variations in the axis definitions and descriptions; the high Calinski-Harabasz index of 11196 is reported only for the selected configuration, limiting assessment of robustness to the unsupervised pipeline.

    Authors: We agree that presenting sensitivity analyses would better support the robustness of our findings. In the revised manuscript, we will include additional experiments in Section 4: (1) varying the number of clusters k from 6 to 14 and reporting the corresponding silhouette, Calinski-Harabasz, and composite scores to justify k=10; (2) testing variations in the embedding model or slight modifications to axis descriptions to assess impact on the final clusters. These analyses will be presented with tables and figures showing metric stability, thereby addressing the limitation of reporting metrics only for the selected configuration.

    revision: yes
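One way to quantify the "impact on the final clusters" promised in these responses is to compare the cluster assignments from two runs with the adjusted Rand index (ARI). The rebuttal does not say which agreement measure the authors would use, so ARI is an assumption here, and the label lists are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    # Pairs of items co-clustered in both partitions, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical assignments from an original run and two reruns.
original = [0, 0, 1, 1, 2, 2]
relabeled = [5, 5, 7, 7, 9, 9]   # same grouping, different label names
scrambled = [0, 1, 0, 1, 0, 1]   # grouping that cuts across the original

same = adjusted_rand_index(original, relabeled)
different = adjusted_rand_index(original, scrambled)
print(same, different)
```

ARI ignores how clusters are numbered, so it suits comparisons across reruns where cluster IDs are arbitrary. Values near 1 indicate stable assignments, and values near or below 0 indicate agreement no better than chance.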

Circularity Check

0 steps flagged

No significant circularity; clustering derives directly from independent qualitative inputs

The paper defines the eight conceptual axes a priori, generates qualitative descriptions for each of the 78 institutions along those axes, encodes the descriptions with a fixed pre-trained embedding model, and applies standard unsupervised dimensionality reduction and clustering. Cluster validity metrics are computed on the transformed embeddings without any fitted parameter being defined in terms of the resulting clusters, without self-citation chains supporting core claims, and without renaming or smuggling prior results. The pipeline is a straightforward computational mapping of author-provided text inputs; the coherence scores reflect structure within those inputs rather than a definitional loop. This is the normal, non-circular case for descriptive unsupervised analysis.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of the eight invented axes and the assumption that sentence embeddings faithfully encode the qualitative descriptions; no free parameters are fitted to the final clusters beyond the choice of k=10 and the embedding model.

free parameters (2)
  • number of clusters k
    Set to 10 for agglomerative clustering; chosen to produce interpretable groups rather than derived from data.
  • E5-large-v2 embedding model
    Pre-trained model selected for encoding descriptions; its parameters are fixed from prior training.
axioms (2)
  • domain assumption: Sentence embeddings from E5-large-v2 capture semantic distinctions relevant to the eight conceptual axes
    Invoked when converting qualitative descriptions to TF-IDF vectors after quantization.
  • domain assumption: UMAP followed by average-linkage agglomerative clustering yields meaningful partitions of cultural institutions
    Used to justify the reported silhouette and Calinski-Harabasz scores as evidence of coherent groupings.
invented entity (1)
  • Eight conceptual axes (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, Disciplinary Posit… · no independent evidence
    purpose: To provide a multidimensional characterization framework for art-technology institutions
    Newly defined in the paper with no independent prior validation cited; used as the basis for all qualitative coding.

pith-pipeline@v0.9.0 · 5603 in / 1705 out tokens · 62324 ms · 2026-05-14T21:47:21.676224+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Breath1024.lean period8 / 8-tick periodicity echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    An eight-axis conceptual framework (Curatorial Philosophy, Territorial Relation, Knowledge Production Mode, Institutional Genealogy, Temporal Orientation, Ecosystem Function, Audience Relation, and Disciplinary Positioning) ... agglomerative clustering (Average linkage, k=10)

  • IndisputableMonolith/Cost/FunctionalEquation.lean Jcost / washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    E5-large-v2 sentence embeddings ... word-level codebook ... TF-IDF feature vectors ... composite score of 0.825, silhouette 0.803

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Author Profile

Joonhyung Bae received the B.F.A. degree in Art & Design from Korea University, Seoul, South Korea, in 2019, and the M.S. degree in Cul- ...