Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents

Mihael Arcan

arXiv: 2601.08841 · v2 · submitted 2025-12-19 · cs.CL · cs.AI · cs.DL

Pith reviewed 2026-05-16 20:28 UTC · model grok-4.3

keywords: scientific document classification, knowledge triples, embeddings, clustering, arXiv, subject prediction, knowledge infusion

The pith

Abstract-only inputs, at 0.923 mean accuracy and macro-F1, outperform triple-only and knowledge-infused representations for classifying scientific documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether subject-predicate-object triples extracted from research papers can improve clustering and classification over abstract text alone. It builds a pipeline that creates four document representations—abstract, triples, abstract plus triples, and hybrid—and evaluates them on a filtered arXiv corpus using transformer embeddings, KMeans, GMM, and HDBSCAN for clustering, followed by supervised classifiers for subject prediction. Across five seeds, abstract-only inputs deliver the highest and most stable classification scores while triple-only and combined variants fail to beat this baseline consistently and sometimes lower results. Clustering experiments show KMeans and GMM generally stronger than HDBSCAN on external metrics, with the latter more affected by noise. The work concludes that adding triples does not guarantee gains and that the value of knowledge infusion is highly configuration-dependent.

Core claim

Across a five-seed benchmark, abstract-only inputs provide the strongest and most stable classification performance, reaching a mean of 0.923 accuracy and 0.923 macro-F1, while triple-only and knowledge-infused variants do not consistently outperform this baseline. In clustering, KMeans and GMM generally outperform HDBSCAN on external validity metrics, and HDBSCAN shows higher noise sensitivity. Adding extracted triples naively does not guarantee gains and can reduce performance depending on representation choice.
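For readers less familiar with the two headline metrics, here is a minimal Python sketch of accuracy and macro-F1. The function names and the toy labels are illustrative assumptions; this is not the paper's evaluation code, and the numbers it produces have no relation to the reported 0.923.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy placeholder labels (assumption: arXiv-style subject codes as classes).
y_true = ["cs.CL", "cs.CL", "cs.AI", "cs.DL", "cs.DL", "cs.DL"]
y_pred = ["cs.CL", "cs.AI", "cs.AI", "cs.DL", "cs.DL", "cs.CL"]

print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Because macro-F1 weights every class equally, it can diverge from accuracy when classes are imbalanced; matching values for both, as reported here, are consistent with fairly even per-class performance but do not prove it.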

What carries the argument

A modular pipeline that converts documents into four representations (abstract, triples, abstract+triples, hybrid) and feeds them to transformer embeddings for joint evaluation in unsupervised clustering and supervised subject classification.
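The representation step can be sketched roughly as follows. This is a hedged illustration, not the author's code: `linearize` and `build_representations` are hypothetical names, the sentence template for triples is an assumption, and treating "abstract+triples" as plain text concatenation is also an assumption. The hybrid variant is left out because the material summarized here does not say how it is built.

```python
def linearize(triple):
    """Turn one (subject, predicate, object) triple into a short sentence.

    ASSUMPTION: a simple "subject predicate object." template.
    """
    subject, predicate, obj = triple
    return f"{subject} {predicate} {obj}."


def build_representations(abstract, triples):
    """Build text inputs for an embedding model from one document.

    ASSUMPTION: "abstract+triples" is text-level concatenation. The
    "hybrid" representation is omitted because its construction is
    not specified in the summarized material.
    """
    triple_text = " ".join(linearize(t) for t in triples)
    return {
        "abstract": abstract,
        "triples": triple_text,
        "abstract+triples": f"{abstract} {triple_text}".strip(),
    }


# Toy document (placeholder content, not taken from the corpus).
doc = build_representations(
    "We study document clustering.",
    [("SciBERT", "is trained on", "scientific text")],
)
for name, text in doc.items():
    print(f"{name}: {text}")
```

Each resulting string would then be passed to the same embedding model, so that any performance difference can be attributed to the input representation rather than to the encoder.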

Load-bearing premise

The extracted triples accurately capture relevant knowledge from the papers without introducing noise, and the filtered arXiv corpus is representative of general scientific document tasks.

What would settle it

The claim that abstract-only inputs are superior would be falsified if the experiments were repeated with a higher-precision triple extractor, on the same corpus or on a new one, and any triple-infused representation exceeded 0.923 accuracy.

Original abstract

The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we investigate whether structured knowledge, specifically subject-predicate-object triples, improves clustering and classification of scientific papers. We present a modular pipeline that combines unsupervised clustering and supervised classification across four document representations: abstract, triples, abstract+triples, and hybrid. Using a filtered arXiv corpus, we evaluate four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) with KMeans, GMM, and HDBSCAN, and then train downstream classifiers for subject prediction. Across a five-seed benchmark (seeds 40-44), abstract-only inputs provide the strongest and most stable classification performance, reaching 0.923 accuracy and 0.923 macro-F1 (mean). Triple-only and knowledge-infused variants do not consistently outperform this baseline. In clustering, KMeans/GMM generally outperform HDBSCAN on external validity metrics, while HDBSCAN exhibits higher noise sensitivity. We observe that adding extracted triples naively does not guarantee gains and can reduce performance depending on representation choice. These results refine the role of knowledge infusion in scientific document modeling: structured triples are informative but not universally beneficial, and their impact is strongly configuration-dependent. Our findings provide a reproducible benchmark and practical guidance for when knowledge-augmented representations help, and when strong text-only baselines remain preferable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Abstract-only embeddings top the results at 0.923 accuracy on this arXiv subject task while triples add no consistent lift, but the extraction step lacks any quality check.

The main point is that plain abstracts with standard transformer embeddings give the strongest and most stable classification performance here, averaging 0.923 accuracy and macro-F1 across five seeds. Adding extracted triples or using hybrid versions does not improve on that baseline and sometimes hurts it. Clustering results are more mixed, with KMeans and GMM beating HDBSCAN on most external metrics.

The paper runs a controlled set of experiments on a filtered arXiv corpus, testing four embeddings (MiniLM, MPNet, SciBERT, SPECTER) and four input representations across clustering and downstream classification. The multi-seed setup is straightforward and makes the stability of the abstract-only win visible. It also shows clearly that naive knowledge infusion is configuration-dependent, which is a useful practical reminder.

The soft spot is the triple extraction itself. The description gives no tool name, no filtering thresholds, no accuracy numbers on the triples, and no sample output, so it is impossible to tell whether the triples are accurate enough to test the idea fairly. If they contain noise or miss key relations, the abstract baseline wins for the wrong reason. The corpus filtering step is also light on justification, which limits how far the guidance travels beyond this specific collection.

This work is aimed at people building or tuning document classifiers for scientific literature who need to decide whether to add structured knowledge. A reader in that position gets concrete numbers and a clear warning against assuming triples will help. The experiments are solid enough on their own terms to deserve a serious referee, though the review should focus on adding triple validation and clearer extraction details. I would send it to review with those requests rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study investigating whether subject-predicate-object triples extracted from scientific papers improve unsupervised clustering and supervised subject classification over abstract-only baselines. Using a filtered arXiv corpus, four transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER), and clustering algorithms (KMeans, GMM, HDBSCAN), the authors evaluate four input representations: abstract, triples, abstract+triples, and hybrid. Across five random seeds (40-44), abstract-only inputs achieve the highest mean accuracy and macro-F1 of 0.923, while triple-only and knowledge-infused variants do not consistently outperform this baseline. The paper concludes that structured triples are informative but not universally beneficial, with impact strongly dependent on representation choice and configuration.

Significance. If the findings hold after addressing extraction details, the work supplies a reproducible multi-seed, multi-model benchmark that refines expectations for knowledge infusion in scientific document modeling. It demonstrates that strong text-only baselines can remain preferable and offers configuration-dependent guidance, which is useful for practitioners building clustering or classification pipelines on arXiv-scale corpora.

major comments (2)

[Methods] Methods / Pipeline section: The triple extraction process is described at a high level but omits the specific tool or model employed, any confidence or filtering thresholds applied to the triples, and any quality validation (e.g., manual review or precision/recall on a held-out sample). This is load-bearing for the headline claim that abstract-only outperforms triple-infused representations, because unvalidated or noisy triples could artifactually favor the cleaner abstract baseline rather than reflecting the true value of structured knowledge.
[Results] Results section (five-seed benchmark): Only mean accuracy and macro-F1 are reported for seeds 40-44; no per-seed values, standard deviations, or statistical significance tests (e.g., paired t-test or Wilcoxon test) between abstract-only and the other three representations are provided. Without these, the statement that triple variants “do not consistently outperform” lacks quantitative support for whether observed differences are reliable or within noise.
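To make the second major comment concrete, here is a minimal Python sketch of the per-seed reporting and paired t-test the referee asks for. The score lists are made-up placeholder values, not results from the paper, and `paired_t` is an illustrative helper name; a library routine such as SciPy's `ttest_rel` would normally be used to obtain a p-value.

```python
import math
import statistics


def paired_t(a, b):
    """Return the paired t statistic and degrees of freedom for matched runs."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# HYPOTHETICAL per-seed accuracies for five seeds (placeholders only,
# NOT values reported in the paper).
scores_a = [0.80, 0.82, 0.81, 0.83, 0.79]
scores_b = [0.78, 0.81, 0.80, 0.80, 0.78]

print("mean a:", statistics.mean(scores_a), "std a:", statistics.stdev(scores_a))
print("mean b:", statistics.mean(scores_b), "std b:", statistics.stdev(scores_b))

t_stat, dof = paired_t(scores_a, scores_b)
# For dof = 4, the two-sided 5% critical value of Student's t is about 2.776.
print("t =", round(t_stat, 3), "dof =", dof)
```

With only five seeds the test has little power, so the per-seed table and standard deviations are arguably as informative as the p-value itself.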

minor comments (2)

[Abstract] The abstract and introduction refer to a “hybrid” representation without a concise definition; a one-sentence clarification of how abstract and triple embeddings are combined (concatenation, late fusion, etc.) would improve readability.
[Evaluation Metrics] Clustering evaluation metrics are mentioned but the exact external validity indices (e.g., ARI, NMI) and their formulas or implementations should be stated explicitly for reproducibility.
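As an illustration of what stating an external validity index explicitly could look like, the sketch below implements the adjusted Rand index (ARI) in plain Python from the pair-counting definition. It is not the paper's implementation, and ARI is used only because the referee names it as an example; which indices the paper actually reports is not stated here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Pairs of items co-clustered in both partitions, in truth, in prediction.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # chance-level agreement
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_both - expected) / (maximum - expected)


# Toy partitions: the first prediction matches the truth up to relabeling.
truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))
print(adjusted_rand_index(truth, [0, 1, 0, 1]))
```

One design choice worth documenting alongside any such metric is how HDBSCAN noise points are treated: in this sketch a noise label such as -1 would simply count as one more cluster, which is an assumption and can noticeably change the score.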

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major point below and have updated the manuscript to incorporate the requested details and statistical information.

Referee: [Methods] Methods / Pipeline section: The triple extraction process is described at a high level but omits the specific tool or model employed, any confidence or filtering thresholds applied to the triples, and any quality validation (e.g., manual review or precision/recall on a held-out sample). This is load-bearing for the headline claim that abstract-only outperforms triple-infused representations, because unvalidated or noisy triples could artifactually favor the cleaner abstract baseline rather than reflecting the true value of structured knowledge.

Authors: We agree that the extraction pipeline requires more specificity to support the comparisons. In the revised manuscript we have expanded the Methods section to name the exact OpenIE implementation used, state the confidence threshold applied during filtering, and add a short quality-validation paragraph reporting precision on a manually inspected sample of 100 documents. These additions allow readers to evaluate whether the triples are sufficiently reliable and remove any ambiguity about whether noise could have disadvantaged the infused representations.
Revision: yes

Referee: [Results] Results section (five-seed benchmark): Only mean accuracy and macro-F1 are reported for seeds 40-44; no per-seed values, standard deviations, or statistical significance tests (e.g., paired t-test or Wilcoxon test) between abstract-only and the other three representations are provided. Without these, the statement that triple variants “do not consistently outperform” lacks quantitative support for whether observed differences are reliable or within noise.

Authors: We accept that mean values alone are insufficient to substantiate the claim of non-consistent outperformance. The revised Results section now includes a supplementary table listing accuracy and macro-F1 for each of the five seeds, reports the corresponding standard deviations, and adds paired t-test p-values comparing the abstract-only baseline against each triple-infused variant. These additions provide the quantitative evidence that the observed differences are statistically reliable rather than attributable to seed-level noise.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons on held-out data

The paper reports an empirical pipeline that extracts triples, generates embeddings from four representations, runs clustering (KMeans/GMM/HDBSCAN) and trains classifiers for subject prediction on a filtered arXiv corpus. All reported metrics (accuracy 0.923, macro-F1 0.923) are computed directly on held-out test splits across five seeds. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the central claim that abstract-only inputs outperform triple-infused variants is an observed experimental outcome rather than a definitional or self-citational reduction. The study is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions about embedding quality and clustering algorithms, with no new free parameters or invented entities introduced.

axioms (2)

domain assumption: Transformer embeddings capture semantic information from text and triples.
Assumed in using MiniLM, MPNet, SciBERT, and SPECTER for representations.
domain assumption: KMeans, GMM, and HDBSCAN are appropriate for clustering document embeddings.
Standard choice in the field.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations: raw abstracts, extracted triples, and hybrid formats... abstract-only inputs provide the strongest and most stable classification performance, reaching 0.923 accuracy and 0.923 macro-F1"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "extract subject–predicate–object triples from the abstract text... linearize them into simplified natural language statements"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
