
arxiv: cond-mat/0109006 · v1 · submitted 2001-09-01 · ❄️ cond-mat.dis-nn

Resampling methods for document clustering

classification ❄️ cond-mat.dis-nn
keywords: clustering, algorithms, words, applied, categorization, dictionary, different, discriminative

We compare the performance of different clustering algorithms applied to the task of unsupervised text categorization. We consider agglomerative clustering algorithms, principal direction divisive partitioning, and (for the first time) superparamagnetic clustering, each with several distance measures. The algorithms have been applied to test databases extracted from the Reuters-21578 text categorization test database. We find that a straightforward application of the different clustering algorithms yields clustering solutions of comparable quality. To improve the clustering results considerably, it is crucial to reduce the dictionary of words used to represent the documents. Significant improvements in clustering quality can be obtained by identifying discriminative words and filtering indiscriminative words out of the dictionary. We present two methods, each based on a resampling scheme, for selecting discriminative words in an unsupervised way.
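
The abstract does not spell out the two resampling schemes, so the following is only a minimal sketch of one way resampling could be used to pick discriminative words without labels. It is not claimed to be either of the paper's methods. All names (`cosine_distance`, `agglomerate`, `word_scores`), the toy documents, the average-linkage choice, the two-cluster setting, and the 0.5 threshold are assumptions made for illustration. The idea: repeatedly cluster random subsets of the documents, and score each word by how strongly its document frequency differs between the resulting clusters, averaged over all resamples.

```python
import math
import random
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

def agglomerate(vectors, k=2):
    """Average-linkage agglomerative clustering; returns k lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                d = sum(cosine_distance(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

def word_scores(docs, n_resamples=20, sample_frac=0.8, seed=0):
    """Average, over resamples, of |doc-frequency gap| between two clusters."""
    rng = random.Random(seed)
    vectors = [Counter(d) for d in docs]
    scores = Counter()
    size = max(2, int(sample_frac * len(docs)))
    for _ in range(n_resamples):
        sample = rng.sample(range(len(docs)), size)
        a, b = agglomerate([vectors[i] for i in sample], k=2)
        for w in set().union(*(vectors[i] for i in sample)):
            fa = sum(w in vectors[sample[i]] for i in a) / len(a)
            fb = sum(w in vectors[sample[i]] for i in b) / len(b)
            scores[w] += abs(fa - fb) / n_resamples
    return scores

# Toy corpus (hypothetical): two topics, with "the" occurring in every document.
docs = [
    "oil price the market".split(),
    "oil barrel the price".split(),
    "oil export the barrel".split(),
    "wheat crop the harvest".split(),
    "wheat grain the crop".split(),
    "wheat harvest the grain".split(),
]
scores = word_scores(docs)
# Reduced dictionary: keep words whose score clears an arbitrary threshold.
keep = {w for w, s in scores.items() if s >= 0.5}
```

In this toy setting a word that appears in every document, such as "the", receives a score of zero and is filtered out, while topic-bearing words such as "oil" score high. Re-clustering the documents using only the words in `keep` would correspond to the dictionary-reduction step the abstract describes.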
