SimDoc: Topic Sequence Alignment based Document Similarity Framework

Gaurav Maheshwari; Harshita Sahijwani; Jens Lehmann; Kunal Jha; Priyansh Trivedi; Sourish Dasgupta

arxiv: 1611.04822 · v2 · pith:IV5ZMSJ6new · submitted 2016-11-15 · 💻 cs.CL

Gaurav Maheshwari, Priyansh Trivedi, Harshita Sahijwani, Kunal Jha, Sourish Dasgupta, Jens Lehmann

classification 💻 cs.CL

keywords document · similarity · semantic · simdoc · alignment · clustering · documents · estimating

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that a document's thematic flow, which is often disregarded by bag-of-words techniques, is pivotal in estimating its similarity to other documents. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques in accurately computing document similarity, and on practical applications such as document clustering.
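
The idea of aligning topic sequences can be illustrated with a minimal sketch. The abstract does not name the alignment algorithm, the gap penalty, or the normalization, so everything below is an assumption made for illustration: the sketch uses Smith-Waterman local alignment as a stand-in, a hypothetical hand-written topic-topic similarity matrix `TOPIC_SIM` with values in [-1, 1], and a hypothetical `similarity` helper that divides by the shorter sequence length. Topic IDs are assumed to come from an upstream topic model that is not shown.

```python
from typing import Sequence

# Hypothetical topic-topic similarity matrix for 3 topics (illustrative values only).
# Diagonal entries are 1.0; negative entries penalize aligning unrelated topics.
TOPIC_SIM = [
    [1.0, 0.5, -1.0],
    [0.5, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]


def align_score(a: Sequence[int], b: Sequence[int],
                sim: Sequence[Sequence[float]], gap: float = 0.5) -> float:
    """Smith-Waterman local alignment score between two topic-ID sequences."""
    prev = [0.0] * (len(b) + 1)  # previous DP row
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            match = prev[j - 1] + sim[a[i - 1]][b[j - 1]]  # align a[i-1] with b[j-1]
            skip_a = prev[j] - gap                         # gap in b
            skip_b = curr[j - 1] - gap                     # gap in a
            curr[j] = max(0.0, match, skip_a, skip_b)      # local alignment floors at 0
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a: Sequence[int], b: Sequence[int],
               sim: Sequence[Sequence[float]], gap: float = 0.5) -> float:
    """Alignment score scaled to [0, 1], assuming sim values never exceed 1.0."""
    if not a or not b:
        return 0.0
    return align_score(a, b, sim, gap) / min(len(a), len(b))


if __name__ == "__main__":
    doc_x = [0, 1, 2]  # topic sequence of one document
    doc_y = [2, 1, 0]  # same topics, reversed thematic flow
    print(similarity(doc_x, doc_x, TOPIC_SIM))
    print(similarity(doc_x, doc_y, TOPIC_SIM))
```

In this toy setup the two documents contain the same topics, so a bag-of-words comparison over topics would treat them as identical, while the alignment score drops because the order of topics differs.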
