Dealing with Sparse Document and Topic Representations: Lab Report for CHiC 2012

Andias Wira-Alam; Daniel Hienert; Frank Sawitzki; Philipp Schaer; Thomas Lüke

arxiv: 1208.3952 · v1 · pith:6LPUSM7Enew · submitted 2012-08-20 · 💻 cs.IR

classification 💻 cs.IR

keywords document · sparse · chic · description · europeana · first · report · representations

We report on the participation of GESIS in the first CHiC workshop (Cultural Heritage in CLEF). Because the workshop was held for the first time, no prior experience existed with the new data set, a document dump of Europeana with ca. 23 million documents. The most prominent issues that arose from pretests with this test collection were the very unspecific topics and the sparse document representations. Only about half of the topics (26/50) contained a description, and the titles were usually only about two words long. We therefore focused on three different term suggestion and query expansion mechanisms to compensate for the sparse topic descriptions: two methods that build on concept extraction from Wikipedia and one method that applies co-occurrence statistics to the available Europeana corpus. In this paper we present the approaches and preliminary results from their assessments.
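The abstract does not say which co-occurrence measure was used, so the following is only a minimal sketch of how co-occurrence-based term suggestion can expand a short topic title. It assumes document-level co-occurrence scored with the Jaccard index; the function names (`build_cooccurrence`, `suggest_terms`, `expand_query`) and the toy documents are hypothetical and are not taken from the paper or from Europeana.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs):
    """Count document frequency per term and per unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for text in docs:
        # Each term counts once per document; sorting gives canonical pair keys.
        terms = sorted(set(text.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return term_df, pair_df


def suggest_terms(query, term_df, pair_df, k=3):
    """Rank candidate terms by their best Jaccard score with any query term.

    Assumption: Jaccard is used here for illustration only.
    """
    query_terms = set(query.lower().split())
    scores = {}
    for (a, b), both in pair_df.items():
        for q, cand in ((a, b), (b, a)):
            if q in query_terms and cand not in query_terms:
                jaccard = both / (term_df[q] + term_df[cand] - both)
                scores[cand] = max(scores.get(cand, 0.0), jaccard)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def expand_query(query, term_df, pair_df, k=3):
    """Append the top-k suggested terms to the original query terms."""
    return query.split() + suggest_terms(query, term_df, pair_df, k)


# Toy stand-in for sparse metadata records (hypothetical data).
docs = [
    "roman coin silver",
    "roman coin bronze",
    "roman mosaic floor",
    "medieval coin silver",
]
term_df, pair_df = build_cooccurrence(docs)
print(expand_query("coin", term_df, pair_df, k=3))
```

With a one-word title such as "coin", the sketch adds the terms that most often share a record with it, which is the general idea behind using corpus statistics to enrich a two-word topic.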
