A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

Jamie Callan; Keyang Xu; Kyle Yingkai Gao

arxiv: 1804.02734 · v1 · pith:OBIVX6GTnew · submitted 2018-04-08 · 💻 cs.IR

classification 💻 cs.IR

keywords crawling, efficiently, media, social, source, unsupervised, crawl, describes

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe operates in two phases. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently and stays more focused on user-created content than baseline methods.
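
The two phases described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how structural similarity is measured or how the navigation table is represented. The sketch assumes Jaccard similarity over sets of HTML tag paths, a greedy single-pass clustering with an arbitrary threshold of 0.6, and a table of link counts between clusters. All names (`structure_signature`, `cluster_pages`, `build_navigation_table`, `cluster_yield`) are hypothetical.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in a page."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass


def structure_signature(html):
    """Assumed structural fingerprint: the set of tag paths in the page."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Learning phase, step 1 (assumed): greedy clustering by structure.

    pages maps url -> html. Returns url -> cluster id.
    """
    representatives = []  # signature of the first page in each cluster
    assignment = {}
    for url, html in pages.items():
        sig = structure_signature(html)
        for cid, rep in enumerate(representatives):
            if jaccard(sig, rep) >= threshold:
                assignment[url] = cid
                break
        else:
            representatives.append(sig)
            assignment[url] = len(representatives) - 1
    return assignment


def build_navigation_table(links, assignment):
    """Learning phase, step 2 (assumed): count links between clusters.

    links is an iterable of (source url, destination url) pairs.
    """
    table = {}
    for src, dst in links:
        if src in assignment and dst in assignment:
            row = table.setdefault(assignment[src], {})
            row[assignment[dst]] = row.get(assignment[dst], 0) + 1
    return table


def cluster_yield(table, src_cluster, target_clusters):
    """Harvesting phase (assumed policy): fraction of a cluster's out-links
    that lead to clusters the crawler wants, e.g. user-created content."""
    out = table.get(src_cluster, {})
    total = sum(out.values())
    if total == 0:
        return 0.0
    hits = sum(n for dst, n in out.items() if dst in target_clusters)
    return hits / total
```

A harvesting loop built on this sketch would prefer to expand pages whose cluster has a high `cluster_yield` for the target clusters, which is one simple way a navigation table could guide the choice of links to crawl next.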
