
arxiv: 1804.02734 · v1 · pith:OBIVX6GTnew · submitted 2018-04-08 · 💻 cs.IR

A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

keywords: crawling, efficiently, media, social, source, unsupervised, crawl, describes

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structure to learn how to crawl a social media site efficiently. SOUrCe operates in two stages. In the unsupervised learning stage, it constructs a sitemap that clusters pages by structural similarity and generates a navigation table describing how the different types of pages in the site link to one another. In the harvesting stage, it uses the navigation table and a crawling policy to choose which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently and stays more focused on user-created content than baseline methods.
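The two-stage idea in the abstract can be illustrated with a small sketch. The abstract does not specify SOUrCe's structural features, clustering algorithm, or crawling policy, so everything below is an assumption made for illustration: pages are represented as sets of hypothetical tag paths, clustering is a greedy Jaccard-threshold pass, the navigation table is a count of links between clusters, and the policy ranks frontier links by how often links from the same source cluster led to a target cluster in the sample.

```python
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets of tag paths."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_pages(signatures, threshold=0.5):
    """Greedy one-pass clustering (an assumed stand-in, not the paper's method).

    A page joins the first cluster whose representative signature is
    similar enough; otherwise it starts a new cluster.
    """
    representatives, labels = [], {}
    for url, sig in signatures.items():
        for cid, rep in enumerate(representatives):
            if jaccard(sig, rep) >= threshold:
                labels[url] = cid
                break
        else:
            representatives.append(sig)
            labels[url] = len(representatives) - 1
    return labels


def navigation_table(links, labels):
    """Count sampled links between clusters: table[src][dst] = count."""
    table = defaultdict(Counter)
    for src, dsts in links.items():
        for dst in dsts:
            if src in labels and dst in labels:
                table[labels[src]][labels[dst]] += 1
    return table


def link_score(table, src_cluster, targets):
    """Fraction of sampled links from src_cluster that reached a target cluster."""
    row = table.get(src_cluster, Counter())
    total = sum(row.values())
    return sum(row[t] for t in targets) / total if total else 0.0


def rank_frontier(frontier, table, targets):
    """Order (url, source_cluster) pairs by descending link score."""
    return sorted(frontier, key=lambda item: -link_score(table, item[1], targets))


# Hypothetical sample of a small forum site (learning stage input).
signatures = {
    "/forum": {"html/body/h1", "html/body/ul/li/a"},
    "/forum?page=2": {"html/body/h1", "html/body/ul/li/a"},
    "/thread/1": {"html/body/h1", "html/body/div.post/p", "html/body/div.post/span"},
    "/thread/2": {"html/body/h1", "html/body/div.post/p", "html/body/div.post/span"},
    "/about": {"html/body/p"},
}
links = {
    "/forum": ["/thread/1", "/thread/2", "/forum?page=2", "/about"],
    "/thread/1": ["/forum", "/about"],
    "/about": ["/forum"],
}

labels = cluster_pages(signatures)
table = navigation_table(links, labels)

# Harvesting stage: prefer links found on pages whose cluster tends to
# link to thread pages (treated here as the user-created content).
thread_cluster = labels["/thread/1"]
frontier = [("/thread/3", labels["/forum"]), ("/contact", labels["/about"])]
ranked = rank_frontier(frontier, table, {thread_cluster})
```

In this toy sample, links found on forum-index pages rank ahead of links found on the "about" page, because half of the sampled index-page links led to thread pages. A real crawler would also need fetching, politeness controls, and a richer structural representation than a flat set of tag paths.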
