Automatic Detection of Webpages that Share the Same Web Template

Julián Alarte (Universitat Politècnica de València, Valencia, Spain); David Insa (Universitat Politècnica de València, Valencia, Spain); Josep Silva (Universitat Politècnica de València, Valencia, Spain); Salvador Tamarit (Universidad Politécnica de Madrid, Madrid, Spain)

arxiv: 1409.2590 · v1 · pith:5ZML44CAnew · submitted 2014-09-09 · 💻 cs.IR

classification 💻 cs.IR

keywords template · webpages · extraction · detection · identifying · same · analysis · analyze

Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpage development, content extraction, block detection, and webpage indexing. One of the main goals of template extraction is to identify a set of webpages that share the same template without having to load and analyze too many webpages before the template itself is identified. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. This set is computed with a hyperlink analysis that yields a very small set with a high level of confidence.
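
The abstract does not describe the hyperlink analysis in detail, so the following is only a minimal sketch of the most basic step such an analysis might start from: collecting the distinct same-site links of one page as candidate webpages. It is not the authors' algorithm. The names `LinkCollector` and `same_site_candidates`, the `limit` parameter, and the rule "same host means same site" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_candidates(page_url, html, limit=5):
    """Return up to `limit` distinct same-host URLs linked from `html`.

    Hypothetical helper: a real template detector would rank or filter
    these candidates further; here they are kept in document order.
    """
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    candidates = []
    for href in collector.hrefs:
        if len(candidates) >= limit:
            break
        # Resolve relative links and drop any fragment identifier.
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc != host:
            continue  # external link (assumed unlikely to share the template)
        if absolute == page_url or absolute in candidates:
            continue  # skip self-links and duplicates
        candidates.append(absolute)
    return candidates
```

Called with the URL and HTML of a starting page, the function returns a short list of same-host URLs; keeping `limit` small mirrors the abstract's goal of loading only a few webpages.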
