Detección de Web Spam basada en la Recuperación Automática de Enlaces [Web Spam Detection Based on Automatic Link Recovery]

  1. Araujo, Lourdes
  2. Martínez Romo, Juan
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2009

Issue: 42

Pages: 39-46

Type: Article

Other publications in: Procesamiento del lenguaje natural

Abstract

Web spam is today a contest between search engines, which try to ensure that results are relevant to the user, and a community that tries to mislead them in order to attract users to its own pages. In this work, we present a preliminary study of several features that can be useful for building a novel web spam detection system. Some of these features are obtained from a system for the automatic recovery of broken Web links. This system uses several sources of information from the analyzed page to extract useful data, which are then used to query a typical search engine, such as Google or Yahoo!. The retrieved pages are then ranked by their content using information retrieval techniques. Finally, the degree of link recovery is used, along with other features, as an indicator of spam.
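
The ranking and link-recovery steps described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which information retrieval technique is used, so plain term-frequency cosine similarity stands in for it; the function names, the 0.3 threshold, and the definition of the recovery degree as a simple fraction are all hypothetical. The search engine query is left out, and candidate pages are passed in as already retrieved text.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a lowercased bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(context_text, candidates):
    """Order candidate pages (url -> text) by similarity to the link context.

    Returns (score, url) pairs, best match first. Cosine similarity is an
    assumed stand-in for the unspecified ranking technique.
    """
    ctx = term_vector(context_text)
    scored = [(cosine(ctx, term_vector(text)), url)
              for url, text in candidates.items()]
    return sorted(scored, reverse=True)


def recovery_degree(links, threshold=0.3):
    """Fraction of links whose best candidate reaches the threshold.

    `links` is a list of (context_text, candidates) pairs. Both this
    definition and the threshold value are hypothetical.
    """
    if not links:
        return 0.0
    recovered = 0
    for context_text, candidates in links:
        ranked = rank_candidates(context_text, candidates)
        if ranked and ranked[0][0] >= threshold:
            recovered += 1
    return recovered / len(links)
```

In such a sketch, the value returned by `recovery_degree` would be one numeric feature among several passed to a spam classifier; how the feature relates to spam is left to the study itself.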