Information search and similarity based on Web 2.0 and semantic technologies

  1. Fuentes Lorenzo, Damaris
supervised by:
  1. Norberto Fernández García Supervisor
  2. Luis Sánchez Fernández Supervisor

University of defense: Universidad Carlos III de Madrid

Date of defense: 26 May 2015

Committee:
  1. Asunción Gómez Pérez Chair
  2. Mario Muñoz Organero Secretary
  3. Anselmo Peñas Padilla Member

Type: Thesis

Abstract

The World Wide Web puts a huge amount of information, described in natural language, at society’s disposal. Web search engines were born from the need to find a particular piece of that information. Their ease of use and their utility have made these engines one of the most used web tools on a daily basis. To make a query, users just have to enter a set of words (keywords) in natural language, and the engine answers with an ordered list of resources that contain those words. The order is given by ranking algorithms. These algorithms basically use two types of features: dynamic and static factors. The dynamic factor takes the query into account; that is, documents that contain the keywords used to describe the query are more relevant for that query. The hyperlink structure among documents is an example of a static factor used by most current algorithms. For example, if most documents link to a particular document, that document may be more relevant than others because it is more popular. Even though there is currently wide consensus on the good results that the majority of web search engines provide, these tools still suffer from some limitations, basically 1) the loneliness of the searching activity itself; and 2) the simple retrieval process, based mainly on offering the documents that contain the exact terms used to describe the query. Considering the first problem, searching for relevant information on the World Wide Web is undoubtedly a lonely and time-consuming process. Thousands of users repeat previously executed queries, spending time deciding which documents are relevant and which are not; these decisions may already have been made by others and could serve other users issuing similar or identical queries.
Considering the second problem, the textual nature of the current Web makes the reasoning capability of web search engines quite restricted; queries and web resources are described in natural language, which in some cases can lead to ambiguity or other semantics-related difficulties. Computers do not understand text; however, if semantics is incorporated into the text, meaning and sense are incorporated too. This way, queries and web resources are no longer mere sets of terms, but lists of well-defined concepts. This thesis proposes a semantic layer, known as Itaca, which combines simplicity and effectiveness in order to endow with semantics both the resources stored on the World Wide Web and the queries users issue to find those resources. This is achieved through collaborative annotations and relevance feedback made by the users themselves, which describe both the queries and the web resources by means of Wikipedia concepts. Itaca extends the functional capabilities of current web search engines, providing a new ranking algorithm without dispensing with traditional ranking models. Experiments show that this new architecture offers more precision in the final results obtained, while keeping the simplicity and usability of existing web search engines. Its design as a layer makes it feasible to include it in current engines in a simple way.
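
The idea of layering a concept-based score on top of a traditional ranking can be illustrated with a minimal sketch. It is not the Itaca algorithm, whose details the abstract does not give. The scoring functions, the weights, the `Resource` fields and the sample concept identifiers are all assumptions made for illustration: the dynamic factor is approximated by keyword overlap, the static factor by a normalized in-link count, and the semantic factor by the Jaccard overlap between the concepts attached to the query and those attached to each resource.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Resource:
    url: str
    text: str
    inlinks: int = 0  # static factor: number of incoming hyperlinks
    # Hypothetical field: Wikipedia concepts attached by user annotations.
    concepts: Set[str] = field(default_factory=set)


def dynamic_score(query_terms: List[str], resource: Resource) -> float:
    """Fraction of query keywords that occur in the resource text."""
    words = set(resource.text.lower().split())
    terms = [t.lower() for t in query_terms]
    if not terms:
        return 0.0
    return sum(t in words for t in terms) / len(terms)


def static_score(resource: Resource, max_inlinks: int) -> float:
    """In-link count normalized to [0, 1]; a crude popularity measure."""
    return resource.inlinks / max_inlinks if max_inlinks else 0.0


def concept_score(query_concepts: Set[str], resource: Resource) -> float:
    """Jaccard overlap between query concepts and resource concepts."""
    union = query_concepts | resource.concepts
    if not union:
        return 0.0
    return len(query_concepts & resource.concepts) / len(union)


def rank(query_terms, query_concepts, resources,
         w_dyn=0.4, w_stat=0.2, w_sem=0.4):
    """Order resources by a weighted sum of the three scores (assumed weights)."""
    max_inlinks = max((r.inlinks for r in resources), default=0)

    def score(r: Resource) -> float:
        return (w_dyn * dynamic_score(query_terms, r)
                + w_stat * static_score(r, max_inlinks)
                + w_sem * concept_score(query_concepts, r))

    return sorted(resources, key=score, reverse=True)


# Illustrative data: an ambiguous keyword query, disambiguated by a concept.
resources = [
    Resource("http://example.org/car", "jaguar top speed review",
             inlinks=90, concepts={"Jaguar_Cars"}),
    Resource("http://example.org/animal", "jaguar running speed habitat",
             inlinks=30, concepts={"Jaguar"}),
]
query_terms = ["jaguar", "speed"]
query_concepts = {"Jaguar"}

with_semantics = rank(query_terms, query_concepts, resources)
keywords_only = rank(query_terms, query_concepts, resources, w_sem=0.0)
```

In this toy setup both pages match every keyword, so the keyword-and-link ranking favors the more popular page, while the concept overlap moves the page annotated with the same concept as the query to the top. Setting `w_sem=0.0` recovers the traditional ordering, which mirrors the abstract's point that the semantic layer complements traditional ranking models rather than replacing them.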