Word sense disambiguation in multilingual contexts

Duque Fernández, Andrés

Supervised by:

Lourdes Araujo (Supervisor)
Juan Martínez Romo (Supervisor)

Defended at: UNED. Universidad Nacional de Educación a Distancia

Defense date: 17 February 2017

Examining committee:

Julio Gonzalo Arroyo (President)
Eneko Agirre Bengoa (Secretary)
Ahmet Aker (Rapporteur)

Type: Thesis

Teseo: 557578

Abstract

Word Sense Disambiguation (WSD) is the process of identifying the sense that a polysemous word, that is, a word with several possible meanings, takes in a particular context within a sentence. It is a key aspect of any Natural Language Processing task, since an automatic system can only understand and work with a text once the meaning of each of its words has been determined without ambiguity. This thesis presents research on WSD in scenarios where information written in different languages can be exploited. We divide the thesis into two lines of study according to the specific WSD task tackled: Cross-Lingual Word Sense Disambiguation, and multilingual Word Sense Disambiguation in the biomedical domain. In the first task, the aim is to find the most suitable translation, in a target language, of an ambiguous term written in a source language (typically English). In the biomedical WSD task, the aim is to find the most suitable sense of a term that can refer to several different biomedical concepts.

To address both tasks we use a novel technique based on co-occurrence graphs, which transforms the unstructured information available in different corpora into a structured knowledge base that is then used to perform the disambiguation itself. This knowledge base is a graph in which nodes represent concepts from a given corpus, and the links between nodes carry information about the statistical significance of their co-occurrence, that is, of both concepts appearing in the same document of the corpus.

In the first task, multilingual information is inherent to the problem, since the objective is to find the most suitable translations of words between languages. To address it, our system uses a co-occurrence graph to represent knowledge in the target language. The contexts of the ambiguous terms, written in the source language, are translated through an automatically created bilingual dictionary and then used as the source of information with which the co-occurrence graph performs the disambiguation step. In this line of research we also present a study of the bilingual dictionaries that can be used for this kind of task.

In the biomedical WSD task, multilinguality serves as additional evidence for testing whether the performance of monolingual systems can be improved. We first adapt our system to tackle the task from a monolingual perspective, in which the co-occurrence graph is built from a corpus written in a single language. We then enhance the graph with information from additional languages and study whether this improves the results. This is pioneering research in the field: we found no similar studies in the literature that use multilingual information for WSD in the biomedical domain.

Throughout the thesis we explored many monolingual and multilingual corpora, both general-purpose and domain-specific (in particular, biomedical). We also studied and compared different algorithms that use the co-occurrence graph as a structured knowledge base for the final disambiguation. The mathematical hypothesis on which the construction of our co-occurrence graph is based was compared with similar techniques and offered better results. Likewise, for each task (Cross-Lingual WSD and biomedical WSD), our system was compared with other state-of-the-art techniques and obtained very competitive results.
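
As an illustration of the kind of co-occurrence graph described in the abstract, the following minimal Python sketch links two concepts only when their co-occurrence in the same documents is statistically significant. The abstract does not give the thesis's actual significance test, so the hypergeometric tail used here, the function names `cooccurrence_pvalue` and `build_graph`, the `threshold` parameter, and the toy corpus are all assumptions made for illustration, not the author's method.

```python
from itertools import combinations
from math import comb  # Python 3.8+


def cooccurrence_pvalue(n_docs, n_a, n_b, k):
    """Probability that two concepts occurring in n_a and n_b of n_docs
    documents share at least k documents purely by chance.
    Assumption: a hypergeometric null model, chosen for illustration."""
    total = comb(n_docs, n_b)
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, j) * comb(n_docs - n_a, n_b - j)
               for j in range(k, upper + 1))
    return tail / total


def build_graph(documents, threshold=0.05):
    """Build an undirected graph {concept: {neighbour: p_value}} that keeps
    only the pairs whose co-occurrence p-value is at most `threshold`."""
    docs = [set(d) for d in documents]
    n_docs = len(docs)
    doc_freq = {}   # number of documents containing each concept
    pair_freq = {}  # number of documents containing both concepts of a pair
    for d in docs:
        for concept in d:
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
        for a, b in combinations(sorted(d), 2):
            pair_freq[(a, b)] = pair_freq.get((a, b), 0) + 1

    graph = {}
    for (a, b), k in pair_freq.items():
        p = cooccurrence_pvalue(n_docs, doc_freq[a], doc_freq[b], k)
        if p <= threshold:
            graph.setdefault(a, {})[b] = p
            graph.setdefault(b, {})[a] = p
    return graph


# Toy corpus (hypothetical): each document is a list of concepts.
documents = [
    ["bank", "money", "loan"],
    ["bank", "money"],
    ["bank", "money", "account"],
    ["river", "water"],
    ["river", "water", "fish"],
    ["river", "fish"],
    ["tree", "leaf"],
    ["sun", "sky"],
]
graph = build_graph(documents)
print(graph)
```

In this toy corpus only the pair "bank" and "money" passes the 0.05 threshold, because the two concepts appear together in all three documents that contain either of them. A disambiguation algorithm could then use such links as evidence, for instance by preferring the candidate sense or translation most strongly connected to the concepts found in the context.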