Seeking robustness in a multilingual world: from pipelines to embeddings

Author:
  1. Doval, Yerai

Supervisors:
  1. Manuel Vilares Ferro (Supervisor)
  2. Jesús Vilares (Co-supervisor)

Defense university: Universidade da Coruña

Date of defense: 17 December 2019

Committee:
  1. Lourdes Araujo (Chair)
  2. Miguel Á. Alonso (Secretary)
  3. Pavel Brazdil (Member)

Type: Thesis

Teseo: 608758 | DIALNET | RUC (open access)

Abstract

In this dissertation, we study two approaches to overcoming the challenges posed by processing the user-generated, non-standard, multilingual text content found on the Web today. First, we present a traditional discrete pipeline approach in which we preprocess the input text so that it can be handled more easily by downstream systems. This means first addressing multilinguality by identifying the language of the input, and then managing the language-specific non-standard writing phenomena involved by means of text normalization and word (re-)segmentation techniques. Second, we analyze the inherent limitations of this type of discrete model, which lead us to an approach centered on continuous word embedding models. In this case, explicit preprocessing of the input is replaced by encoding the linguistic characteristics and other nuances of non-standard texts in the embedding space. We aim to obtain continuous models that not only overcome the limitations of discrete models but also align with the current state of the art in Natural Language Processing (NLP), dominated by systems based on neural networks. The results obtained after extensive experimentation showcase the capability of word embeddings to effectively support the multilingual and non-standard phenomena of user-generated texts. Furthermore, all of this is accomplished within a conceptually simple and modular framework that does not sacrifice system integration: such embedding models can be readily used as fundamental building blocks for the state-of-the-art neural networks employed in virtually any NLP task.
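As a rough illustration of the discrete pipeline approach summarized above, the sketch below chains the three preprocessing steps named in the abstract: language identification, text normalization, and word (re-)segmentation. It is a minimal toy written for this summary; the function names, heuristics, and data are hypothetical placeholders, not the systems developed in the dissertation.

    # Minimal sketch (hypothetical) of a discrete preprocessing pipeline
    # for noisy user-generated text: language ID by stop-word overlap,
    # normalization by lexicon lookup, greedy longest-match segmentation.

    def identify_language(text: str) -> str:
        """Guess the language from stop-word overlap (toy heuristic)."""
        stopwords = {
            "en": {"the", "and", "is", "to", "you"},
            "es": {"el", "la", "y", "es", "que"},
        }
        tokens = set(text.lower().split())
        return max(stopwords, key=lambda lang: len(tokens & stopwords[lang]))

    def normalize(token: str, lexicon: dict[str, str]) -> str:
        """Map a non-standard form to its standard spelling, if known."""
        return lexicon.get(token, token)

    def resegment(text: str, vocabulary: set[str]) -> list[str]:
        """Greedily split unspaced text into the longest known words."""
        words, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in vocabulary or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

    # Example: a noisy, tweet-like English input.
    lexicon = {"u": "you", "gr8": "great"}
    vocabulary = {"you", "think", "this", "is", "great", "see", "later"}
    tweet = "u think this is gr8 seeyoulater"
    lang = identify_language(tweet)  # -> "en"
    tokens = [normalize(t, lexicon) for t in tweet.split()]
    tokens = [w for t in tokens for w in resegment(t, vocabulary)]
    print(lang, tokens)  # en ['you', 'think', 'this', 'is', 'great', 'see', 'you', 'later']

The continuous alternative can be illustrated with a subword-aware embedding model. The snippet below uses gensim's FastText purely as an assumed stand-in for the embedding models studied in the thesis: because word vectors are composed from character n-grams, even an unseen non-standard spelling receives a representation, and the explicit normalization step above becomes unnecessary.

    # Sketch of the continuous approach: a subword-aware model assigns
    # vectors to out-of-vocabulary, non-standard spellings directly.
    from gensim.models import FastText

    corpus = [["see", "you", "later"], ["great", "to", "see", "you"]]
    model = FastText(sentences=corpus, vector_size=32, min_count=1, epochs=20)
    vector = model.wv["gr8t"]  # unseen form still gets a vector
    print(vector.shape)        # (32,)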