Biomedical Information Extraction: Exploring new entities and relationships

  1. Fabregat Marcos, Hermenegildo
Supervised by:
  1. Lourdes Araujo Director
  2. Juan Martínez Romo Director

Defence university: UNED. Universidad Nacional de Educación a Distancia

Defence date: 16 September 2021

Committee:
  1. Isabel Segura Bedmar Chair
  2. Víctor Fresno Fernández Secretary
  3. Arkaitz Zubiaga Committee member
Department: Lenguajes y Sistemas Informáticos
Faculty: Escuela Técnica Superior de Ingeniería Informática

Type: Thesis

e-spacio. Repositorio Institucional de la UNED: Open access

Abstract

The digitization and dissemination of information that society is currently experiencing have increased the amount of available information, especially in the biomedical domain. Because of the effort required to process this volume of information, a research line that has been notably active over the last decade is the exploration of natural language processing and machine learning techniques for extracting information from unstructured documents. These techniques have reached major milestones in the biomedical domain, especially in information extraction tasks such as named entity recognition and relation extraction. In this thesis we present research focused on the automatic analysis of biomedical documents, with particular attention to documents about disabilities and functional impairments. These disorders have a significant social impact, since they affect the daily life of a large part of the population, leading in some cases to serious limitations on the autonomy of the affected people. In addition, several rare diseases are associated with a wide range of disabilities, so disabilities are frequently used to characterize these diseases and can be very useful features for their diagnosis, for which, owing to the nature of these diseases, little information is usually available. The main objective of this thesis is to explore documents from the biomedical domain in order to recognize mentions of disabilities and identify their relationships with rare diseases. Processing these entities involves specific difficulties, such as the lack of a precise formal definition of disability and the wide range of ways to express the same disability. To address this objective, it was necessary to collect and annotate several datasets, including documents written in different languages. 
After generating these resources, we explored entity recognition systems for identifying mentions of rare diseases and disabilities, and studied systems for extracting relationships between disabilities and rare diseases. To deepen the analysis of these entities, we examined the challenges of building automatic systems for disability recognition by proposing an evaluation task. The lessons learned during the evaluation task were used to develop and enhance an automatic disability recognition system based on deep learning techniques. The developed system combines different types of recurrent networks and improved on current state-of-the-art systems. This system also served as an initial architecture for exploring joint entity recognition and relation extraction systems. The study of the synergy between both tasks led to significant improvements. Finally, to explore the effects of negation on information extraction systems, we analyzed several approaches to the automatic processing of negation in Spanish and English documents. During this analysis we examined the performance of proposals for detecting negation triggers and their scopes, obtaining performance improvements over state-of-the-art proposals for the processing of Spanish documents. The results obtained for negation processing also led to improvements in relation extraction and entity recognition.