Recent Advances in Ontology-based Semantic Similarity Measures and Information Content Models based on WordNet

Lastra Díaz, Juan José

Supervised by:

Ana M. García Serrano (thesis supervisor)

Defense university: UNED. Universidad Nacional de Educación a Distancia

Defense date: 30 November 2016

Examination committee:

Julio Gonzalo Arroyo (chair)
David Sánchez Ruenes (secretary)
Sébastien Harispe (member)

Type: Doctoral thesis

Teseo: 553713 DIALNET Open Access editor

Abstract

Human similarity judgments between concepts underlie most cognitive capabilities, such as categorization, memory, decision-making and reasoning. Thus, the proposal of concept similarity models to estimate the degree of similarity between word and concept pairs has been a very active line of research with many applications in the fields of cognitive sciences, artificial intelligence, Information Retrieval (IR) and genomics, among others. The most successful approach to estimating human similarity judgments is the family of ontology-based semantic similarity measures, based on WordNet for general-domain applications, on MeSH and SNOMED for biomedical applications, and on the Gene Ontology (GO) for genomics. The advent of the Semantic Web has encouraged the emergence of a novel family of IR models and semantic search systems based on ontologies. In this latter scenario, ontologies have also been extensively used as semantic conceptual spaces for indexing and representing large collections of documents and other types of semantically annotated information. This thesis introduces two new families of ontology-based semantic similarity measures and Information Content (IC) models based on WordNet, together with the largest experimental surveys reported in the literature. Our experiments are based on our own software implementation of most methods reported in the literature.
In addition, this thesis makes several significant contributions to the reproducibility of word similarity benchmarks, ontology-based semantic similarity measures and IC models, as follows: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML), based on PosetHERep, which implements most ontology-based semantic similarity measures and IC models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip, with the aim of exactly reproducing the experimental surveys in all our previous works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist in the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measure libraries. Our novel family of ontology-based semantic similarity measures is based on two previously unconsidered notions: a generalization of the classic Jiang-Conrath (J&C) distance to any type of taxonomy, based on an IC-based weighted graph derived from the conditional probabilities between child and parent concepts, and a non-linear normalization function that converts ontology-based semantic distances into similarity functions. Likewise, our new family of intrinsic and corpus-based IC models is based on two previously unconsidered notions: the preservation of the probabilistic structure of the taxonomy associated with the conditional probabilities between child and parent concepts, and the explicit consideration of a cognitive similarity notion in the definition of the IC model.
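The link between the classic J&C distance and an IC-weighted graph can be illustrated with a minimal sketch. It is not the thesis's implementation or the HESML API: the toy taxonomy, the probabilities and every function name below are hypothetical, and the edge weight -log p(child | parent) is one plausible reading of "an IC-based weighted graph derived from the conditional probabilities". On a tree taxonomy where p(child | parent) = p(child) / p(parent), the shortest weighted path between two concepts coincides with the classic distance IC(a) + IC(b) - 2 IC(LCA(a, b)); on taxonomies with multiple inheritance the two can differ, and the shortest-path form remains well defined.

```python
import heapq
import math

# Hypothetical toy tree taxonomy: child -> parent.
PARENT = {"animal": "entity", "plant": "entity", "dog": "animal", "cat": "animal"}
# Hypothetical concept probabilities (each child is less probable than its parent).
PROB = {"entity": 1.0, "animal": 0.5, "plant": 0.5, "dog": 0.25, "cat": 0.125}


def ic(concept):
    """Information content of a concept: -log p(concept)."""
    return -math.log(PROB[concept])


def edge_weight(child):
    """Assumed edge weight: -log p(child | parent), with p(child | parent) = p(child) / p(parent)."""
    return -math.log(PROB[child] / PROB[PARENT[child]])


def build_graph():
    """Undirected weighted graph over the taxonomy edges."""
    graph = {concept: [] for concept in PROB}
    for child, parent in PARENT.items():
        weight = edge_weight(child)
        graph[child].append((parent, weight))
        graph[parent].append((child, weight))
    return graph


def weighted_distance(graph, source, target):
    """Length of the shortest weighted path (Dijkstra's algorithm)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph[node]:
            candidate = dist + weight
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return math.inf


def lowest_common_ancestor(a, b):
    """Deepest shared ancestor of a and b in the tree (a concept is its own ancestor)."""
    ancestors = set()
    node = a
    while True:
        ancestors.add(node)
        if node not in PARENT:
            break
        node = PARENT[node]
    node = b
    while node not in ancestors:
        node = PARENT[node]
    return node


def classic_jc_distance(a, b):
    """Classic Jiang-Conrath distance: IC(a) + IC(b) - 2 * IC(LCA(a, b))."""
    return ic(a) + ic(b) - 2.0 * ic(lowest_common_ancestor(a, b))


def to_similarity(distance):
    """Stand-in non-linear normalization; NOT the function proposed in the thesis."""
    return 1.0 / (1.0 + distance)
```

For example, `weighted_distance(build_graph(), "dog", "cat")` and `classic_jc_distance("dog", "cat")` agree on this toy tree up to floating-point rounding.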
Our new IC-based similarity measures outperform the state-of-the-art measures in a statistically significant manner, whilst our new family of IC models obtains results that rival the state-of-the-art methods and sets an open framework for the derivation of novel intrinsic IC models based on alternative methods for estimating the conditional probability between child and parent concepts. In turn, PosetHERep provides a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and supports an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research in the area by offering a simpler and more efficient software architecture than the current software libraries. HESML outperforms the state-of-the-art semantic measure libraries by several orders of magnitude and shows that it is possible to improve their performance and scalability significantly, without caching, by using PosetHERep. Our large experimental surveys, which include most similarity measures and IC models based on WordNet reported in the literature, also led us to uncover several reproducibility problems in the replication of methods and experiments previously reported in the literature, as well as contradictory results.
Likewise, our experimental surveys allow us to refute two common beliefs held among the research community: (1) the belief that intrinsic IC models outperform corpus-based ones, which is refuted by our results, and (2) the belief that the classic IC-based similarity measures outperform overall the family of path-based semantic measures, which is refuted by our conclusion that only a small set of recent hybrid IC-based similarity measures obtains a statistically significant higher Spearman correlation value than the family of path-based similarity measures. This latter fact explains some unexpected results in information retrieval applications based on similarity measures, in which several authors point out that there is no statistically significant difference between the performance obtained by the family of classic semantic similarity measures based on IC models and that of other classic measures based on the length of the shortest path between concepts when the Spearman correlation metric is used.
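Since the comparisons above rest on the Spearman correlation between the scores returned by a similarity measure and human judgments, a minimal sketch of that metric may help. The function names are hypothetical and the implementation is a plain textbook one (Pearson correlation of average ranks, which handles ties); it is not taken from the thesis or from HESML, and it omits the statistical significance testing that the thesis relies on.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With made-up human ratings `[3.9, 3.1, 0.5, 1.2]` for four word pairs, a measure scoring them `[0.9, 0.7, 0.1, 0.3]` preserves the human ordering exactly, whereas one scoring them `[0.8, 0.9, 0.2, 0.1]` swaps two adjacent pairs twice and therefore correlates less strongly.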