The Observational Representation Framework and its Implications in Document Similarity, Feature Aggregation and Ranking Fusion

  1. Giner Martínez, Fernando
Supervised by:
  1. Enrique Amigó (Director)

Defence university: UNED. Universidad Nacional de Educación a Distancia

Defence date: 23 September 2021

Committee:
  1. Fermín Moscoso del Prado Martín (Chair)
  2. Víctor Fresno Fernández (Secretary)
  3. Julian Urbano Merino (Committee member)

Type: Thesis

Abstract

Document representation is a core issue in information access tasks. Representing documents requires managing features in terms of three aspects: weighting, redundancy and scaling (i.e., quantitative vs. discrete features). In supervised scenarios, this is done by maximizing effectiveness over specific tasks and training data. This thesis focuses instead on unsupervised scenarios, in which document representation is guided by how features are distributed throughout a document collection. Based on an analysis of the literature, we claim that traditional representation approaches are not able to capture weighting, redundancy and quantitativity simultaneously. We present the Observational Representation Framework (ORF), which overcomes this limitation by integrating aspects of representation models based on vector spaces, feature sets and information theory. We then explore the theoretical and practical implications of the ORF in three studies. In the first study, we use the ORF as a formal framework for document similarity and identify the strengths and weaknesses of existing similarity functions based on metric spaces (cosine distance, Euclidean distance, etc.), feature sets (Jaccard distance, Dice distance, etc.) and information theory (pointwise mutual information (PMI), Lin's similarity, conditional probability, etc.). To overcome the limitations observed in this analysis, we define the Information Contrast Model (ICM), a parametrized generalization of the PMI. In the second study, we empirically test the ability of the ORF to integrate heterogeneous features (i.e., features with discrete and continuous values) without requiring supervision, through experiments on message clustering for online reputation management. Finally, in the third study, we analyse the ORF as a formal basis for ranking fusion. Our formal analysis shows that, depending on the assumptions adopted, the ORF can accommodate different ranking fusion algorithms, such as averaging schemes and the Borda, Copeland and Unanimous Improvement Ratio (UIR) algorithms. Our experiments on six ranking fusion datasets shed light on which aspects of a scenario determine the suitability of different assumptions and ranking fusion algorithms.
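
For readers unfamiliar with the ranking fusion algorithms named in the abstract, the following is a minimal Python sketch of the textbook Borda and Copeland rules. It is an illustration only, not the thesis's ORF formulation: the function names `borda` and `copeland`, the tie-breaking by item name, and the assumption that every input ranking orders the same set of items are all choices made here for the example.

```python
from itertools import combinations


def borda(rankings):
    """Textbook Borda count: in a ranking of n items, the item at
    position p (0-based) receives n - 1 - p points; points are summed
    over all rankings."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    # Highest score first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))


def copeland(rankings):
    """Textbook Copeland rule: for each pair of items, the one preferred
    by a majority of rankings gains a point and the other loses one."""
    # Assumes every ranking contains exactly the same items.
    items = sorted(set(rankings[0]))
    scores = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        a_preferred = sum(r.index(a) < r.index(b) for r in rankings)
        b_preferred = len(rankings) - a_preferred
        if a_preferred > b_preferred:
            scores[a] += 1
            scores[b] -= 1
        elif b_preferred > a_preferred:
            scores[b] += 1
            scores[a] -= 1
    # Highest score first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Given three hypothetical input rankings `["d1", "d2", "d3"]`, `["d2", "d1", "d3"]` and `["d1", "d3", "d2"]`, both functions return `["d1", "d2", "d3"]`. The two rules can disagree on other inputs, because Borda uses positional scores whereas Copeland uses only pairwise majorities.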