Entity-based filtering and topic detection for online reputation monitoring in Twitter
- Spina, Damiano
- Julio Gonzalo Arroyo, Director
- Enrique Amigó, Director
Defence university: UNED. Universidad Nacional de Educación a Distancia
Date of defence: 25 September 2014
- María Felisa Verdejo Maíllo, Chair
- Pablo Castells Azpilicueta, Secretary
- Manos Tsagkias, Committee member
Type: Thesis
Abstract
With the rise of social media channels such as Twitter (the most popular microblogging service), control over what is said online about entities (companies, people or products) has shifted from the entities themselves to users and consumers. This has created the need to monitor the reputation of those entities online. In this context, it is only natural to witness a significant growth in demand for text mining software for Online Reputation Monitoring: automatic tools that help to process, understand and aggregate large streams of facts and opinions about a company or individual. Despite the variety of Online Reputation Monitoring tools on the market, there is still no standard evaluation framework: a widely accepted set of task definitions, evaluation measures and reusable test collections for the problem. In fact, there is not even consensus on which tasks make up the Online Reputation Monitoring process, for which a system should minimize the effort of the user. In the context of a collective effort to identify and formalize the main challenges in the Online Reputation Monitoring process in Twitter, we have participated in the definition of tasks and the subsequent creation of suitable test collections (the WePS-3, RepLab 2012 and RepLab 2013 evaluation campaigns), and we have studied in depth two of the identified challenges: filtering (Is a tweet related to a given entity of interest?), modeled as a binary classification task, and topic detection (What is being said about an entity in a given tweet stream?), which consists of clustering tweets by topic. Compared to previous studies on Twitter, our problem lies in the long tail: with few exceptions, the volume of information related to a specific entity (organization or company) at a given time is orders of magnitude smaller than that of Twitter trending topics, making the problem much more challenging than identifying Twitter trends.
We rely on three building blocks to propose different approaches to these two tasks: filter keywords, external resources (such as Wikipedia or representative pages of the entity of interest) and entity-specific training data, when available. We have found that the notion of filter keywords (expressions that, if present in a tweet, indicate a high probability that it is either related or unrelated to the entity of interest) can be effectively used to tackle the filtering task. Here, (i) the specificity of a term with respect to the entity's tweet stream is a useful feature for identifying keywords, and (ii) the association between a term and the entity's Wikipedia page helps to differentiate positive from negative filter keywords, especially when it is averaged over the term's most frequently co-occurring terms. Exploring the nature of filter keywords also led us to the conclusion that there is a gap between the vocabulary that characterizes a company on Twitter and the vocabulary associated with the company on its homepage, on Wikipedia, and even on the Web at large. We have also found that, when entity-specific training data is available (as in the known-entity scenario), it is more cost-effective to use a simple Bag-of-Words classifier. When enough training data is available (around 700 tweets per entity), Bag-of-Words classifiers can be used effectively for the filtering task. Moreover, they work well in an active learning scenario, where the system updates its classification model with the stream of annotations and interactions produced by the reputation expert during the monitoring process. In this setting, we found that by selecting for labeling the tweets on which the classifier is least confident (margin sampling), the cost of creating a bulk training set can be reduced by 90% after inspecting 10% of the test data.
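The active learning loop described above can be sketched as follows. This is a minimal illustration using scikit-learn, assuming a Bag-of-Words representation and a probabilistic classifier; the example tweets, labels and model choice are illustrative, not the thesis implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled seed set and unlabeled pool (illustrative data, not from the thesis).
labeled = ["acme earnings beat estimates", "acme quarterly report released",
           "wile e coyote uses acme products", "cartoon acme anvil gag"]
labels = [1, 1, 0, 0]  # 1 = related to the company, 0 = unrelated
pool = ["acme stock rises after report", "acme rocket skates in the cartoon",
        "new acme ceo announced", "acme dynamite gag compilation"]

# Bag-of-Words features plus a simple probabilistic classifier.
vec = CountVectorizer()
X_lab = vec.fit_transform(labeled)
clf = LogisticRegression().fit(X_lab, labels)

# Margin sampling: query the tweet whose two highest class probabilities
# are closest, i.e. where the classifier is least confident.
probs = clf.predict_proba(vec.transform(pool))
sorted_p = np.sort(probs, axis=1)
margins = sorted_p[:, -1] - sorted_p[:, -2]
query_idx = int(np.argmin(margins))
print(pool[query_idx])  # the tweet to send to the reputation expert for labeling
```

In a full loop, the expert's label for the queried tweet is added to the labeled set and the classifier is retrained, repeating until the annotation budget is exhausted.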
Unlike in many other applications of active learning to Natural Language Processing tasks, margin sampling works better than random sampling. For the topic detection problem, we considered two main strategies: the first is inspired by the notion of filter keywords and clusters terms as an intermediate step towards document clustering. The second, and most successful, learns a pairwise tweet similarity function from previously annotated data, using a variety of content-based and Twitter-based features, and then applies a clustering algorithm on top of the learned similarity function. Our experiments indicate that (i) Twitter signals can improve the topic detection process with respect to using content signals only, and (ii) learning a similarity function is a flexible and efficient way of introducing supervision into the topic detection clustering process. The performance of our best system is substantially better than state-of-the-art approaches and approaches the inter-annotator agreement rate for topic detection annotations in the RepLab 2013 dataset (to our knowledge, the largest dataset available for Online Reputation Monitoring). A detailed qualitative inspection of the data further reveals two types of topics detected by reputation experts: reputation alerts/issues (which usually spike in time) and organizational topics (which are usually stable over time). Along with our contribution to building a standard evaluation framework to study the Online Reputation Monitoring problem from a scientific perspective, we believe the outcome of our research has practical implications and may help the development of semi-automatic tools to assist reputation experts in their daily work.
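The learned-similarity strategy can be sketched as a two-step pipeline: train a classifier on features of tweet pairs to predict "same topic", then treat its predicted probability as a similarity and cluster with hierarchical agglomerative clustering. The feature set, model and random data below are placeholders standing in for the thesis's content-based and Twitter-based features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.ensemble import RandomForestClassifier

# Hypothetical pairwise features (e.g. term overlap, shared hashtags, shared
# URLs, time gap); random numbers and toy labels stand in for real annotations.
rng = np.random.default_rng(0)
train_pairs = rng.random((200, 4))
same_topic = (train_pairs[:, 0] + train_pairs[:, 1] > 1.0).astype(int)

# Step 1: learn a pairwise similarity function from annotated tweet pairs.
sim_model = RandomForestClassifier(random_state=0).fit(train_pairs, same_topic)

# Step 2: score every unordered pair of test tweets, convert similarity to
# distance, and run average-link hierarchical agglomerative clustering.
n = 6                                            # number of test tweets
test_feats = rng.random((n * (n - 1) // 2, 4))   # one feature row per pair
sims = sim_model.predict_proba(test_feats)[:, 1]
dists = 1.0 - sims                               # condensed distance vector
Z = linkage(dists, method="average")
topics = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram
print(topics)  # topic id assigned to each tweet
```

The clustering step is unsupervised; all supervision enters through the learned similarity, which is what makes this design easy to retrain as new annotations arrive.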