Detección de patrones psicolingüísticos para el análisis de lenguaje subjetivo en español (Detection of psycholinguistic patterns for the analysis of subjective language in Spanish)

Salas Zarate, Maria Del Pilar

Supervised by:

Rafael Valencia García Supervisor
Miguel Ángel Rodríguez García Supervisor

University of defense: Universidad de Murcia

Date of defense: 16 May 2017

Examination committee:

Jesualdo Tomás Fernández Breis Chair
Alejandro Rodríguez González Secretary
José Antonio Miñarro Giménez Member

Type: Thesis

Teseo: 144043

Abstract

AIMS OF THE THESIS. The automatic classification of opinions requires a multidisciplinary effort in which linguistics and natural language processing (NLP) play an important role. Figurative language such as irony, sarcasm, and satire is an important aspect to consider when classifying opinions, because the double meaning expressed in an opinion or comment can reverse its polarity. The main goal of this thesis is to detect psycholinguistic patterns for the analysis of subjective language in Spanish. Four specific aims were established: 1) design of a method for detecting psycholinguistic patterns for sentiment analysis; 2) design of a method for detecting psycholinguistic patterns for the analysis of satirical and non-satirical texts; 3) validation of the sentiment analysis method in different domains, namely tourism and movies; 4) validation of the method for the automatic detection of satire in the news domain.

METHODOLOGY. First, a study of the state of the art is carried out, covering natural language processing technologies, sentiment analysis, and subjective language. Specifically, it examines the different levels of natural language processing, the main approaches to sentiment analysis, the levels at which opinions are processed, knowledge bases, available linguistic resources, and the main techniques for detecting figurative language. Subsequently, a method for sentiment analysis and satire detection based on psycholinguistic features is designed and implemented. Finally, the proposal is validated in different domains: the sentiment analysis method is applied to the tourism and movie domains, and the satire detection method is applied to the news domain in social networks.

RESULTS. The main contributions of this work are:
- A method for sentiment classification and satire detection. It classifies opinions as positive, negative, neutral, very positive, or very negative, and tweets as satirical or non-satirical.
- A process for pre-processing tweets in Spanish.
- A corpus in the tourism domain. It contains 1600 reviews of hotels, restaurants, museums, and other topics, each labeled with its polarity (positive, negative, neutral, very positive, very negative).
- A corpus of satirical and non-satirical tweets. It consists of 10000 tweets, labeled as satirical or non-satirical, extracted from various Twitter accounts.
- A set of psycholinguistic features for sentiment classification and satire detection.

CONCLUSIONS. The automatic classification of opinions requires a multidisciplinary approach in which linguistics and natural language processing play an important role. These disciplines made it possible to better understand human language, classify opinions, and summarize the sentiments expressed in texts about products and other aspects. Figurative language, however, is one of the most difficult topics in NLP: unlike literal language, the writer takes advantage of linguistic devices such as metaphor, analogy, and ambiguity, among others, to convey more complex meanings. This kind of language is difficult to understand not only for computers but also for humans. This thesis described a method for the detection of psycholinguistic patterns for sentiment analysis and the automatic detection of satire. Psycholinguistic features, together with natural language processing and data mining techniques, proved effective for detecting sentiment and satire. In addition, validating the methods in different domains demonstrated the effectiveness of the approach for classifying opinions and tweets.