Using annotated discourse information of an RST Spanish-Chinese treebank for translation and language learning tasks

Cao, Shuyuan

Supervised by:

Iria da Cunha Fanego (Supervisor)
Mikel Iruskieta Quintian (Co-supervisor)

University of defense: Universitat Pompeu Fabra

Date of defense: 9 November 2018

Examination committee:

M. Aranzazu Diaz de Ilarraza Sanchez (Chair)
Mireia Vargas Urpí (Secretary)
Juliano Desiderato Antonio (Member)

Type: Thesis

Teseo: 574265

Abstract

As one of the essential elements of Natural Language Processing (NLP), discourse has attracted much attention in recent years. Many studies explore how discourse elements affect different NLP research areas, such as parsing, sentiment analysis, and machine translation evaluation, among others. Moreover, as discourse analysis has developed, treebanks annotated with discourse information for different languages have made a great contribution to advancing NLP research. Spanish and Chinese are two of the most widely spoken languages in the world, and the language pair occupies an important position in NLP studies. Therefore, this study carries out a discourse analysis of the two languages by annotating their discourse similarities and differences under the theoretical framework of Rhetorical Structure Theory (RST) by Mann and Thompson (1988). The main objective of the study is to develop, based on the annotation results, a protocol that includes recommendations for Spanish-Chinese translation. In addition, in today's globalized society, communication between Spanish and Chinese speakers is increasingly intensive. Therefore, another aim of our study is to develop resources for language learning between Spanish and Chinese. To develop the protocol, we first build a Spanish-Chinese parallel corpus and annotate the discourse information of the entire corpus. We then evaluate the annotation results following a qualitative method to ensure their high quality. Lastly, we summarize the discourse similarities and differences to produce the protocol. Regarding language learning between the two languages, we make full use of the manually annotated discourse markers (DMs) to develop a question-answering module. In recent years, there have been few contrastive works on Spanish and Chinese in discourse analysis. Therefore, this PhD study aims to partially fill a knowledge gap in the contrastive study of Spanish and Chinese.