Automatic classification of sexism in social networks

  1. RODRÍGUEZ SÁNCHEZ, FRANCISCO MIGUEL
Supervised by:
  1. Jorge Carrillo de Albornoz Director
  2. Laura Plaza Morales Director

Defence university: UNED. Universidad Nacional de Educación a Distancia

Defence date: 28 March 2025

Committee:
  1. Anselmo Peñas Padilla Chair
  2. Mariona Taulé Delor Secretary
  3. Isabel Segura Bedmar Committee member

Type: Thesis

Teseo: 869496

Abstract

The rapid growth of social networks has facilitated anonymous communication among individuals from diverse backgrounds. While the positive effects of this global communication are undeniable, the role of women within online spaces has unfortunately gained attention due to a concerning rise in hate speech and sexist attitudes directed towards them. Exposure to this sexist language is extremely harmful, impacting both women and society as a whole. Companies are continuously reviewing their policies to include additional types of abusive behavior and are creating new ways to eradicate hateful content from their platforms. Despite significant efforts and the deployment of many human resources, they face challenges managing the vast amount of data generated by users. Natural language processing (NLP) is an essential tool for combating this issue, and the detection and analysis of sexist language have become major areas in this field. This thesis presents a comprehensive approach to automatic sexism detection, focusing on the development of a robust dataset and a series of computational models for sexism detection and categorization. We propose a new sexism categorization adapted to online environments and develop the EXIST dataset, a novel, annotated dataset that categorizes online sexism into various subtypes, including implicit and explicit forms. To promote research in this area, we organized the EXIST 2021 and 2022 challenges, competitions that brought together researchers and practitioners to develop and evaluate their approaches to sexism detection using the EXIST dataset. We provide an in-depth analysis of the results, including an examination of the difficulty of detecting sexism across different categories and the impact of language-specific aspects on sexism categorization.
Furthermore, we develop a novel classification system that employs in-domain unlabeled data through unsupervised task-adaptation techniques and semi-supervised learning, employing an efficient single multilingual transformer model. We also integrate a Sentence-BERT layer to enhance our system with semantically meaningful sentence embeddings, achieving state-of-the-art results in all EXIST tasks and competitions. Finally, we summarize our contributions and suggest future research directions in the area of online sexism research.