Transcription of historical newspapers: the CLARA-HD approach

  1. Antonio Menta (1)
  2. Eva Sánchez-Salido (1)
  3. Ana García-Serrano (1)

  (1) ETSI Informática, UNED, Madrid, Spain
Book:
SEPLN-PD 2022: Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations
  1. Miguel A. Alonso (ed.)
  2. Margarita Alonso-Ramos (ed.)
  3. Carlos Gómez-Rodríguez (ed.)
  4. David Vilares (ed.)
  5. Jesús Vilares (ed.)

Publisher: CEUR Workshop Proceedings

Year of publication: 2022

Pages: 70-74

Type: Book chapter

Abstract

The analysis of historical newspapers from the 18th, 19th, and early 20th centuries requires digitized sources of sufficient quality and the use of domain-specific or language-specific resources. Any approach using current technologies finds that most of the NLP models available for transcription or entity recognition are trained on texts in "current languages". If, in addition, the challenge consists of extracting information from historical newspapers in Spanish, the complexity increases, since the normalization of Spanish is relatively "modern" and it is necessary to refine the NLP models or generate new resources. This demonstration, for the corpus built from the BNE Digital Hemeroteca, Diario de Madrid (1788-1825), shows the steps followed for its automatic transcription using a defined model (99% performance), within the framework of the CLARA-HD project. Finally, some initial conclusions are included.