Transcripción de periódicos históricos: aproximación CLARA-HD (Transcription of historical newspapers: the CLARA-HD approach)
- Antonio Menta 1
- Eva Sánchez-Salido 1
- Ana García-Serrano 1
- 1 ETSI Informática, UNED, Madrid, Spain
- Miguel A. Alonso (ed.)
- Margarita Alonso-Ramos (ed.)
- Carlos Gómez-Rodríguez (ed.)
- David Vilares (ed.)
- Jesús Vilares (ed.)
Publisher: CEUR Workshop Proceedings
Year of publication: 2022
Pages: 70-74
Type: Book chapter
Abstract
The analysis of historical newspapers from the 18th, 19th, and early 20th centuries requires digitized sources of sufficient quality and the use of domain- or language-specific resources. Any approach using current technologies finds that most of the NLP models available for transcription or entity recognition are trained on texts in "current languages". If, in addition, the challenge consists of extracting information from historical newspapers in Spanish, the complexity increases, since the normalization of Spanish is relatively “modern” and it is necessary to refine the NLP models or generate new resources. This demonstration shows the steps followed to automatically transcribe the corpus built from the BNE Digital Hemeroteca, Diario de Madrid (1788-1825), using a defined model (99% performance), within the framework of the CLARA-HD project. Finally, some initial conclusions are included.