Alberti, a Multilingual Domain-Specific Language Model for Poetry Analysis

  1. Ros, Salvador
  2. González Blanco García, Elena
  3. Rosa, Javier de la
  4. Pérez Pozo, Álvaro
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 71

Pages: 215-225

Type: Article


Abstract

The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings the problem is exacerbated, since scansion and rhyme systems exist only for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the first multilingual pre-trained language model for poetry. Using domain-specific pre-training (DSP), we extend the capabilities of multilingual BERT with a corpus of over 12 million verses in 12 languages. We evaluate its performance on two structural poetry tasks: stanza type classification in Spanish, and metrical pattern prediction for Spanish, English, and German. In both cases, Alberti outperforms multilingual BERT and other transformer-based models of similar size, and even achieves state-of-the-art results for German when compared against rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.
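The domain-specific pre-training described above amounts to continuing BERT's masked-language-model training on a verse corpus rather than general text. As a rough illustration of the masking objective involved (the function name and the toy Spanish vocabulary below are illustrative choices, not taken from the paper), here is a minimal pure-Python sketch of BERT-style token corruption:

```python
import random

MASK = "[MASK]"
# Toy replacement vocabulary; a real tokenizer's vocabulary would be used.
VOCAB = ["soneto", "verso", "rima", "luz", "noche"]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style corruption: each position is selected with probability
    mask_prob; a selected token becomes [MASK] 80% of the time, a random
    vocabulary token 10%, or stays unchanged 10%. Returns the corrupted
    sequence and per-position labels (None = not scored by the loss)."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # position excluded from the MLM loss
            corrupted.append(tok)
    return corrupted, labels
```

During DSP, batches of verses corrupted in this way are fed to the pre-trained multilingual model, which is trained to predict the original tokens at the selected positions; the architecture is unchanged, only the training data shifts to the poetry domain.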
