Characterizing Spans for Sequence Labeling: A Case on Anglicism Detection

Álvarez Mellado, Elena; Gonzalo, Julio

Journal:

Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 235-246

Type: Article

Abstract

We present a set of dimensions for characterizing spans in the evaluation of sequence labeling and apply them to the task of anglicism detection in Spanish. The results show that the dimensions help unmask limitations that went unnoticed in standard evaluation.

Data source: Dialnet