Characterizing Spans for Sequence Labeling: A Case on Anglicism Detection

  1. Álvarez Mellado, Elena
  2. Gonzalo, Julio
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 235-246

Type: Article

Abstract

We propose a set of formal dimensions to characterize spans in sequence labeling evaluation. We apply them to a dataset and model results obtained for anglicism detection in Spanish. Results show that the best performing system is outperformed by other models on certain types of spans. Our methodology can uncover limitations in performance that go unnoticed with standard evaluation.
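The idea of breaking evaluation down by span type can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the dimensions used, so span length (single-token vs. multi-token) stands in as a hypothetical dimension, and the spans and system outputs are toy data, not results from the paper.

```python
from collections import defaultdict

def length_bucket(span):
    """Hypothetical dimension: single-token vs. multi-token span."""
    _, start, end = span  # (sentence_id, start, end), end is exclusive
    return "single" if end - start == 1 else "multi"

def overall_recall(gold, predicted):
    """Exact-match recall over all gold spans."""
    predicted = set(predicted)
    return sum(span in predicted for span in gold) / len(gold)

def recall_by_bucket(gold, predicted, bucket_fn=length_bucket):
    """Exact-match recall over gold spans, broken down by bucket."""
    predicted = set(predicted)
    hits, totals = defaultdict(int), defaultdict(int)
    for span in gold:
        bucket = bucket_fn(span)
        totals[bucket] += 1
        hits[bucket] += span in predicted
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}

# Toy data (invented): three single-token and two multi-token gold spans.
gold = [(0, 2, 3), (1, 0, 1), (2, 1, 2), (0, 5, 7), (1, 4, 6)]
system_a = [(0, 2, 3), (1, 0, 1), (2, 1, 2), (1, 4, 6)]
system_b = [(0, 2, 3), (0, 5, 7), (1, 4, 6)]

for name, predicted in [("A", system_a), ("B", system_b)]:
    print(name, overall_recall(gold, predicted), recall_by_bucket(gold, predicted))
```

In this toy setup, system A has the higher overall recall, yet system B recovers more of the multi-token spans, which is the kind of difference an aggregate score alone would not show.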

Bibliographic References

  • Alex, B. 2008. Automatic detection of English inclusions in mixed-lingual data with an application to parsing. PhD Thesis, University of Edinburgh.
  • Alvarez Mellado, E. 2020. Lázaro: An extractor of emergent anglicisms in Spanish newswire. Master’s thesis, Brandeis University.
  • Alvarez Mellado, E., L. Espinosa Anke, J. Gonzalo Arroyo, C. Lignos, and J. Porta Zamorano. 2021. Overview of ADoBo 2021: Automatic Detection of Unassimilated Borrowings in the Spanish Press. Procesamiento del Lenguaje Natural, 67:277–285, September.
  • Alvarez-Mellado, E. and C. Lignos. 2022. Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3868–3888, Dublin, Ireland, May. Association for Computational Linguistics.
  • Andersen, G. 2012. Semi-automatic approaches to Anglicism detection in Norwegian corpus data. In C. Furiassi, V. Pulcini, and F. Rodríguez González, editors, The anglicization of European lexis. John Benjamins, pages 111–130.
  • Bender, E. M. and B. Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6:587–604. MIT Press, Cambridge, MA.
  • Bernier-Colborne, G. and P. Langlais. 2020. HardEval: Focusing on Challenging Tokens to Assess Robustness of NER. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1704–1711, Marseille, France, May. European Language Resources Association.
  • Chiruzzo, L., M. Agüero-Torales, G. Giménez-Lugo, A. Alvarez, Y. Rodríguez, S. Góngora, and T. Solorio. 2023. Overview of GUASPA at IberLEF 2023: Guarani-Spanish Code Switching Analysis. Procesamiento del Lenguaje Natural, 71:321–328, September.
  • Fu, J., P. Liu, and G. Neubig. 2020. Interpretable Multi-dataset Evaluation for Named Entity Recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6058–6069, Online, November. Association for Computational Linguistics.
  • Furiassi, C. and K. Hofland. 2007. The retrieval of false anglicisms in newspaper texts. In Corpus Linguistics 25 Years On. Brill Rodopi, pages 347–363.
  • Gorman, K. and S. Bedrick. 2019. We Need to Talk about Standard Splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2786–2791, Florence, Italy, July. Association for Computational Linguistics.
  • Lin, B. Y., W. Gao, J. Yan, R. Moreno, and X. Ren. 2021. RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3728–3737, Online and Punta Cana, Dominican Republic, November. Association for Computational Linguistics.
  • Losnegaard, G. S. and G. I. Lyse. 2012. A data-driven approach to anglicism identification in Norwegian. In G. Andersen, editor, Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian. John Benjamins Publishing, pages 131–154.
  • Papay, S., R. Klinger, and S. Padó. 2020. Dissecting Span Identification Tasks with Performance Prediction. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4881–4895, Online, November. Association for Computational Linguistics.
  • Real Academia Española. 2011. Ortografía de la Lengua Española. Planeta Publishing, April.
  • Serigos, J. R. L. 2017. Applying corpus and computational methods to loanword research: new approaches to Anglicisms in Spanish. August.
  • Søgaard, A., S. Ebert, J. Bastings, and K. Filippova. 2021. We Need To Talk About Random Splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1823–1832, Online, April. Association for Computational Linguistics.
  • Tsvetkov, Y. and C. Dyer. 2016. Crosslingual bridges with models of lexical borrowing. Journal of Artificial Intelligence Research, 55:63–93.
  • Tu, J. and C. Lignos. 2021. TMR: Evaluating NER Recall on Tough Mentions. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 155–163, Online, April. Association for Computational Linguistics.
  • Vajjala, S. and R. Balasubramaniam. 2022. What do we really know about State of the Art NER? In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5983–5993, Marseille, France, June. European Language Resources Association.
  • Zhou, L., P. A. Moreno-Casares, F. Martínez-Plumed, J. Burden, R. Burnell, L. Cheke, C. Ferri, A. Marcoci, B. Mehrbakhsh, Y. Moros-Daval, S. Ó hÉigeartaigh, D. Rutar, W. Schellaert, K. Voudouris, and J. Hernández-Orallo. 2023. Predictable Artificial Intelligence, October. arXiv:2310.06167 [cs].