Development of a supervised machine learning system for automatic word sense disambiguation using DAMIEN (Data Mining Encountered)

Fredy Núñez Torres; María Beatriz Pérez Cabello de Alba

Journal:

RAEL: revista electrónica de lingüística aplicada

ISSN: 1885-9089

Year of publication: 2022

Volume: 21

Issue: 1

Pages: 150-178

Type: Article

Abstract

Lexical ambiguity is one of the greatest challenges in natural language processing and, in particular, in the handling of computerized linguistic resources. In this paper we address word sense disambiguation within the DAMIEN (Data Mining ENcountered) computing environment, a tool that integrates techniques from several disciplines within text analysis (i.e. corpus linguistics, statistics and text mining) to support linguistic research tasks (e.g. data collection, information extraction and text classification). As an illustrative experiment, we study the polysemous Spanish lexical units “cabeza” ('head'), “cara” ('face') and “carta” ('letter'), and present the results of the automatic disambiguation system developed with DAMIEN. Among the models the environment offers, we chose supervised machine learning with the naive Bayes algorithm because it is the method that has yielded the best results for automatic word sense disambiguation. It is a mathematical model that extracts information from previously labeled data sets (the training corpus) so that the machine can automatically classify new data sets (the test corpus). We highlight the flexibility and richness of the DAMIEN environment both for handling computerized linguistic resources and for setting up natural language processing experiments.
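
The train-then-classify workflow described in the abstract can be illustrated with a minimal naive Bayes sketch. This is not DAMIEN's implementation, which the abstract does not detail. The toy sentences, the sense labels "letter" and "menu", the function names, and the use of add-one (Laplace) smoothing are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training corpus, invented for illustration (not from the paper):
# each item pairs the context words around "carta" with a hand-assigned sense label.
TRAIN = [
    ("escribí una carta a mi madre".split(), "letter"),
    ("envió la carta por correo".split(), "letter"),
    ("el camarero trajo la carta de vinos".split(), "menu"),
    ("pedimos la carta en el restaurante".split(), "menu"),
]

def train(examples):
    """Count how often each sense occurs and how often each word occurs per sense."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def classify(words, model):
    """Return the sense with the highest log prior plus summed log likelihoods."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior of the sense
        # Add-one smoothing (an assumption of this sketch): denominator is
        # the sense's token count plus the vocabulary size.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train(TRAIN)
# Unseen test sentence (also hypothetical).
print(classify("mandó una carta por correo".split(), model))
```

On this toy data the test sentence is assigned the "letter" sense, because "una", "por" and "correo" occur only in the training contexts labeled "letter". A real experiment would need a much larger labeled corpus and a held-out test set to measure accuracy.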

Data source: Dialnet