Computational Reproducibility of Named Entity Recognition methods in the biomedical domain

  1. Ana García Serrano
  2. Sebastian Hennig
  3. Andreas Nürnberger
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2021

Issue: 66

Pages: 141-152

Type: Article

Abstract

Unsupervised approaches to named entity recognition (NER) do not rely on corpora of labeled data, but on a knowledge source in which promising candidates are looked up to find the corresponding concept. In the biomedical domain, such a source exists: the Unified Medical Language System (UMLS). This article evaluates and compares three different unsupervised NER models that use UMLS, namely MetaMap, cTAKES, and MetaMapLite, taking as a starting point the results published by Demner-Fushman, Rogers, and Aronson (2017) and Reategui and Ratte (2018). To this end, the Unsupervised Biomedical Named Entity Recognition (UB-NER) framework is developed, and results are reported for experiments covering the three models, five datasets, and two NER tasks.

Bibliographic References

  • Aronson, A.R. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc. AMIA Annual Symposium, pages 17–21, ISSN 1531605X.
  • Benavent, J., X. Benavent, E. de Ves, R. Granados, and A. Garcia-Serrano. 2010. Experiences at ImageCLEF 2010 using CBIR and TBIR Mixing Information Approaches. CLEF CEUR-WS, vol 1176.
  • Bhasuran, B., G. Murugesan, S. Abdulkadhar, and J. Natarajan. 2016. Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases. Journal of Biomedical Informatics 64 (Dec), pp. 1–9. doi: 10.1016/j.jbi.2016.09.009.
  • Campos, D., S. Matos, and J. L. Oliveira. 2015. Gimli: Open source and high-performance biomedical name recognition. BMC Bioinformatics 14.1 (Feb), p. 54. doi: 10.1186/1471-2105-14-54.
  • Cho, M., J. Ha, C. Park, and S. Park. 2020. Combinatorial feature embedding based on CNN and LSTM for biomedical named entity recognition. Journal of Biomedical Informatics 103 (Mar), p. 103381. doi: 10.1016/j.jbi.2020.103381.
  • Demner-Fushman, D., W. J. Rogers, and A. R. Aronson. 2017. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J. of the American Medical Informatics Association 24.4, pp. 841–844. doi: 10.1093/jamia/ocw177.
  • Devlin, J., M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Technical report. https://github.com/tensorflow/tensor2tensor.
  • Dogan, R.I., R. Leaman, and Z. Lu. 2014. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10, doi: 10.1016/j.jbi.2013.12.006.
  • Hennig, S. 2020. An experimental survey of Named Entity Recognition methods in the biomedical domain. Master Data and Knowledge Engineering. Faculty of Computer Science. OVGU. A. Garcia-Serrano and A. Nürnberger supervisors.
  • Hennig, S. and A. Garcia-Serrano. 2020. Reproducible experiments on the master thesis: An experimental survey of Named Entity Recognition methods in the biomedical domain, UNED e-cienciaDatos, V1 (dec) https://doi.org/10.21950/DYAZRE.
  • Lample, G., M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. 2016. Neural architectures for named entity recognition. Proc. of NAACL HLT 2016, pp. 260–270.
  • Lara-Clares, A. and A. Garcia-Serrano. 2019. LSI2_UNED at eHealth-KD Challenge 2019: A Few-shot Learning Model for Knowledge Discovery from eHealth Documents. CEUR-WS, vol 2421, IberLEF. Bilbao, Spain.
  • Lastra-Díaz, J.J. and A. Garcia-Serrano. 2015a. A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence 46, 140-153.
  • Lastra-Díaz, J.J. and A. Garcia-Serrano. 2015b. A new family of information content models with an experimental survey on WordNet. Knowledge-Based Systems 89, 509-526.
  • Lee, J., W. Yoon, S. Kim, D. Kim, S. Kim, C. Ho So, and J. Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36.4 (Feb), pp. 1234–1240. doi: 10.1093/bioinformatics/btz682.
  • Merkel, D. 2014. Docker: lightweight Linux containers for consistent development and deployment. https://dl.acm.org/doi/10.5555/2600239.2600241.
  • Mowery, D. 2013. ShAReCLEF eHealth Evaluation Lab 2014 (Task 2): Disorder Attributes in Clinical Reports. PhysioNet https://doi.org/10.13026/0zgk-9j94.
  • Reategui, R. and S. Ratte. 2018. Comparison of MetaMap and cTAKES for entity extraction in clinical notes. BMC Medical Informatics and Decision Making 18.3, p. 74. doi: 10.1186/s12911-018-0654-2.
  • Sagae, K. and J. Tsujii. 2007. Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles. In Proc. of the EMNLP-CoNLL, 2007, pp. 1044–1050. https://www.aclweb.org/anthology/D071111
  • Savova, G., J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17(5):507–513. doi: 10.1136/jamia.2009.001560.
  • Segura-Bedmar, I. and P. Martínez. 2017. Simplifying drug package leaflets written in Spanish by using word embedding. Journal of Biomedical Semantics 8, 45. https://doi.org/10.1186/s13326-017-0156-7
  • Uzuner, A. 2009. Recognizing Obesity and Comorbidities in Sparse Data. Journal of the American Medical Informatics Association, 16(4):561–570.
  • Gang, Y., Y. Yang, X. Wang, H. Zhen, G. He, Z. Li, Y. Zhao, Q. Shu, and L. Shu. 2020. Adversarial active learning for the identification of medical concepts and annotation inconsistency. Journal of Biomedical Informatics 108 (Aug), p. 103481. https://doi.org/10.1016/j.jbi.2020.103481