RoBERTime: A novel model for the detection of temporal expressions in Spanish

Araujo Serna, Lourdes; Martínez Romo, Juan; Sánchez Castro Fernández, Alejandro

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 70

Pages: 39-51

Type: Article

Abstract

Temporal expressions are all words that convey temporality. Detecting or extracting them is a complex task, since it depends on the domain of the text, the language, and the writing style. Research on them in Spanish, and more specifically in the clinical domain, is scarce, mainly because of the lack of annotated corpora. This work proposes using large language models to address the task, comparing the performance of five models with different characteristics. After a process of experimentation and fine-tuning, a new model called RoBERTime is created for detecting temporal expressions in Spanish, with a particular focus on the clinical domain. The model is publicly available. RoBERTime achieves state-of-the-art results on the E3C and Timebank corpora and is the first public model for temporal expression detection in Spanish specialized in the clinical domain.

Data source: Dialnet