RoBERTime: A novel model for the detection of temporal expressions in Spanish
- Araujo Serna, Lourdes
- Martínez Romo, Juan
- Sánchez Castro Fernández, Alejandro
ISSN: 1135-5948
Year of publication: 2023
Issue: 70
Pages: 39-51
Type: Article
Journal: Procesamiento del lenguaje natural
Abstract
Temporal expressions are all words that refer to temporality. Detecting or extracting them is a complex task, since it depends on the domain of the text, the language, and the writing style. Research on this task in Spanish, and more specifically in the clinical domain, is scarce, mainly because of the lack of annotated corpora. This work proposes using large language models to address the task and compares the performance of five models with different characteristics. After a process of experimentation and fine-tuning, a new model called RoBERTime is created for the detection of temporal expressions in Spanish, with a particular focus on the clinical domain. The model is publicly available. RoBERTime achieves state-of-the-art results on the E3C and Timebank corpora and is the first public model for temporal expression detection in Spanish specialized in the clinical domain.