Querying the Depths: Unveiling the Strengths and Struggles of Large Language Models in SPARQL Generation

  1. Ghajari, Adrián
  2. Ros, Salvador
  3. Pérez, Álvaro
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 271-281

Type: Article

Abstract

The emergence of the Semantic Web has led to a proliferation of structured data in the form of knowledge graphs, underscoring the need for natural language interfaces that make these repositories more accessible. The ability to pose a question in natural language and retrieve the answer through a SPARQL query is therefore of great importance. In this work, we examine how effectively in-context learning, supported by an agent-based architecture, helps construct SPARQL queries. Contrary to initial expectations, augmenting in-context learning prompts through agent-based mechanisms was found to reduce the effectiveness of large language models (LLMs): the added context appears to be treated as extraneous "noise", which exposes the limits of this approach. The results highlight the need to look more closely at model training and fine-tuning, with a focus on the relational aspects of ontology schemas.

Bibliographic References

  • Dorobăț, I. C. and V. Posea. 2020. onIQ: An Ontology-Independent Natural Language Interface for Building SPARQL Queries. In 2020 IEEE 16th International Conference on Intelligent Computer Communication and Processing (ICCP), pages 139–144, September.
  • Golovneva, O., Z. Allen-Zhu, J. Weston, and S. Sukhbaatar. 2024. Reverse Training to Nurse the Reversal Curse, March. arXiv:2403.13799 [cs].
  • Guo, D., Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence, January. arXiv:2401.14196 [cs].
  • He, S., Y. Zhang, K. Liu, and J. Zhao. 2014. CASIA@V2: A MLN-based Question Answering System over Linked Data. In Working Notes for CLEF 2014 Conference, Sheffield, UK.
  • Jiang, A. Q., A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. 2023. Mistral 7B, October. arXiv:2310.06825 [cs].
  • Kaufmann, E., A. Bernstein, and R. Zumstein. 2006. Querix: A Natural Language Interface to Query Ontologies Based on Clarification Dialogs. In Proceedings of the 5th International Semantic Web Conference, Athens.
  • Li, Y., H. Yang, and H. Jagadish. 2007. NaLIX: A Generic Natural Language Search Environment for XML Data. ACM Trans. Database Syst., 32, November.
  • Liang, S., K. Stockinger, T. M. de Farias, M. Anisimova, and M. Gil. 2021. Querying knowledge graphs in natural language. Journal of Big Data, 8(1):3, January.
  • Perevalov, A., X. Yan, L. Kovriguina, L. Jiang, A. Both, and R. Usbeck. 2022. Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis, January. arXiv:2201.08174 [cs].
  • Popescu, A.-M., O. Etzioni, and H. Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th international conference on Intelligent user interfaces, IUI ’03, pages 149–157, New York, NY, USA. Association for Computing Machinery.
  • Qi, P., Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages, April. arXiv:2003.07082 [cs].
  • Rony, M. R. A. H., U. Kumar, R. Teucher, L. Kovriguina, and J. Lehmann. 2022. SGPT: A Generative Approach for SPARQL Query Generation From Natural Language Questions. IEEE Access, 10:70712–70723.
  • Rozière, B., J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. 2024. Code Llama: Open Foundation Models for Code, January. arXiv:2308.12950 [cs].
  • Soru, T., E. Marx, A. Valdestilhas, D. Esteves, D. Moussallem, and G. Publio. 2018. Neural Machine Translation for Query Construction and Composition, July. arXiv:1806.10478 [cs].
  • Taffa, T. A. and R. Usbeck. 2023. Leveraging LLMs in Scholarly Knowledge Graph Question Answering, November. arXiv:2311.09841 [cs].
  • Trivedi, P., G. Maheshwari, M. Dubey, and J. Lehmann. 2017. LC-QuAD: A Corpus for Complex Question Answering over Knowledge Graphs. In The Semantic Web – ISWC 2017: 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II, pages 210–218, Berlin, Heidelberg, October. Springer-Verlag.
  • Wang, C., M. Xiong, Q. Zhou, and Y. Yu. 2007. PANTO: A Portable Natural Language Interface to Ontologies. In The Semantic Web: Research and Applications, volume 4519 of Lecture Notes in Computer Science, pages 473–487.
  • Yang, S., M. Teng, X. Dong, and F. Bo. 2023. LLM-Based SPARQL Generation with Selected Schema from Large Scale Knowledge Base. In H. Wang, X. Han, M. Liu, G. Cheng, Y. Liu, and N. Zhang, editors, Knowledge Graph and Semantic Computing: Knowledge Graph Empowers Artificial General Intelligence, Communications in Computer and Information Science, pages 304–316, Singapore. Springer Nature.