Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination

  1. Eva Sánchez Salido
  2. Roser Morante
  3. Julio Gonzalo
  4. Guillermo Marco
  5. Jorge Carrillo de Albornoz
  6. Laura Plaza
  7. Enrique Amigó
  8. Andrés Fernandez García
  9. Alejandro Benito-Santos
  10. Adrián Ghajari Espinosa
  11. Víctor Fresno
Proceedings:
Proceedings of the 31st International Conference on Computational Linguistics
  1. Owen Rambow (ed.)
  2. Leo Wanner (ed.)
  3. Marianna Apidianaki (ed.)
  4. Hend Al-Khalifa (ed.)
  5. Barbara Di Eugenio (ed.)
  6. Steven Schockaert (ed.)

Publisher: Association for Computational Linguistics

Year of publication: 2025

Pages: 6184-6200

Conference: 31st International Conference on Computational Linguistics (COLING 2025)

Type: Conference paper

Abstract

In this article we present UNED-ACCESS 2024, a bilingual dataset consisting of 1003 multiple-choice questions at university entrance level in Spanish and English. The questions were originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Large Language Models on this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting, both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models but also degrade faster in Spanish than in English: the performance gap between the two languages is negligible for the best models but grows up to 37% for smaller models; (ii) the model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU (a similar, but publicly available benchmark), suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS
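
As a rough illustration of the ranking-agreement analysis mentioned in point (ii), the sketch below computes the Pearson correlation between per-model accuracies on two benchmarks. The model scores shown here are hypothetical placeholders, not results from the paper; a value close to 1 indicates that both benchmarks order the models in nearly the same way.

    import numpy as np

    # Hypothetical per-model accuracies (placeholders, not the paper's results),
    # listed in the same model order for both benchmarks.
    uned_access_2024 = np.array([0.81, 0.74, 0.62, 0.55, 0.43])
    mmlu_subset = np.array([0.83, 0.77, 0.65, 0.52, 0.40])

    # Pearson correlation between the two score vectors; values near 1 mean
    # the two benchmarks agree on the relative ordering of the models.
    pearson_r = np.corrcoef(uned_access_2024, mmlu_subset)[0, 1]
    print(f"Pearson correlation: {pearson_r:.2f}")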