OpenAI Evals PR:
https://github.com/openai/evals/pull/1511
May 2024
The objective of this evaluation was to assess the comprehension and memorisation capabilities of state-of-the-art Large Language Models (LLMs), specifically GPT-3.5 Turbo, GPT-4 Turbo, and the newly introduced GPT-4o, concerning Quranic texts. This endeavour aims to explore the potential application of LLMs in the domains of religious text interpretation, memorisation, and contextual understanding.
Our evaluation leveraged the OpenAI Evals framework, an advanced toolset for conducting detailed assessments of large language models (LLMs) across diverse tasks. OpenAI Evals facilitates the creation of benchmarks for standardised testing and comparative analysis of model performances, ensuring that our methodology adheres to recognised standards of evaluation. More about OpenAI Evals can be found on their GitHub repository.
Utilising OpenAI Evals, we developed a battery of tests designed to evaluate the LMs on four key dimensions of Quranic text comprehension:
These tests were crafted to gauge the models' ability to recall specific information, discern context, and apply comprehension skills to religious texts.
Following the integration of model-guided judging for two types of the tests, the first and the last, our evaluation yielded nuanced insights into the capabilities of GPT-3.5 Turbo, GPT-4 Turbo, and the newly introduced GPT-4o. The table below summarises the key findings, highlighting the models' accuracies across the different test types.
Test Type | GPT-3.5 Turbo Accuracy | GPT-4 Turbo Accuracy | GPT-4o Accuracy |
---|---|---|---|
Surah Identification | 14.55% | 81.52% | 95.15% |
Meccan vs. Madinan | 71.82% | 83.03% | 89.09% |
Quranic Text Recognition | 66.86% | 99.71% | 94.86% |
Fill in the Blank | 20.30% | 64.24% | 78.48% |
The results demonstrate a marked improvement with GPT-4 Turbo and GPT-4o across all categories, underscoring their superior comprehension and memorisation of Quranic texts. GPT-4o, in particular, shows the highest accuracy in Surah Identification and Meccan vs. Madinan, indicating an enhanced understanding of the Quranic context. GPT-3.5 Turbo's performance, while modest, indicates foundational capabilities that could be enhanced with further model refinement.