OpenAI Evals PR:

https://github.com/openai/evals/pull/1511

By Mohamed Sakher Sawan

May 2024

Introduction

The objective of this evaluation was to assess the comprehension and memorisation capabilities of state-of-the-art Large Language Models (LLMs), specifically GPT-3.5 Turbo, GPT-4 Turbo, and the newly introduced GPT-4o, concerning Quranic texts. This endeavour aims to explore the potential application of LLMs in the domains of religious text interpretation, memorisation, and contextual understanding.

Methodology: OpenAI Evals Framework

Our evaluation leveraged the OpenAI Evals framework, an advanced toolset for conducting detailed assessments of large language models (LLMs) across diverse tasks. OpenAI Evals facilitates the creation of benchmarks for standardised testing and comparative analysis of model performances, ensuring that our methodology adheres to recognised standards of evaluation. More about OpenAI Evals can be found on their GitHub repository.

Evaluation Setup

Utilising OpenAI Evals, we developed a battery of tests designed to evaluate the LMs on four key dimensions of Quranic text comprehension:

  1. Surah (Chapter) Identification: Determining the Surah to which a given verse belongs.
  2. Meccan vs. Madinan Revelation: Identifying the period (Meccan or Madinan) of a verse's revelation.
  3. Quranic Text Recognition: Selecting authentic Quranic text from a set of options.
  4. Fill in the Blank: Completing verses with precisely missing words or phrases.

These tests were crafted to gauge the models' ability to recall specific information, discern context, and apply comprehension skills to religious texts.

Evaluation Results: Model-Guided Judging

Following the integration of model-guided judging for two types of the tests, the first and the last, our evaluation yielded nuanced insights into the capabilities of GPT-3.5 Turbo, GPT-4 Turbo, and the newly introduced GPT-4o. The table below summarises the key findings, highlighting the models' accuracies across the different test types.

Test Type GPT-3.5 Turbo Accuracy GPT-4 Turbo Accuracy GPT-4o Accuracy
Surah Identification 14.55% 81.52% 95.15%
Meccan vs. Madinan 71.82% 83.03% 89.09%
Quranic Text Recognition 66.86% 99.71% 94.86%
Fill in the Blank 20.30% 64.24% 78.48%

The results demonstrate a marked improvement with GPT-4 Turbo and GPT-4o across all categories, underscoring their superior comprehension and memorisation of Quranic texts. GPT-4o, in particular, shows the highest accuracy in Surah Identification and Meccan vs. Madinan, indicating an enhanced understanding of the Quranic context. GPT-3.5 Turbo's performance, while modest, indicates foundational capabilities that could be enhanced with further model refinement.

Untitled