9JAVQA: MODELS UNDER EXAM
Can the world's best multimodal AI pass a Nigerian exam? We tested them. They failed.
BENCHMARK AT A GLANCE
THE FINDINGS: A CLEAR GAP
State-of-the-art models that ace English benchmarks collapse on African-language content.
GPT-4o on English
GPT-4o achieved over 90% accuracy on English-language exam questions, comfortably above human performance.
GPT-4o on African Languages
The same model dropped below 40% on Yoruba, Igbo, and Hausa, a fall of more than 50 percentage points.
Humans Outperform All
Human participants exceeded 50% accuracy across all three African languages, outperforming every model tested.
Native Prompts Help, Barely
Prompting models in the native language improved results modestly but did not close the gap with human performance.
USE THE BENCHMARK
The 9jaVQA dataset is openly available on Hugging Face. We invite researchers to build on this benchmark and help close the gap in African-language AI.
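If you run your own model on the benchmark, per-language accuracy and the gap to English can be tallied with a few lines of Python. The sketch below is illustrative only: the record fields (`language`, `answer`, `prediction`), the toy records, and the case-insensitive exact-match scoring rule are assumptions made for this example, not the dataset's actual schema or the paper's official metric.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: percent correct} using case-insensitive exact match.

    Each record is a dict with hypothetical keys "language", "answer",
    and "prediction"; adapt these to the real dataset fields.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        lang = rec["language"]
        total[lang] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[lang] += 1
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


def gap_in_points(scores, reference="English"):
    """Return the percentage-point gap between the reference language and each other language."""
    ref = scores[reference]
    return {lang: ref - score for lang, score in scores.items() if lang != reference}


# Toy records for illustration only; these are not benchmark results.
toy_records = [
    {"language": "English", "answer": "B", "prediction": "B"},
    {"language": "English", "answer": "C", "prediction": "c "},
    {"language": "Yoruba", "answer": "A", "prediction": "D"},
    {"language": "Yoruba", "answer": "B", "prediction": "B"},
]

scores = accuracy_by_language(toy_records)
gaps = gap_in_points(scores)
print(scores)
print(gaps)
```

Reporting the gap in percentage points, rather than as a relative change, matches how the findings above are stated.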
@inproceedings{olufemi-etal-2025-challenging,
title = {Challenging Multimodal {LLM}s with African Standardized Exams:
A Document {VQA} Evaluation},
author = {Olufemi, Victor Tolulope and Babatunde, Oreoluwa Boluwatife
and Bolarinwa, Emmanuel and Moshood, Kausar Yetunde},
booktitle = {Proceedings of the 6th Workshop on African Natural Language Processing},
year = {2025},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2025.africanlp-1.22},
}
© 2026 LyngualLabs. Bridging the gap between human language and technology.