THE YECS CORPUS:
COMMUNITY SOURCED
Empowering Multilingual AI with the Largest Open-Access Yoruba-English Speech Dataset.
DATASET AT:
A GLANCE
OUR METHOD:
DATA FARMING
At LyngualLabs, we don't just extract data; we farm it. Our participatory approach treats the community as active research collaborators.
Culturally-Grounded Prompts
50 bilingual speakers generated 51,532 human-written prompts across 16 domains.
Linguistic Precision
Every prompt was vetted by experts for tonal accuracy, diacritics, and language tagging.
Human-Centric Recording
Using our custom web app, speakers recorded content with specific emotional targets.
Rigorous QA
Every second of audio passed expert vetting for signal clarity and linguistic intelligibility.
BENCHMARKING:
EXCELLENCE
Our research shows that natural, domain-specific data matters more than model scale.
A fine-tuned Whisper-Small (244M) model achieved a 19.53% WER, outperforming zero-shot models five times its size.
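For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation code behind the 19.53% figure. The function name `word_error_rate` and the example sentences are hypothetical, and text normalization (casing, punctuation, diacritics) is left out because the page does not say how transcripts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Hypothetical helper for illustration; whitespace tokenization only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("gate" -> "gates")
# against a five-word reference.
example_wer = word_error_rate(
    "please open the market gate",
    "please open market gates",
)
```

Because insertions count as errors, WER can exceed 100% when a model outputs far more words than the reference contains.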
Models trained on synthetic data collapse in real-world settings. YECS provides the authentic co-articulation necessary for robust ASR and LID.
TECHNICAL:
DISTRIBUTION
To ensure the highest research standards, our dataset is partitioned with strict disjointness (no sentence overlap) across splits.
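One common way to get sentence-level disjointness is to assign each unique sentence, rather than each recording, to a split, so every recording of the same prompt lands in the same partition. The sketch below illustrates that idea under stated assumptions: the record field `sentence`, the 80/10/10 proportions, and the helper names are hypothetical, and the page does not describe how the YECS splits were actually produced.

```python
import hashlib
import unicodedata


def sentence_key(sentence: str) -> str:
    """Normalize a sentence so trivially different copies share one key."""
    # NFC normalization keeps precomposed and combining diacritics comparable.
    text = unicodedata.normalize("NFC", sentence)
    return " ".join(text.lower().split())


def assign_split(sentence: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Map a sentence to train/dev/test deterministically from its hash."""
    digest = hashlib.sha256(sentence_key(sentence).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_corpus(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by the split of their sentence (assumed field name)."""
    splits: dict[str, list[dict]] = {"train": [], "dev": [], "test": []}
    for record in records:
        splits[assign_split(record["sentence"])].append(record)
    return splits


def find_overlap(splits: dict[str, list[dict]]) -> set[str]:
    """Return normalized sentences that appear in more than one split."""
    seen: dict[str, str] = {}
    overlap: set[str] = set()
    for name, records in splits.items():
        for record in records:
            key = sentence_key(record["sentence"])
            if seen.setdefault(key, name) != name:
                overlap.add(key)
    return overlap


# Hypothetical records: the same prompt read by two speakers.
records = [
    {"speaker": "s01", "sentence": "Please open the market gate"},
    {"speaker": "s02", "sentence": "please  open the market gate"},
    {"speaker": "s01", "sentence": "The meeting starts at noon"},
]
splits = split_corpus(records)
```

Hashing the normalized text makes the assignment reproducible without storing a lookup table, at the cost of split sizes that only approximate the target percentages.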
GET INVOLVED:
USE THE DATA
The YECS Corpus is an open-access resource. We invite researchers to use it for Automatic Speech Recognition (ASR), Emotion Recognition, and Multilingual NLP.
Explore on Mozilla
“YECS: A 120-Hour Community-Curated Yoruba-English Code-Switching Corpus.” (2026).
© 2026 LyngualLabs. Bridging the gap between human language and technology.