THE YECS CORPUS:
COMMUNITY SOURCED
Empowering Multilingual AI with the Largest Open-Access Yoruba-English Speech Dataset.
DATASET AT:
A GLANCE
OUR METHOD:
DATA FARMING
At LyngualLabs, we don't just extract data; we farm it. Our participatory approach treats the community as active research collaborators.
Culturally-Grounded Prompts
50 bilingual speakers generated 51,532 human-written prompts across 16 domains.
Linguistic Precision
Every prompt was vetted by experts for tonal accuracy, diacritics, and language tagging.
Human-Centric Recording
Using our custom web app, speakers recorded content with specific emotional targets.
Rigorous QA
Every second of audio passed expert vetting for signal clarity and linguistic intelligibility.
BENCHMARKING:
EXCELLENCE
Our results indicate that natural, domain-specific data matters more than model scale.
A fine-tuned Whisper-Small (244M) model achieved a 19.53% WER, outperforming zero-shot models five times its size.
Models trained on synthetic data collapse in real-world settings. YECS provides the authentic co-articulation necessary for robust ASR and LID.
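For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below shows one common way to compute it; the function name and the whitespace tokenization are assumptions made for illustration, not the evaluation code behind the reported figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch: tokenizes on whitespace and applies no text
    normalization, which a real evaluation pipeline would likely add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
score = word_error_rate("a b c d", "a x c d")
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.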
TECHNICAL:
DISTRIBUTION
To ensure the highest research standards, our dataset is partitioned with strict disjointness (no sentence overlap) across splits.
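Anyone preparing their own splits from the corpus can check sentence-level disjointness with a few lines of Python. This is a minimal sketch under stated assumptions: the split names, the in-memory dictionary layout, and the normalization rules (Unicode NFC, case folding, whitespace collapsing) are all hypothetical and do not describe the corpus's actual file format or the authors' tooling.

```python
import unicodedata
from itertools import combinations


def normalize(sentence: str) -> str:
    # NFC makes composed and decomposed diacritics compare equal, which
    # matters for tone-marked Yoruba text. Case folding and whitespace
    # collapsing are additional assumptions of this sketch.
    text = unicodedata.normalize("NFC", sentence).casefold()
    return " ".join(text.split())


def find_overlaps(splits: dict[str, list[str]]) -> dict[tuple[str, str], set[str]]:
    """Return the sentences shared by each pair of splits (empty if disjoint)."""
    sentence_sets = {
        name: {normalize(s) for s in sentences}
        for name, sentences in splits.items()
    }
    overlaps = {}
    for first, second in combinations(sentence_sets, 2):
        shared = sentence_sets[first] & sentence_sets[second]
        if shared:
            overlaps[(first, second)] = shared
    return overlaps


# Hypothetical toy splits: "dev" repeats a training sentence.
example = {
    "train": ["Hello world", "Good morning"],
    "dev": ["good  morning"],
    "test": ["Another one"],
}
leaks = find_overlaps(example)
```

An empty result means no sentence appears in more than one split under the chosen normalization.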
GET INVOLVED:
USE THE DATA
The YECS Corpus is an Open-Access resource. We invite researchers to use this data for Automatic Speech Recognition (ASR), Emotion Recognition, and Multilingual NLP.
Explore on Mozilla
@misc{lynguallabs_yecs_2026,
title = {{YECS}: A 120-Hour Community-Curated Yoruba-English Code-Switching Corpus},
author = {{LyngualLabs}},
year = {2026},
note = {140 speakers; 16 semantic domains; word-level language tags},
howpublished = {\url{https://lynguallabs.org/yecs}},
}
© 2026 LyngualLabs. Bridging the gap between human language and technology.