YECS Corpus

THE YECS CORPUS:
COMMUNITY SOURCED

Empowering Multilingual AI with the Largest Open-Access Yoruba-English Speech Dataset.

The Yoruba-English Code-Switching (YECS) Corpus is a landmark 120-hour speech dataset developed to solve the "language gap" in Sub-Saharan African AI.

While most AI models struggle with the fluid transitions of bilingual speakers, YECS captures the authentic tonal and prosodic interactions of natural conversation.

DATASET AT:
A GLANCE

Total Volume

120 Hours of validated naturalistic speech

Utterances

99,930 unique segments

Complexity

High-density switching (34.22 CMI)

Diversity

140 Speakers (95 Female, 45 Male)

Annotation

Word-level Language ID & 7 Emotion Categories

Integrity

Professional signal quality (91.6 dB mean SNR)
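
The 34.22 figure above is a Code-Mixing Index (CMI) score. As a rough illustration of how word-level language tags turn into such a score, the sketch below assumes the utterance-level formulation by Das and Gambäck: CMI = 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. This page does not state which CMI variant or aggregation was used for YECS, and the tag names (`yo`, `en`, `other`) and the function name are hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral=frozenset({"other"})):
    """Utterance-level CMI in [0, 100] from word-level language tags.

    Assumed formula (Das & Gambäck style): 100 * (1 - max(w_i) / (n - u)).
    `neutral` holds tags treated as language-independent (hypothetical name).
    """
    # Count only tokens that carry a language label; this total is (n - u).
    counts = Counter(tag for tag in tags if tag not in neutral)
    language_tokens = sum(counts.values())
    if language_tokens == 0:
        # No language-bearing tokens: the index is defined as 0.
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / language_tokens)


# Illustrative tag sequence, not taken from the corpus.
example_tags = ["yo", "yo", "en", "other", "en", "yo"]
example_cmi = code_mixing_index(example_tags)
```

A monolingual utterance scores 0 under this formula, and an even two-language split scores 50, so a higher value means denser switching.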

OUR METHOD:
DATA FARMING

At LyngualLabs, we don't just extract data; we farm it. Our participatory approach treats the community as active research collaborators.

Culturally-Grounded Prompts

50 bilingual speakers generated 51,532 human-written prompts across 16 domains.

Linguistic Precision

Every prompt was vetted by experts for tonal accuracy, diacritics, and language tagging.

Human-Centric Recording

Using our custom web app, speakers recorded content with specific emotional targets.

Rigorous QA

Every second of audio passed expert vetting for signal clarity and linguistic intelligibility.

BENCHMARKING:
EXCELLENCE

Our research indicates that natural, domain-specific data matters more than model scale.

Small Model, Big Impact

A fine-tuned Whisper-Small (244M) model achieved a 19.53% WER, outperforming zero-shot models five times its size.

The "Natural" Advantage

Models trained on synthetic data collapse in real-world settings. YECS provides the authentic co-articulation necessary for robust ASR and LID.
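
For readers less familiar with the metric, the WER quoted above is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal version of that standard definition; this page does not describe the text normalization or scoring tool used for the YECS benchmark, so the function name and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair, not taken from the corpus: one word is dropped.
sample_wer = word_error_rate("mo fe go to the market", "mo fe go to market")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.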

View GitHub

TECHNICAL:
DISTRIBUTION

To ensure the highest research standards, our dataset is partitioned with strict disjointness (no sentence overlap) across splits.

Training

80,015 utterances, 95.57 hours

Validation

9,966 utterances, 11.96 hours

Testing

9,949 utterances, 11.83 hours
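
One way to sanity-check the split figures and the disjointness claim above is sketched below. The totals come from this page; the `overlapping_sentences` helper, the normalization it applies (Unicode NFC, case folding, whitespace collapsing), and the split names are assumptions for illustration, not the procedure used to build YECS.

```python
import unicodedata
from itertools import combinations

# Split sizes as reported on this page: (utterances, hours).
REPORTED_SPLITS = {
    "train": (80_015, 95.57),
    "validation": (9_966, 11.96),
    "test": (9_949, 11.83),
}

total_utterances = sum(utts for utts, _ in REPORTED_SPLITS.values())
total_hours = round(sum(hours for _, hours in REPORTED_SPLITS.values()), 2)


def normalize_sentence(text: str) -> str:
    """Assumed normalization: NFC (so composed and decomposed diacritics
    compare equal), case folding, and collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def overlapping_sentences(splits):
    """Return {(split_a, split_b): shared sentences} for pairs that overlap."""
    normalized = {
        name: {normalize_sentence(s) for s in sentences}
        for name, sentences in splits.items()
    }
    overlaps = {}
    for a, b in combinations(normalized, 2):
        shared = normalized[a] & normalized[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result from `overlapping_sentences` means no sentence appears in more than one split under the assumed normalization. The reported split sizes add up to 99,930 utterances and about 119.36 hours, in line with the headline figures.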

GET INVOLVED:
USE THE DATA

The YECS Corpus is an Open-Access resource. We invite researchers to use this data for Automatic Speech Recognition (ASR), Emotion Recognition, and Multilingual NLP.

Explore on Mozilla

Cite Our Work

“YECS: A 120-Hour Community-Curated Yoruba-English Code-Switching Corpus.” (2026).
