# Speech Synthesis Data Preparation Process
Data preparation is the crucial first step in developing high-quality text-to-speech (TTS) systems. This article summarizes the process of turning raw audio into a final training dataset and lists some open-source speech corpora suitable for training speech synthesis models.
## Data Preparation Overview
Training a TTS system requires a large amount of high‑quality, structured speech data. To obtain such a dataset, we need a complete data‑processing pipeline that includes audio normalization, speaker diarization, segmentation, and transcription, among other steps.
## Emilia-Pipe Processing Pipeline
Emilia‑Pipe is a pipeline designed specifically for TTS data preparation, comprising the following key steps:
| Step | Description |
|---|---|
| Normalization | Normalize the audio to ensure consistent volume and quality |
| Source Separation | Process long recordings into pure speech without background music (BGM) |
| Speaker Diarization | Extract medium‑length single‑speaker speech segments |
| Fine Segmentation based on VAD | Split speech into 3‑30 s single‑speaker fragments |
| ASR | Obtain textual transcriptions of the speech fragments |
| Filtering | Quality control to produce the final processed dataset |
The source code of the Emilia preprocessing tools is available on GitHub at `Amphion/preprocessors/Emilia`.
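The stages above can be sketched as a simple function-composition pipeline. Every stage function below is a hypothetical placeholder (the real Emilia-Pipe wraps models such as source separators, diarizers, VAD, and ASR), shown only to illustrate how the stages chain together:

```python
# Minimal sketch of chaining the Emilia-Pipe stages.
# Each stage is a stand-in that records which step has run.

def normalize(audio):
    # Placeholder for loudness/level normalization.
    return {"audio": audio, "stage": "normalized"}

def separate_sources(item):
    # Placeholder for removing BGM and keeping the speech stem.
    item["stage"] = "separated"
    return item

def diarize(item):
    # Placeholder for splitting into single-speaker regions.
    item["stage"] = "diarized"
    return item

def vad_segment(item):
    # Placeholder for cutting into 3-30 s fragments.
    item["stage"] = "segmented"
    return item

def transcribe(item):
    # Placeholder for ASR transcription.
    item["stage"] = "transcribed"
    return item

def filter_quality(item):
    # Placeholder for final quality filtering.
    item["stage"] = "filtered"
    return item

PIPELINE = [normalize, separate_sources, diarize,
            vad_segment, transcribe, filter_quality]

def run_pipeline(raw_audio):
    item = raw_audio
    for stage in PIPELINE:
        item = stage(item)
    return item
```

Running `run_pipeline("raw.wav")` passes the input through all six stages in order, mirroring the table above.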
## Speaker Diarization
Speaker diarization is a key step in TTS data preparation that identifies “who spoke when.” This technology is essential for extracting single‑speaker speech segments from multi‑speaker dialogues, podcasts, and other audio sources.
More detailed information about speaker diarization can be found at Speaker Diarization 3.1
RTTM (Rich Transcription Time Marked) is a commonly used annotation format in speech processing for recording speaker‑turn information. The columns in an RTTM file have the following meanings:
| Column Name | Description |
|---|---|
| Type | Segment type; should always be SPEAKER |
| File ID | File name; the base name of the recording without extension, e.g., rec1_a |
| Channel ID | Channel ID (starting from 1); should always be 1 |
| Turn Onset | Start time of the turn (seconds from the beginning of the recording) |
| Turn Duration | Duration of the turn (seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker identifier; must be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
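As an illustration, here is a minimal pure-Python parser for RTTM lines following the field layout in the table above. The parser itself is a sketch written for this article, not part of any diarization toolkit:

```python
from dataclasses import dataclass

@dataclass
class RttmSegment:
    """One speaker turn from an RTTM file (fields per the table above)."""
    file_id: str
    channel: int
    onset: float      # start time in seconds from the beginning of the recording
    duration: float   # length of the turn in seconds
    speaker: str

def parse_rttm_line(line: str) -> RttmSegment:
    # RTTM columns: Type, File ID, Channel, Onset, Duration,
    # Orthography (<NA>), Speaker Type (<NA>), Speaker Name,
    # Confidence (<NA>), Signal Lookahead Time (<NA>)
    fields = line.split()
    if fields[0] != "SPEAKER":
        raise ValueError(f"unexpected segment type: {fields[0]}")
    return RttmSegment(
        file_id=fields[1],
        channel=int(fields[2]),
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )

seg = parse_rttm_line("SPEAKER rec1_a 1 12.34 5.60 <NA> <NA> spk01 <NA> <NA>")
```

After parsing, `seg.onset` and `seg.duration` give the turn's position in seconds, and `seg.speaker` identifies who is talking.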
## Efficiency in Practice
In production environments, using a GPU can dramatically increase processing efficiency. Tests show that a single A800 GPU can batch‑process roughly 3,000 hours of audio data per day.
## Open-Source Chinese Speech Corpora

The following Chinese open-source speech corpora are suitable for training speech synthesis models:
| Dataset Name | Duration (hours) | Number of Speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |
## Multilingual Open-Source Speech Corpora

The following English and multilingual open-source speech corpora are suitable for training speech synthesis models:
| Dataset Name | Duration (hours) | Number of Speakers | Quality / Sample Rate |
|---|---|---|---|
| LibriTTS‑R | 585 | 2456 | High |
| Hi‑Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60,000+ | 7,000+ | 16 kHz |
| MLS English | 44,500 | 5,490 | 16 kHz |
| MLS German | 1,966 | 176 | 16 kHz |
| MLS Dutch | 1,554 | 40 | 16 kHz |
| MLS French | 1,076 | 142 | 16 kHz |
| MLS Spanish | 917 | 86 | 16 kHz |
| MLS Italian | 247 | 65 | 16 kHz |
| MLS Portuguese | 160 | 42 | 16 kHz |
| MLS Polish | 103 | 11 | 16 kHz |
## Using lhotse to Process Speech Data
lhotse is a data‑management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is manifest‑based data representation:
### Data Representation

- **Audio data**: stored via `RecordingSet`/`Recording`, which contain metadata such as `sources`, `sampling_rate`, `num_samples`, `duration`, and `channel_ids`.
- **Annotation data**: stored via `SupervisionSet`/`SupervisionSegment`, which include `start`, `duration`, `text`, `language`, `speaker`, and `gender`.
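To make the manifest structure concrete, here is a simplified pure-Python stand-in for these two manifest types. These are not the real lhotse classes (which carry more fields and validation), just a sketch of their shape:

```python
from dataclasses import dataclass, field

@dataclass
class Recording:
    """Simplified stand-in for lhotse's Recording manifest entry."""
    id: str
    sources: list              # where the audio lives (paths or URLs)
    sampling_rate: int
    num_samples: int
    channel_ids: list = field(default_factory=lambda: [0])

    @property
    def duration(self) -> float:
        # Duration is derived from sample count and sampling rate.
        return self.num_samples / self.sampling_rate

@dataclass
class SupervisionSegment:
    """Simplified stand-in for lhotse's SupervisionSegment."""
    id: str
    recording_id: str
    start: float
    duration: float
    text: str = ""
    language: str = ""
    speaker: str = ""
    gender: str = ""

# A 10-second, 16 kHz recording with one 3.5 s transcribed segment.
rec = Recording(id="rec1_a", sources=["audio/rec1_a.wav"],
                sampling_rate=16000, num_samples=160000)
sup = SupervisionSegment(id="rec1_a-0", recording_id=rec.id,
                         start=0.0, duration=3.5,
                         text="hello world", speaker="spk01")
```

The key design point mirrored here is that supervisions reference recordings by `recording_id` rather than embedding the audio, so annotation and audio manifests can be stored and manipulated independently.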
### Data Processing Workflow

lhotse uses the concept of a Cut as a view or pointer to an audio segment; the main types are `MonoCut`, `MixedCut`, and `PaddingCut`, collected in a `CutSet`. The processing workflow is as follows:

1. Load manifests as a `CutSet`, which enables equal-length chopping, multi-threaded feature extraction, padding, and generation of a PyTorch `Sampler` and `DataLoader`.
2. Feature extraction supports various extractors, such as PyTorch-based fbank and MFCC, `torchaudio`, `librosa`, etc.
3. Feature normalization supports mean-variance normalization (CMVN) in global, per-sample, and sliding-window variants.
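As an illustration of the normalization step, here is a minimal global CMVN (mean-variance normalization) over a feature matrix, written with the standard library only. A real setup would use lhotse's feature extractors and numpy; this sketch just shows the arithmetic:

```python
import math

def global_cmvn(features):
    """Global CMVN: zero mean, unit variance per feature dimension.

    `features` is a list of frames, each frame a list of floats.
    """
    num_frames = len(features)
    num_dims = len(features[0])
    # Per-dimension mean over all frames.
    means = [sum(f[d] for f in features) / num_frames for d in range(num_dims)]
    # Per-dimension standard deviation (population variance).
    stds = []
    for d in range(num_dims):
        var = sum((f[d] - means[d]) ** 2 for f in features) / num_frames
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance
    return [[(f[d] - means[d]) / stds[d] for d in range(num_dims)]
            for f in features]
```

Per-sample CMVN applies the same computation per utterance, while the sliding-window variant computes the statistics over a moving window of frames instead of globally.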
### Parallel Processing

lhotse supports multi-process parallelism. Example code (here `cuts` is assumed to be an existing `CutSet` loaded from manifests):

```python
from concurrent.futures import ProcessPoolExecutor
from lhotse import CutSet, Fbank, LilcomChunkyWriter

num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage=LilcomChunkyWriter('feats'),
        executor=ex,
    )
```

### PyTorch Integration
lhotse integrates seamlessly with PyTorch:

- A `CutSet` can be used directly as a `Dataset`, supporting noise padding, acoustic context padding, and dynamic batch sizing.
- It provides various samplers such as `SimpleCutSampler`, `BucketingSampler`, and `CutPairsSampler`, which support state restoration and dynamic batch sizing based on total speech duration.
- Batch I/O supports a pre-computation mode (for slow I/O) and an on-the-fly feature extraction mode (for data augmentation).
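The idea behind duration-based dynamic batch sizing can be sketched in a few lines of plain Python. This is a simplified stand-in for what lhotse's samplers do, not their actual implementation:

```python
def batch_by_duration(durations, max_batch_duration):
    """Greedily group utterance durations (seconds) into batches whose
    total length stays at or under max_batch_duration seconds.

    Short utterances are packed many to a batch; long ones get small
    batches, keeping memory use per batch roughly constant.
    """
    batches, current, total = [], [], 0.0
    for dur in durations:
        if current and total + dur > max_batch_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(dur)
        total += dur
    if current:
        batches.append(current)
    return batches
```

For example, `batch_by_duration([5, 10, 12, 3, 20], 20)` yields `[[5, 10], [12, 3], [20]]`: each batch totals at most 20 seconds regardless of how many utterances it contains.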
### Command-Line Tools

lhotse's command-line utilities are very practical, including `combine`, `copy`, `copy-feats`, and many cut operations such as `append`, `decompose`, and `describe`, simplifying the data-processing pipeline:

```shell
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse also provides prepare functions for many open-source datasets, making it easy to download and process these standard speech corpora.
## Conclusion
TTS data preparation is a multi‑step, complex process that involves audio processing, speaker diarization, and speech recognition, among other technical areas. With tools like Emilia‑Pipe and a well‑designed workflow, raw audio can be transformed into high‑quality TTS training datasets, laying the foundation for building natural and fluent speech synthesis systems.
For teams aiming to develop TTS systems, it is advisable to allocate sufficient resources to the data‑preparation stage, as data quality directly determines the final model performance.