# Speech Synthesis Data Preparation Process
Data preparation is the crucial first step in developing high-quality text-to-speech (TTS) systems. This article summarizes the process of turning raw audio into a final training dataset and lists some open-source speech corpora suitable for training speech synthesis models.
## Data Preparation Overview
Training a TTS system requires a large amount of high‑quality, structured speech data. To obtain such a dataset, we need a complete data‑processing pipeline that includes audio normalization, speaker diarization, segmentation, and transcription, among other steps.
## Emilia-Pipe Processing Pipeline
Emilia‑Pipe is a pipeline designed specifically for TTS data preparation, comprising the following key steps:
| Step | Description |
|---|---|
| Normalization | Normalize the audio to ensure consistent volume and quality |
| Source Separation | Process long recordings into pure speech without background music (BGM) |
| Speaker Diarization | Extract medium‑length single‑speaker speech segments |
| Fine Segmentation based on VAD | Split speech into 3‑30 s single‑speaker fragments |
| ASR | Obtain textual transcriptions of the speech fragments |
| Filtering | Quality control to produce the final processed dataset |
The source code of the Emilia preprocessing tools is available on GitHub at `Amphion/preprocessors/Emilia`.
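The stages above can be sketched as a simple function-composition pipeline. Every stage function below is a hypothetical placeholder (the real Emilia-Pipe wraps models such as source separators, diarizers, VAD, and ASR), shown only to illustrate how the stages chain together:

```python
# Minimal sketch of chaining the Emilia-Pipe stages.
# Each stage is a stand-in that records which step has run.

def normalize(audio):
    # Placeholder for loudness/level normalization.
    return {"audio": audio, "stage": "normalized"}

def separate_sources(item):
    # Placeholder for removing BGM and keeping the speech stem.
    item["stage"] = "separated"
    return item

def diarize(item):
    # Placeholder for splitting into single-speaker regions.
    item["stage"] = "diarized"
    return item

def vad_segment(item):
    # Placeholder for cutting into 3-30 s fragments.
    item["stage"] = "segmented"
    return item

def transcribe(item):
    # Placeholder for ASR transcription.
    item["stage"] = "transcribed"
    return item

def filter_quality(item):
    # Placeholder for final quality filtering.
    item["stage"] = "filtered"
    return item

PIPELINE = [normalize, separate_sources, diarize,
            vad_segment, transcribe, filter_quality]

def run_pipeline(raw_audio):
    item = raw_audio
    for stage in PIPELINE:
        item = stage(item)
    return item
```

Running `run_pipeline("raw.wav")` passes the input through all six stages in order, mirroring the table above.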
## Speaker Diarization
Speaker diarization is a key step in TTS data preparation that identifies “who spoke when.” This technology is essential for extracting single‑speaker speech segments from multi‑speaker dialogues, podcasts, and other audio sources.
More detailed information about speaker diarization can be found at Speaker Diarization 3.1
RTTM (Rich Transcription Time Marked) is a commonly used annotation format in speech processing for recording speaker‑turn information. The columns in an RTTM file have the following meanings:
| Column Name | Description |
|---|---|
| Type | Segment type; should always be SPEAKER |
| File ID | File name; the base name of the recording without extension, e.g., rec1_a |
| Channel ID | Channel ID (starting from 1); should always be 1 |
| Turn Onset | Start time of the turn (seconds from the beginning of the recording) |
| Turn Duration | Duration of the turn (seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker identifier; must be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
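As an illustration, here is a minimal pure-Python parser for RTTM lines following the field layout in the table above. The parser itself is a sketch written for this article, not part of any diarization toolkit:

```python
from dataclasses import dataclass

@dataclass
class RttmSegment:
    """One speaker turn from an RTTM file (fields per the table above)."""
    file_id: str
    channel: int
    onset: float      # start time in seconds from the beginning of the recording
    duration: float   # length of the turn in seconds
    speaker: str

def parse_rttm_line(line: str) -> RttmSegment:
    # RTTM columns: Type, File ID, Channel, Onset, Duration,
    # Orthography (<NA>), Speaker Type (<NA>), Speaker Name,
    # Confidence (<NA>), Signal Lookahead Time (<NA>)
    fields = line.split()
    if fields[0] != "SPEAKER":
        raise ValueError(f"unexpected segment type: {fields[0]}")
    return RttmSegment(
        file_id=fields[1],
        channel=int(fields[2]),
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )

seg = parse_rttm_line("SPEAKER rec1_a 1 12.34 5.60 <NA> <NA> spk01 <NA> <NA>")
```

After parsing, `seg.onset` and `seg.duration` give the turn's position in seconds, and `seg.speaker` identifies who is talking.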
## Efficiency in Practice
In production environments, using a GPU can dramatically increase processing efficiency. Tests show that a single A800 GPU can batch‑process roughly 3,000 hours of audio data per day.
## Open-Source Chinese Speech Corpora

The following Chinese open-source speech corpora are suitable for training speech synthesis models:
| Dataset Name | Duration (hours) | Number of Speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |
## Multilingual Open-Source Speech Corpora

The following English and multilingual open-source speech corpora are suitable for training speech synthesis models:
| Dataset Name | Duration (hours) | Number of Speakers | Quality / Sample Rate |
|---|---|---|---|
| LibriTTS‑R | 585 | 2456 | High |
| Hi‑Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60,000+ | 7,000+ | 16 kHz |
| MLS English | 44,500 | 5,490 | 16 kHz |
| MLS German | 1,966 | 176 | 16 kHz |
| MLS Dutch | 1,554 | 40 | 16 kHz |
| MLS French | 1,076 | 142 | 16 kHz |
| MLS Spanish | 917 | 86 | 16 kHz |
| MLS Italian | 247 | 65 | 16 kHz |
| MLS Portuguese | 160 | 42 | 16 kHz |
| MLS Polish | 103 | 11 | 16 kHz |
## Using lhotse to Process Speech Data
lhotse is a data‑management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is manifest‑based data representation:
### Data Representation

- **Audio data**: stored via `RecordingSet`/`Recording`, which contain metadata such as `sources`, `sampling_rate`, `num_samples`, `duration`, and `channel_ids`.
- **Annotation data**: stored via `SupervisionSet`/`SupervisionSegment`, which include `start`, `duration`, `text`, `language`, `speaker`, and `gender`.
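To make the manifest structure concrete, here is a simplified pure-Python stand-in for these two manifest types. These are not the real lhotse classes (which carry more fields and validation), just a sketch of their shape:

```python
from dataclasses import dataclass, field

@dataclass
class Recording:
    """Simplified stand-in for lhotse's Recording manifest entry."""
    id: str
    sources: list              # where the audio lives (paths or URLs)
    sampling_rate: int
    num_samples: int
    channel_ids: list = field(default_factory=lambda: [0])

    @property
    def duration(self) -> float:
        # Duration is derived from sample count and sampling rate.
        return self.num_samples / self.sampling_rate

@dataclass
class SupervisionSegment:
    """Simplified stand-in for lhotse's SupervisionSegment."""
    id: str
    recording_id: str
    start: float
    duration: float
    text: str = ""
    language: str = ""
    speaker: str = ""
    gender: str = ""

# A 10-second, 16 kHz recording with one 3.5 s transcribed segment.
rec = Recording(id="rec1_a", sources=["audio/rec1_a.wav"],
                sampling_rate=16000, num_samples=160000)
sup = SupervisionSegment(id="rec1_a-0", recording_id=rec.id,
                         start=0.0, duration=3.5,
                         text="hello world", speaker="spk01")
```

The key design point mirrored here is that supervisions reference recordings by `recording_id` rather than embedding the audio, so annotation and audio manifests can be stored and manipulated independently.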
### Data Processing Workflow

lhotse uses the concept of a Cut as a view or pointer to an audio segment; the main types are `MonoCut`, `MixedCut`, and `PaddingCut`, collected in a `CutSet`. The processing workflow is as follows:

1. Load manifests as a `CutSet`, which enables equal-length chopping, multi-threaded feature extraction, padding, and generation of a PyTorch `Sampler` and `DataLoader`.
2. Feature extraction supports various extractors, such as PyTorch-based fbank and MFCC, `torchaudio`, `librosa`, etc.
3. Feature normalization supports mean-variance normalization (CMVN) in global, per-sample, and sliding-window variants.
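As an illustration of the normalization step, here is a minimal global CMVN (mean-variance normalization) over a feature matrix, written with the standard library only. A real setup would use lhotse's feature extractors and numpy; this sketch just shows the arithmetic:

```python
import math

def global_cmvn(features):
    """Global CMVN: zero mean, unit variance per feature dimension.

    `features` is a list of frames, each frame a list of floats.
    """
    num_frames = len(features)
    num_dims = len(features[0])
    # Per-dimension mean over all frames.
    means = [sum(f[d] for f in features) / num_frames for d in range(num_dims)]
    # Per-dimension standard deviation (population variance).
    stds = []
    for d in range(num_dims):
        var = sum((f[d] - means[d]) ** 2 for f in features) / num_frames
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance
    return [[(f[d] - means[d]) / stds[d] for d in range(num_dims)]
            for f in features]
```

Per-sample CMVN applies the same computation per utterance, while the sliding-window variant computes the statistics over a moving window of frames instead of globally.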
### Parallel Processing

lhotse supports multi-process parallelism. Example code (here `cuts` is assumed to be an existing `CutSet` loaded from manifests):

```python
from concurrent.futures import ProcessPoolExecutor
from lhotse import CutSet, Fbank, LilcomChunkyWriter

num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage=LilcomChunkyWriter('feats'),
        executor=ex,
    )
```

### PyTorch Integration
lhotse integrates seamlessly with PyTorch:

- A `CutSet` can be used directly as a `Dataset`, supporting noise padding, acoustic context padding, and dynamic batch sizing.
- It provides various samplers such as `SimpleCutSampler`, `BucketingSampler`, and `CutPairsSampler`, which support state restoration and dynamic batch sizing based on total speech duration.
- Batch I/O supports a pre-computation mode (for slow I/O) and an on-the-fly feature extraction mode (for data augmentation).
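The idea behind duration-based dynamic batch sizing can be sketched in a few lines of plain Python. This is a simplified stand-in for what lhotse's samplers do, not their actual implementation:

```python
def batch_by_duration(durations, max_batch_duration):
    """Greedily group utterance durations (seconds) into batches whose
    total length stays at or under max_batch_duration seconds.

    Short utterances are packed many to a batch; long ones get small
    batches, keeping memory use per batch roughly constant.
    """
    batches, current, total = [], [], 0.0
    for dur in durations:
        if current and total + dur > max_batch_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(dur)
        total += dur
    if current:
        batches.append(current)
    return batches
```

For example, `batch_by_duration([5, 10, 12, 3, 20], 20)` yields `[[5, 10], [12, 3], [20]]`: each batch totals at most 20 seconds regardless of how many utterances it contains.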
### Command-Line Tools

lhotse's command-line utilities are very practical, including `combine`, `copy`, `copy-feats`, and many cut operations such as `append`, `decompose`, and `describe`, simplifying the data-processing pipeline:

```shell
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse also provides prepare functions for many open-source datasets, making it easy to download and process these standard speech corpora.
## Conclusion
TTS data preparation is a multi‑step, complex process that involves audio processing, speaker diarization, and speech recognition, among other technical areas. With tools like Emilia‑Pipe and a well‑designed workflow, raw audio can be transformed into high‑quality TTS training datasets, laying the foundation for building natural and fluent speech synthesis systems.
For teams aiming to develop TTS systems, it is advisable to allocate sufficient resources to the data‑preparation stage, as data quality directly determines the final model performance.