Speech synthesis data preparation process
Data preparation is the crucial first step in developing high‑quality text‑to‑speech (TTS) systems. This article briefly summarizes the conversion process from raw audio to the final training dataset, as well as some open‑source speech corpora suitable for training speech synthesis models.
Data Preparation Overview
Training a TTS system requires a large amount of high‑quality, structured speech data. To obtain such a dataset, we need a complete data‑processing pipeline that includes audio normalization, speaker diarization, segmentation, and transcription, among other steps.
Emilia-Pipe Processing Pipeline
Emilia-Pipe is a processing pipeline designed specifically for TTS data preparation, comprising the following key steps:
| Step | Description |
|---|---|
| Normalization | Normalize the audio to ensure consistent volume and quality |
| Source separation | Process long audio into clean speech free of background music (BGM) |
| Speaker diarization | Extract medium-length single-speaker speech data |
| Fine-grained VAD-based segmentation | Split speech into 3–30 s single-speaker segments |
| ASR | Obtain text transcriptions for the speech segments |
| Filtering | Quality control to produce the final processed dataset |
Emilia preprocessing tool source code is available on GitHub: Amphion/preprocessors/Emilia
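As a rough illustration of the VAD-based segmentation step, here is a toy energy-based VAD sketch (real pipelines such as Emilia-Pipe use trained VAD models; the function name, frame size, and threshold below are illustrative assumptions, and the 3–30 s duration constraint is not enforced here):

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=30, threshold=1e-3):
    """Toy VAD: return (start_s, end_s) spans where frame energy exceeds a threshold."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    segments, start = [], None
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        voiced = float(np.mean(frame ** 2)) > threshold
        if voiced and start is None:
            start = i                              # segment opens
        elif not voiced and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None                           # segment closes
    if start is not None:                          # segment runs to end of audio
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```

On a synthetic signal with silence–speech–silence structure, this returns the single voiced span; a production system would additionally merge short gaps and enforce minimum/maximum segment lengths.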
Speaker Diarization
Speaker Diarization is a key step in TTS data preparation, used to identify “who spoke when.” This technology is essential for extracting single‑speaker speech segments from multi‑speaker dialogues, podcasts, and other audio sources.
More detailed information about speaker diarization technology can be found at Speaker Diarization 3.1
RTTM (Rich Transcription Time Marked) is a commonly used annotation format in speech processing for recording speaker‑turn information. The columns of an RTTM file have the following meanings:
| Column Name | Description |
|---|---|
| Type | Segment type; should always be SPEAKER |
| File ID | File name; the base name of the recording (without extension), e.g., rec1_a |
| Channel ID | Channel ID (indexed from 1); should always be 1 |
| Turn Onset | Start time of the turn (seconds from the beginning of the recording) |
| Turn Duration | Duration of the turn (seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker identifier; must be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
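Since RTTM is a whitespace-separated text format, it is straightforward to parse. A minimal sketch (the example line and field selection are illustrative, not from a real corpus):

```python
def parse_rttm_line(line):
    """Parse one RTTM SPEAKER line into a dict, following the column layout above."""
    fields = line.split()
    return {
        "type": fields[0],           # segment type, always SPEAKER
        "file_id": fields[1],        # recording base name
        "channel": int(fields[2]),   # channel ID, always 1
        "onset": float(fields[3]),   # turn start, seconds
        "duration": float(fields[4]),
        "speaker": fields[7],        # speaker identifier
    }

seg = parse_rttm_line("SPEAKER rec1_a 1 5.370 2.250 <NA> <NA> spk01 <NA> <NA>")
```

The placeholder columns carry no information for diarization output, so a parser typically keeps only the file ID, timing, and speaker label.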
Efficiency in Practice
In production environments, using GPUs can dramatically increase processing efficiency. Tests show that a single A800 GPU can process roughly 3,000 hours of audio data per day in batch mode.
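Put differently, that batch throughput corresponds to roughly a 125× real-time factor:

```python
# Back-of-the-envelope check of the reported throughput
audio_hours_per_day = 3000   # audio processed per day on one A800 (figure above)
wall_clock_hours = 24
speedup = audio_hours_per_day / wall_clock_hours
print(speedup)  # 125.0, i.e. ~125x faster than real time
```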
Open-Source Chinese Speech Data
Open‑source Chinese speech corpora suitable for training speech synthesis models.
| Dataset Name | Duration (hours) | Number of Speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |
Multilingual Open-Source Speech Data
Open‑source speech corpora in English and multiple languages suitable for training speech synthesis models.
| Dataset Name | Duration (hours) | Number of Speakers | Quality / Sample Rate |
|---|---|---|---|
| LibriTTS‑R | 585 | 2456 | High |
| Hi‑Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60000+ | 7000+ | 16kHz |
| MLS English | 44500 | 5490 | 16kHz |
| MLS German | 1966 | 176 | 16kHz |
| MLS Dutch | 1554 | 40 | 16kHz |
| MLS French | 1076 | 142 | 16kHz |
| MLS Spanish | 917 | 86 | 16kHz |
| MLS Italian | 247 | 65 | 16kHz |
| MLS Portuguese | 160 | 42 | 16kHz |
| MLS Polish | 103 | 11 | 16kHz |
Processing Speech Data with lhotse
lhotse is a data‑management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is manifest‑based data representation:
Data Representation
- Audio data representation: stored in `RecordingSet`/`Recording`, containing metadata such as `sources`, `sampling_rate`, `num_samples`, `duration`, and `channel_ids`.
- Annotation data representation: stored in `SupervisionSet`/`SupervisionSegment`, containing information such as `start`, `duration`, the transcript text, `language`, `speaker`, and `gender`.
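As a rough illustration (field names follow lhotse's manifest conventions, but the values are invented), a single supervision entry is essentially a JSON-style record, with the segment's end time derived rather than stored:

```python
# Hypothetical supervision entry for a 2.25 s turn starting at 5.37 s
supervision = {
    "id": "rec1_a-0001",
    "recording_id": "rec1_a",
    "start": 5.37,           # seconds into the recording
    "duration": 2.25,        # seconds
    "channel": 0,
    "text": "hello world",
    "language": "English",
    "speaker": "spk01",
}

# End time is computed on demand, not stored in the manifest
end = supervision["start"] + supervision["duration"]
```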
Data Processing Workflow
lhotse uses the concept of a Cut as a view or pointer to an audio segment, primarily including MonoCut, MixedCut, PaddingCut, and CutSet types. The processing workflow is as follows:
- Load manifests as a `CutSet`, which enables equal-length chopping, multi-threaded feature extraction, padding, and generation of a PyTorch `Sampler` and `DataLoader`.
- Feature extraction supports multiple extractors, such as PyTorch-based `fbank` & MFCC, `torchaudio`, `librosa`, etc.
- Feature normalization supports mean-variance normalization (CMVN), global normalization, per-sample normalization, and sliding-window normalization.
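Conceptually, a cut is just a lazy (recording, offset, duration) view over audio. A pure-Python sketch of the idea (these are not lhotse's actual classes, only an illustration of the view semantics):

```python
class MonoCutSketch:
    """Toy illustration of the Cut concept: a lazy view into a recording."""

    def __init__(self, samples, sr, start, duration):
        self.samples = samples    # full recording samples
        self.sr = sr              # sampling rate in Hz
        self.start = start        # offset into the recording, seconds
        self.duration = duration  # length of the view, seconds

    def load_audio(self):
        # Audio is only materialized when explicitly requested
        a = int(self.start * self.sr)
        b = a + int(self.duration * self.sr)
        return self.samples[a:b]

    def truncate(self, duration):
        # Returns a new view; no audio is copied
        return MonoCutSketch(self.samples, self.sr, self.start, duration)

# A 1 s view starting 1 s into a 2 s "recording" at 8 kHz
cut = MonoCutSketch(list(range(16000)), sr=8000, start=1.0, duration=1.0)
```

Because cuts are cheap views, operations like truncation, padding, and mixing can be composed without touching the underlying audio until features are actually computed.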
Parallel Processing
lhotse supports multi‑process parallelism; example code:
```python
from concurrent.futures import ProcessPoolExecutor
from lhotse import CutSet, Fbank, LilcomChunkyWriter

num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage=LilcomChunkyWriter('feats'),
        executor=ex,
    )
```

PyTorch Integration
lhotse integrates seamlessly with PyTorch:
- `CutSet` can be used directly as a `Dataset`, supporting noise padding, acoustic context padding, and dynamic batch sizing.
- Provides various samplers, such as `SimpleCutSampler`, `BucketingSampler`, and `CutPairsSampler`, supporting resumable state and dynamic batch generation based on total speech duration.
- Batch I/O supports a pre-compute mode (for slow I/O) and an on-the-fly feature extraction mode (for data augmentation).
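The duration-based dynamic batching mentioned above can be illustrated with a simplified, standalone sketch (real lhotse samplers are considerably more sophisticated; this only shows the budget idea):

```python
def batches_by_duration(durations, max_duration):
    """Group utterance durations so each batch's total stays within max_duration."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > max_duration:
            batches.append(current)   # budget exceeded: emit the batch
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)       # flush the final partial batch
    return batches
```

Batching by total duration rather than by item count keeps GPU memory usage roughly constant even when utterance lengths vary widely; note that a single utterance longer than the budget still forms its own batch here.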
Command-Line Tools
lhotse’s command‑line utilities are very handy, including combine, copy, copy-feats, and many cut operations such as append, decompose, describe, etc., simplifying the data‑processing pipeline:
```
lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple
```

lhotse also provides prepare functions for many open-source datasets, making it easy to download and process these standard speech corpora.
Summary
TTS data preparation is a multi‑step, complex process that spans audio processing, speaker diarization, and speech recognition. With tools like Emilia‑Pipe and a well‑designed workflow, raw audio can be transformed into high‑quality TTS training datasets, laying a solid foundation for building natural and fluent speech synthesis systems.
For teams aiming to develop TTS systems, it is advisable to allocate sufficient resources to the data‑preparation stage, as data quality directly determines the final model performance.