Data preparation is the crucial first step in developing high‑quality text‑to‑speech (TTS) systems. This article briefly summarizes the conversion process from raw audio to the final training dataset, as well as some open‑source speech corpora suitable for training speech synthesis models.

Overview of Data Preparation

Training a TTS system requires a large amount of high‑quality, structured speech data. To obtain such a dataset, we need a complete data‑processing pipeline that includes audio normalization, speaker diarization, segmentation, and transcription, among other steps.

The Emilia-Pipe Processing Pipeline

Emilia-Pipe is a processing pipeline designed specifically for TTS data preparation, comprising the following key steps:

| Step | Description |
|---|---|
| Normalization | Normalize the audio to ensure consistent volume and quality |
| Source separation | Process long audio into clean speech free of background music (BGM) |
| Speaker diarization | Extract medium-length single-speaker speech data |
| VAD-based fine-grained segmentation | Split speech into 3–30 second single-speaker segments |
| ASR | Obtain text transcriptions of the speech segments |
| Filtering | Apply quality control to produce the final processed dataset |
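The VAD-based segmentation step can be sketched in plain Python. The merge/split policy below (gap threshold, equal splitting of long chunks) is an illustrative assumption, not Emilia-Pipe's actual implementation:

```python
def chunk_vad_segments(segments, min_dur=3.0, max_dur=30.0, max_gap=0.5):
    """Turn raw VAD speech segments into 3-30 s training chunks.

    segments: list of (start, end) tuples in seconds, sorted by start,
              assumed to belong to a single speaker.
    Returns a list of (start, end) chunks with min_dur <= length <= max_dur.
    """
    # 1) Merge segments separated by short pauses (<= max_gap seconds).
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # 2) Split overly long chunks and drop too-short leftovers.
    chunks = []
    for start, end in merged:
        while end - start > max_dur:
            chunks.append((start, start + max_dur))
            start += max_dur
        if end - start >= min_dur:
            chunks.append((start, end))
    return chunks
```

For example, two segments 0.2 s apart are merged, while a 35 s chunk is split into a 30 s piece and a 5 s piece.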

The source code of the Emilia preprocessing tool is available on GitHub at Amphion/preprocessors/Emilia.

Speaker Diarization

Speaker Diarization is a key step in TTS data preparation, used to identify “who spoke when.” This technology is essential for extracting single‑speaker speech segments from multi‑speaker dialogues, podcasts, and other audio sources.

More detailed information about speaker diarization technology can be found at Speaker Diarization 3.1.

RTTM (Rich Transcription Time Marked) is a commonly used annotation format in speech processing for recording speaker‑turn information. The columns of an RTTM file have the following meanings:

| Column Name | Description |
|---|---|
| Type | Segment type; should always be SPEAKER |
| File ID | File name; the base name of the recording (without extension), e.g., rec1_a |
| Channel ID | Channel ID (indexed from 1); should always be 1 |
| Turn Onset | Start time of the turn (seconds from the beginning of the recording) |
| Turn Duration | Duration of the turn (seconds) |
| Orthography Field | Should always be `<NA>` |
| Speaker Type | Should always be `<NA>` |
| Speaker Name | Speaker identifier; must be unique within each file |
| Confidence Score | System confidence score (probability); should always be `<NA>` |
| Signal Lookahead Time | Should always be `<NA>` |
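Since an RTTM SPEAKER line is just ten space-separated fields, it can be parsed with the standard library alone. The field names below are chosen here for illustration:

```python
from typing import NamedTuple

class RttmTurn(NamedTuple):
    file_id: str
    channel: int
    onset: float
    duration: float
    speaker: str

def parse_rttm_line(line: str) -> RttmTurn:
    """Parse one SPEAKER line of an RTTM file into its useful fields."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not a valid RTTM SPEAKER line: {line!r}")
    # fields[5], [6], [8], [9] are the <NA> placeholder columns.
    return RttmTurn(
        file_id=fields[1],
        channel=int(fields[2]),
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )
```

For example, `parse_rttm_line("SPEAKER rec1_a 1 12.34 5.60 <NA> <NA> spk01 <NA> <NA>")` yields a turn for speaker `spk01` starting at 12.34 s.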

Efficiency in Practice

In production environments, using GPUs can dramatically increase processing efficiency. Tests show that a single A800 GPU can process roughly 3,000 hours of audio data per day in batch mode.
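That throughput figure implies the pipeline runs far faster than real time; a quick back-of-the-envelope check (using the numbers quoted above):

```python
# 3,000 hours of audio processed per 24-hour wall-clock day on one A800
audio_hours_per_day = 3000
wall_hours_per_day = 24
speedup = audio_hours_per_day / wall_hours_per_day
print(speedup)  # 125.0x real time

# At that rate, a 200,000-hour corpus would occupy a single GPU for ~67 days,
# which is why large corpora are processed on multi-GPU batches in practice.
days_for_emilia_scale = 200_000 / audio_hours_per_day
print(round(days_for_emilia_scale, 1))  # 66.7
```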

Open-Source Chinese Speech Data

The following open‑source Chinese speech corpora are suitable for training speech synthesis models.

| Dataset Name | Duration (hours) | Number of Speakers | Quality |
|---|---|---|---|
| aidatatang_200zh | 200 | 600 | Medium |
| aishell1 | 180 | 400 | Medium |
| aishell3 | 85 | 218 | Medium |
| primewords | 99 | 296 | Medium |
| thchs30 | 34 | 40 | Medium |
| magicdata | 755 | 1080 | Medium |
| Emilia | 200,000+ | N/A | Low |
| WenetSpeech4TTS | 12,800 | N/A | Low |
| CommonVoice | N/A | N/A | Low |

Multilingual Open-Source Speech Data

The following open‑source corpora in English and other languages are suitable for training speech synthesis models.

| Dataset Name | Duration (hours) | Number of Speakers | Quality |
|---|---|---|---|
| LibriTTS‑R | 585 | 2456 | High |
| Hi‑Fi TTS | 291 | 10 | Very High |
| LibriHeavy | 60,000+ | 7,000+ | 16kHz |
| MLS English | 44,500 | 5490 | 16kHz |
| MLS German | 1966 | 176 | 16kHz |
| MLS Dutch | 1554 | 40 | 16kHz |
| MLS French | 1076 | 142 | 16kHz |
| MLS Spanish | 917 | 86 | 16kHz |
| MLS Italian | 247 | 65 | 16kHz |
| MLS Portuguese | 160 | 42 | 16kHz |
| MLS Polish | 103 | 11 | 16kHz |

Processing Speech Data with lhotse

lhotse is a data‑management framework designed specifically for speech processing, providing a complete workflow for handling audio data. Its core concept is manifest‑based data representation:

Data Representation

  1. Audio data representation: Stored in RecordingSet/Recording, containing metadata such as sources, sampling_rate, num_samples, duration, and channel_ids.

  2. Annotation data representation: Stored in SupervisionSet/SupervisionSegment, containing information such as start, duration, transcript, language, speaker, and gender.
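The two manifest types above boil down to structured records. The sketch below mirrors their fields using only the standard library; the exact on-disk layout varies by lhotse version, so treat the field values and file name as illustrative assumptions:

```python
import json

# Sketch of a recording manifest entry, using the Recording attributes
# listed above (exact serialization may differ across lhotse versions).
recording = {
    "id": "rec1_a",
    "sources": [{"type": "file", "channels": [0], "source": "audio/rec1_a.wav"}],
    "sampling_rate": 16000,
    "num_samples": 160000,   # 10 s at 16 kHz
    "duration": 10.0,
    "channel_ids": [0],
}

# Matching supervision segment: the annotation for seconds 0.5-9.5.
supervision = {
    "id": "rec1_a-seg0",
    "recording_id": "rec1_a",   # links the annotation back to the audio
    "start": 0.5,
    "duration": 9.0,
    "text": "hello world",
    "language": "English",
    "speaker": "spk01",
    "gender": "male",
}

# Manifests are commonly stored one JSON object per line (JSONL).
manifest_line = json.dumps(recording)
```

The key design point is the separation of audio (`Recording`) from annotation (`SupervisionSegment`), joined by `recording_id`, so one recording can carry many supervisions.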

Data-Processing Workflow

lhotse uses the concept of a Cut as a view or pointer to an audio segment, primarily including MonoCut, MixedCut, PaddingCut, and CutSet types. The processing workflow is as follows:

  • Load manifests as a CutSet, which enables equal‑length chopping, multi‑threaded feature extraction, padding, and generation of a PyTorch Sampler and DataLoader.

  • Feature extraction supports multiple extractors, such as PyTorch fbank & MFCC, torchaudio, librosa, etc.

  • Feature normalization supports mean‑variance normalization (CMVN), global normalization, per‑sample normalization, and sliding‑window normalization.
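To make the normalization options concrete, here is what per-utterance mean-variance normalization (CMVN) does, as a pure-Python sketch rather than lhotse's actual implementation:

```python
import math

def cmvn(frames):
    """Per-utterance cepstral mean and variance normalization.

    frames: list of feature vectors (lists of floats), one per time step.
    Returns frames shifted and scaled to zero mean and unit variance
    in each feature dimension.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # Guard against constant dimensions: a zero std becomes 1.0.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
        for d in range(dims)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]
```

Global normalization computes the same statistics over the whole corpus instead of one utterance, and sliding-window normalization computes them over a moving window of frames.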

Parallel Processing

lhotse supports multi‑process parallelism; example code:

from concurrent.futures import ProcessPoolExecutor
from lhotse import CutSet, Fbank, LilcomChunkyWriter

# `cuts` is assumed to be a CutSet loaded earlier, e.g. from manifests.
num_jobs = 8
with ProcessPoolExecutor(num_jobs) as ex:
    cuts: CutSet = cuts.compute_and_store_features(
        extractor=Fbank(),                    # log-Mel filter-bank features
        storage=LilcomChunkyWriter('feats'),  # lilcom-compressed storage under ./feats
        executor=ex)                          # distribute extraction across 8 processes

PyTorch Integration

lhotse integrates seamlessly with PyTorch:

  • CutSet can be used directly as a Dataset, supporting noise padding, acoustic context padding, and dynamic batch sizing.

  • Provides various samplers such as SimpleCutSampler, BucketingSampler, and CutPairsSampler, supporting resume‑state and dynamic batch generation based on total speech duration.

  • Batch I/O supports pre‑compute mode (for slow I/O) and on‑the‑fly feature extraction mode (for data augmentation).
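The idea behind "dynamic batch generation based on total speech duration" is to pack utterances into a batch until a duration budget is hit, instead of using a fixed batch size. The following is a plain-Python sketch of that policy, not lhotse's actual sampler code:

```python
def batch_by_duration(durations, max_duration=30.0):
    """Group utterances into batches whose total duration stays under a cap.

    durations: list of utterance lengths in seconds.
    Returns a list of batches, each a list of indices into `durations`.
    """
    batches, current, total = [], [], 0.0
    for i, dur in enumerate(durations):
        # Close the current batch once adding this utterance would exceed the cap.
        if current and total + dur > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(i)
        total += dur
    if current:
        batches.append(current)
    return batches
```

Duration-based batching keeps GPU memory usage roughly constant: many short clips or a few long ones produce comparably sized padded tensors.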

Command-Line Tools

lhotse’s command‑line utilities are convenient, including combine, copy, copy-feats, and many cut operations (append, decompose, describe, etc.) that simplify the data‑processing pipeline:

lhotse combine
lhotse copy
lhotse copy-feats
lhotse cut append
lhotse cut decompose
lhotse cut describe
lhotse cut export-to-webdataset
lhotse cut mix-by-recording-id
lhotse cut mix-sequential
lhotse cut pad
lhotse cut simple

lhotse provides prepare functions for many open‑source datasets, making it easy to download and process these standard speech corpora.

Summary

TTS data preparation is a multi‑step, complex process that spans audio processing, speaker diarization, and speech recognition. With tools like Emilia‑Pipe and a well‑designed workflow, raw audio can be transformed into high‑quality TTS training datasets, laying a solid foundation for building natural and fluent speech synthesis systems.

For teams aiming to develop TTS systems, it is advisable to allocate sufficient resources to the data‑preparation stage, as data quality directly determines the final model performance.