A Quick Comparison of LitData and WebDataset

I took some time to try out Lightning's new LitData library and ran a simple comparison against WebDataset. LitData is still at an early stage.

Project links

https://github.com/Lightning-AI/litdata
https://github.com/webdataset/webdataset

Versions tested

litdata==0.2.18
webdataset==0.2.90

Processing the SA-1B dataset (10 TB) with LitData

https://lightning.ai/lightning-ai/studios/prepare-large-image-segmentation-datasets-with-litdata

Changing LitData's optimize cache path

The default cache path is /tmp/chunks/. Set an environment variable to change it:

import os  
# Set your desired cache directory  
os.environ["DATA_OPTIMIZER_CACHE_FOLDER"] = "/path/to/your/cache_dir"

Avoid CombinedStreamingDataset

With CombinedStreamingDataset, training deadlocks once one of the constituent datasets is exhausted during reading. StreamingDataset is simpler, but it can still hang during validation.

Comparing LitData and WebDataset

LitData is quite unstable: while training a VQGAN, it fails when switching from training to validation. With 4 A800 GPUs, batch_size=16, and num_workers=12, LitData reached 1.69 it/s.

WebDataset, via resample and with_epoch/with_length, first distributes the shards across the 4 GPUs (processes), then spawns num_workers=12 worker processes to read the data, and finally reshuffles samples in the WebLoader process. With batch_size=16, the number of steps per epoch is dataset_size // (batch_size * world_size); it reached 1.52 it/s, slightly slower than LitData.
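
The epoch-length formula above is just integer arithmetic, and is easy to sanity-check. The dataset size below is a hypothetical placeholder, not the size of the dataset used in the benchmark:

```python
def steps_per_epoch(dataset_size: int, batch_size: int, world_size: int) -> int:
    """Steps per epoch when samples are split evenly across world_size
    processes and grouped into batches of batch_size on each process."""
    return dataset_size // (batch_size * world_size)

# Hypothetical dataset of 1,000,000 samples on 4 GPUs with batch_size=16.
print(steps_per_epoch(1_000_000, 16, 4))  # -> 15625
```

Note that with resampled shards, WebDataset streams indefinitely; with_epoch is what imposes this finite epoch length so the training loop has a defined end of epoch.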

LitData's map helper for parallel processing is more convenient than what WebDataset offers.

WebDataset unbatched IndexError

# webdataset/filters.py:515-521 raises IndexError when the fields of a
# sample have unequal lengths:
def _unbatched(data):  
    """Turn batched data back into unbatched data."""  
    for sample in data:  
        assert isinstance(sample, (tuple, list)), sample  
        assert len(sample) > 0  
        for i in range(len(sample[0])):  
            yield tuple(x[i] for x in sample)

# Patched version: truncate to the shortest field, discarding ragged tails
def _unbatched(data):  
    """Turn batched data back into unbatched data."""  
    for sample in data:  
        assert isinstance(sample, (tuple, list)), sample  
        assert len(sample) > 0  
        min_length = min(len(x) for x in sample)  
        max_length = max(len(x) for x in sample)  
        if max_length != min_length:  
            print(f"{max_length} != {min_length}, some data will be discarded.")  
        for i in range(min_length):  
            yield tuple(x[i] for x in sample)
