PyTorch handles data loading through the Dataset + DataLoader combination: a Dataset defines the data format and the transformations to apply, while a DataLoader reads the data iteratively in batches. This article walks through the PyTorch data-loading workflow.
Reference: 深入浅出PyTorch, used here to systematically consolidate the fundamentals.
We can define our own Dataset class for flexible data loading; the class must inherit from PyTorch's `Dataset`. Three methods matter:

- `__init__`: receives external arguments and sets up the sample collection.
- `__getitem__`: reads one sample at a time, optionally applies transformations, and returns the data needed for training/validation.
- `__len__`: returns the number of samples in the dataset.

```python
import torch
from torchvision import datasets

train_data = datasets.ImageFolder(train_path, transform=data_transform)
val_data = datasets.ImageFolder(val_path, transform=data_transform)
```
Here we use PyTorch's built-in `ImageFolder` class to read images stored in a standard layout: the path points to the image directory, which contains one subdirectory per class, each holding that class's images.
The `data_transform` argument applies transformations to the images, such as flipping and cropping, and can be defined by the user.
Here is another example: the images all live in one folder, and a separate csv file maps each image name to its label. In this case we need to define the Dataset class ourselves:
```python
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data_dir, info_csv, image_list, transform=None):
        """
        Args:
            data_dir: path to image directory.
            info_csv: path to the csv file containing image indexes
                with corresponding labels.
            image_list: path to the txt file containing image names of the
                training/validation set.
            transform: optional transform to be applied on a sample.
        """
        label_info = pd.read_csv(info_csv)
        image_file = open(image_list).readlines()
        self.data_dir = data_dir
        self.image_file = image_file
        self.label_info = label_info
        self.transform = transform

    def __getitem__(self, index):
        """
        Args:
            index: the index of the item
        Returns:
            image and its label
        """
        image_name = self.image_file[index].strip('\n')
        raw_label = self.label_info.loc[self.label_info['Image_index'] == image_name]
        label = raw_label.iloc[:, 0]
        image_name = os.path.join(self.data_dir, image_name)
        image = Image.open(image_name).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, label

    def __len__(self):
        return len(self.image_file)
```
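To make the `__getitem__`/`__len__` protocol concrete without any files on disk, here is a self-contained toy Dataset over in-memory data (the class name and data are invented for illustration): indexing `ds[i]` dispatches to `__getitem__`, and `len(ds)` to `__len__`:

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        x = torch.tensor(float(index))
        return x, x ** 2

    def __len__(self):
        return self.n


ds = SquaresDataset(5)
x, y = ds[3]          # calls __getitem__(3) -> (3.0, 9.0)
print(len(ds), float(y))
```

Any object implementing these two methods can be handed straight to a DataLoader.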
Once the Dataset is built, a DataLoader can read the data in batches, for example:
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_data, batch_size=batch_size, num_workers=4,
                          shuffle=True, drop_last=True)
val_loader = DataLoader(val_data, batch_size=batch_size, num_workers=4,
                        shuffle=False)
```
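A quick self-contained illustration of the batching behaviour (the synthetic data here is made up for the demo): with 10 samples, `batch_size=4`, and `drop_last=True`, the loader yields 2 full batches and silently drops the remaining 2 samples:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=4, shuffle=False, drop_last=True)
batches = list(loader)
print(len(batches))         # 2 full batches; the last 2 samples are dropped
print(batches[0][0].shape)  # torch.Size([4, 1])
```

With `drop_last=False` (the default) a third, smaller batch of 2 samples would be produced instead.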
DataLoader takes many parameters and supports quite powerful data pipelines; the PyTorch 2 documentation gives the signature as:
```python
torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None,
                            batch_sampler=None, num_workers=0, collate_fn=None,
                            pin_memory=False, drop_last=False, timeout=0,
                            worker_init_fn=None, multiprocessing_context=None,
                            generator=None, *, prefetch_factor=None,
                            persistent_workers=False, pin_memory_device='')
```
Parameters:

- `dataset` (Dataset) – dataset from which to load the data.
- `batch_size` (int, optional) – how many samples per batch to load (default: `1`).
- `shuffle` (bool, optional) – set to `True` to have the data reshuffled at every epoch (default: `False`).
- `sampler` (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any `Iterable` with `__len__` implemented. If specified, `shuffle` must not be specified.
- `batch_sampler` (Sampler or Iterable, optional) – like `sampler`, but returns a batch of indices at a time. Mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`.
- `num_workers` (int, optional) – how many subprocesses to use for data loading. `0` means that the data will be loaded in the main process. (default: `0`)
- `collate_fn` (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when batched loading is done from a map-style dataset.
- `pin_memory` (bool, optional) – if `True`, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your `collate_fn` returns a batch that is a custom type, see the example in the PyTorch docs.
- `drop_last` (bool, optional) – set to `True` to drop the last incomplete batch if the dataset size is not divisible by the batch size. If `False` and the size of the dataset is not divisible by the batch size, the last batch will be smaller. (default: `False`)
- `timeout` (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: `0`)
- `worker_init_fn` (Callable, optional) – if not `None`, this will be called on each worker subprocess with the worker id (an int in `[0, num_workers - 1]`) as input, after seeding and before data loading. (default: `None`)
- `generator` (torch.Generator, optional) – if not `None`, this RNG will be used by `RandomSampler` to generate random indexes and by multiprocessing to generate a `base_seed` for workers. (default: `None`)
- `prefetch_factor` (int, optional, keyword-only) – number of batches loaded in advance by each worker. `2` means there will be a total of `2 * num_workers` batches prefetched across all workers. (the default depends on `num_workers`: `None` if `num_workers=0`, otherwise `2`)
- `persistent_workers` (bool, optional) – if `True`, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' `Dataset` instances alive. (default: `False`)
- `pin_memory_device` (str, optional) – the device to pin memory to if `pin_memory` is `True`.
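As an example of one of these hooks, here is a small `collate_fn` sketch that pads variable-length sequences into a single batch tensor; the toy data and the `pad_collate` helper are invented for illustration:

```python
import torch
from torch.utils.data import DataLoader

# Toy map-style dataset: variable-length sequences with integer labels.
data = [([1], 0), ([2, 3], 1), ([4, 5, 6], 0)]


def pad_collate(batch):
    """Zero-pad each sequence to the length of the longest one in the batch."""
    seqs, labels = zip(*batch)
    max_len = max(len(s) for s in seqs)
    padded = torch.zeros(len(seqs), max_len, dtype=torch.long)
    for i, s in enumerate(seqs):
        padded[i, : len(s)] = torch.tensor(s)
    return padded, torch.tensor(labels)


loader = DataLoader(data, batch_size=3, collate_fn=pad_collate)
padded, labels = next(iter(loader))
print(padded.shape)  # torch.Size([3, 3])
```

Without a custom `collate_fn`, the default collation would fail here, since it cannot stack sequences of different lengths into one tensor.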