【PyTorch】ImageNet数据集的使用和miniImageNet的构建

阿里云国内75折回扣微信号：monov8

阿里云国际，腾讯云国际，低至75折。AWS 93折免费开户实名账号代冲值优惠多多微信号：monov8 飞机：@monov6

【PyTorch】ImageNet的使用和miniImageNet的构建

1. ImageNet下载和简介
2. miniImageNet
- 2.1 miniImageNet的划分
3. 使用ImageFolder构建数据集类

1. ImageNet下载和简介

ImageNet是由斯坦福大学等机构从2007年着手开始组件的大型计算机视觉数据集。自从2009年发布以来已经成为了计算机视觉领域广泛用于指标评价的数据集。直到目前该数据集有超过1400万张图像是深度学习领域中图像分类、检测、定位的最常用数据集之一。

“ImageNet大型视觉识别任务”即ImageNet Large-Scale Visual Recognition Challenge是基于ImageNet的一项比赛。使用的数据集是ImageNet的子集。官网提供下ILSVRC2011~ILSVRC2017的数据集。比较常用的是ILSVRC2012。

该数据集拥有1000个类每个分类约有1000张图片。其中约120万张作为训练集5万张作为验证集10万张作为测试集无标签。

1.1 下载地址

ImageNet的官方地址https://image-net.org/。目前ImageNet数据集已经不面向公众开放想要下载数据集必须使用.edu的教育邮箱注册认证。在本人下载数据集的过程中发现官网的下载速度只有300K/s左右实在是太慢了。可以使用迅雷等torrent下载器使用以下种子链接下载

验证集
http://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent
训练集
http://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent

这里也给大家安利一个网站https://academictorrents.com/。这个网站收录了大部分常用的数据集的下载链接。

1.2 初步处理

下载完成后得到两个文件ILSVRC2012_img_train.tar和ILSVRC2012_img_val.tar。在这两个文件的目录上进入git bash运行命令

md5sum ILSVRC2012_img_train.tar
md5sum ILSVRC2012_img_val.tar

得到两个文件的MD5校验码如果下载的数据集完整、正确得到的校验码应当和官网给出的一样

训练集1d675b47d978889d74fa0da5fadfb00e
验证集29b22e2961454d5413ddabcf34fc5622

接下来解压数据集其中phase参数代表训练集train还是验证集val。

import argparser
args = parser.parse_args()

parser.add_argument('--tar_dir', type=str)
def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must be not None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.mkdir(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir

解压后得到的文件结构如下

训练集
|-ImagNet
	|-train
		|-class0.tar
		|-class1.tar
		|-...
	|-val
		|-img1.JPEG
		|-...

1.3 devkit介绍

除了数据集官网还给出了一个ILSVRC2012_devkit_t12。这里面包含了数据集的一些信息。

其中.\ILSVRC2012_devkit_t12\data\ILSVRC2012_validation_ground_truth.txt给出了验证集样本的对应标签。例如第一个验证样本的标签是490第二个是361…
在这里插入图片描述
除此之外该目录下还有一个meta文件。该文件记录了数据集的分类信息。
meta中的synset字段含有ILSVRC2012_IDWIND等信息。WIND即对应了每一类文件夹的名称是类的名字。ILSVRC2012_ID是该类在ILSVRC2012C数据集中的类别标识验证集的标签采用这种标识。其中ILSVRC2012_ID<=1000的为类别标识大于等于1000的则包含了超过一个类把相近的类放在了一起。这构成了一个树的结构。

2. miniImageNet

miniImageNet是ILSCVRC2012-img-train的子集。最早由Vinyals等人在 Matching Networks for One Shot Learning一文中提出用于少样本学习任务不过由于其在最一开始未公布划分方法Ravi等人在Optimization as a model for few-shot learning一文中使用了独自的划分方法这也是目前常用的miniImageNet的划分之一。

2.1 miniImageNet的划分

在Ravi的划分方法https://github.com/twitter-research/meta-learning-lstm/tree/master/data/miniImagenet中从ILSVRC2012C中随机选择了100个类。其中64个类作为trainset16个类作为valset20个类作为testset并将每个图像缩放为84×84的大小。本文也采用这种划分方法。

在Ravi的csv问价中表处理filename和对应的label名称。但注意这里给出的filename并不和原始的ILSVR2012C的数据集中的名称一一对应在label之后给出的序号是该样本在该类当中的排序位置从1开始。

下面给出构建miniImageNet的python实现源自https://github.com/yaoyao-liu/mini-imagenet-tools

import argparse
import os
import numpy as np
import pandas as pd
import glob
import cv2
from shutil import copyfile
from tqdm import tqdm

# argument parser
parser = argparse.ArgumentParser(description='')
parser.add_argument('--tar_dir', type=str)
parser.add_argument('--phase', type=str, choices=['train', 'val'])
parser.add_argument('--imagenet_dir', type=str)
parser.add_argument('--miniImageNet_dir', type=str)
parser.add_argument('--split_filepath', typr=str)
parser.add_argument('--image_resize', type=int, default=84)

args = parser.parse_args()


def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must be not None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.mkdir(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir


class MiniImageNetGenerator(object):
    def __init__(self, input_args):
        self.processed_img_dir = './miniImageNet'
        self.mini_keys = None
        self.input_args = input_args
        self.imagenet_dir = input_args.imagenet_dir
        self.raw_mini_dir = './miniImageNet_raw'
        self.csv_paths = input_args.split_filepath
        if not os.path.exists(self.raw_mini_dir):
            os.mkdir(self.raw_mini_dir)
        self.image_resize = self.input_args.image_resize

    def untar_mini(self):
        self.mini_keys = ['n02110341', 'n01930112', 'n04509417', 'n04067472', 'n04515003', 'n02120079', 'n03924679',
                          'n02687172', 'n03075370', 'n07747607', 'n09246464', 'n02457408', 'n04418357', 'n03535780',
                          'n04435653', 'n03207743', 'n04251144', 'n03062245', 'n02174001', 'n07613480', 'n03998194',
                          'n02074367', 'n04146614', 'n04243546', 'n03854065', 'n03838899', 'n02871525', 'n03544143',
                          'n02108089', 'n13133613', 'n03676483', 'n03337140', 'n03272010', 'n01770081', 'n09256479',
                          'n02091244', 'n02116738', 'n04275548', 'n03773504', 'n02606052', 'n03146219', 'n04149813',
                          'n07697537', 'n02823428', 'n02089867', 'n03017168', 'n01704323', 'n01532829', 'n03047690',
                          'n03775546', 'n01843383', 'n02971356', 'n13054560', 'n02108551', 'n02101006', 'n03417042',
                          'n04612504', 'n01558993', 'n04522168', 'n02795169', 'n06794110', 'n01855672', 'n04258138',
                          'n02110063', 'n07584110', 'n02091831', 'n03584254', 'n03888605', 'n02113712', 'n03980874',
                          'n02219486', 'n02138441', 'n02165456', 'n02108915', 'n03770439', 'n01981276', 'n03220513',
                          'n02099601', 'n02747177', 'n01749939', 'n03476684', 'n02105505', 'n02950826', 'n04389033',
                          'n03347037', 'n02966193', 'n03127925', 'n03400231', 'n04296562', 'n03527444', 'n04443257',
                          'n02443484', 'n02114548', 'n04604644', 'n01910747', 'n04596742', 'n02111277', 'n03908618',
                          'n02129165', 'n02981792']

        for idx, keys in enumerate(self.mini_keys):
            print('Untarring ' + keys)
            os.system('tar xvf ' + self.imagenet_dir + '/' + keys + '.tar -C ' + self.raw_mini_dir)
        print('All the tar files are untarred')

    def process_original_files(self):
        split_lists = ['train', 'val', 'test']

        if not os.path.exists(self.processed_img_dir):
            os.makedirs(self.processed_img_dir)

        for this_split in split_lists:
            filename = os.path.join(self.csv_paths, this_split + '.csv')
            this_split_dir = self.processed_img_dir + '/' + this_split
            if not os.path.exists(this_split_dir):
                os.makedirs(this_split_dir)
            with open(filename) as csvfile:
                csv = pd.read_csv(csvfile, delimiter=',')
                images = {}
                print('Reading IDs....')

                for row in csv.values:
                    if row[1] in images.keys():
                        images[row[1]].append(row[0])
                    else:
                        images[row[1]] = [row[0]]

                print('Writing photos....')
                for cls in tqdm(images.keys()):
                    this_cls_dir = this_split_dir + '/' + cls
                    if not os.path.exists(this_cls_dir):
                        os.makedirs(this_cls_dir)
                    # find files which name matches '.../...cls...'
                    lst_files = glob.glob(self.raw_mini_dir + "/*" + cls + "*")
                    # sort file names, get index
                    lst_index = [int(i[i.rfind('_') + 1:i.rfind('.')]) for i in lst_files]
                    index_sorted = np.argsort(np.array(lst_index))
                    # get file names in miniImageNet, the name in csv indicates the file index in miniImageNet class
                    index_selected = [int(i[i.index('.') - 4:i.index('.')]) for i in images[cls]]
                    # note that names in csv begin from 1 not 0, get selected images indexes
                    selected_images = index_sorted[np.array(index_selected) - 1]
                    for i in np.arange(len(selected_images)):
                        if self.image_resize == 0:
                            copyfile(lst_files[selected_images[i]], os.path.join(this_cls_dir, images[cls][i]))
                        else:
                            im = cv2.imread(lst_files[selected_images[i]])
                            im_resized = cv2.resize(im, (self.image_resize, self.image_resize),
                                                    interpolation=cv2.INTER_AREA)
                            cv2.imwrite(os.path.join(this_cls_dir, images[cls][i]), im_resized)


if __name__ == "__main__":
    dataset_generator = MiniImageNetGenerator(args)
    dataset_generator.untar_mini()
    dataset_generator.process_original_files()

在执行后miniImageNet_raw文件夹下存储着未处理、未分类的样本。miniImageNet文件夹的结构如下

|-miniImageNet
	|-train
		|-class1
			|-img1.jpg
			|-...
		|-...
	|-val
	|-test

其中val、test的文件夹结构和train一样。

3. 使用ImageFolder构建数据集类

PyTorch提供非常方便的类ImageFolder用于图像数据集的构建。

dataset = ImageFolder(root='./miniImageNet/train')

但是这样构造的数据集样本的标签是根据文件夹的顺序而来的。例如在第一个文件夹的样本都会被标志为0类。如果我们想要让样本标签和数据集的标签对应那么需要重写一些方法。

3.1 重写DataFolder中的方法

ImageFolder是DataFolder的子类在DataFolder中提供了两个可供重写的方法
find_classes和make_dataset。其中find_classes需要返回类别名称和类别名称list与标签之间的映射dict。make_dataset则根据find_classes返回的参数构建数据集。

重写如下

meta_dir = os.path.join(os.getcwd(), 'meta_info')
if not os.path.exists(meta_dir):
    os.makedirs(meta_dir)

meta_info_path = os.path.join(meta_dir, "meta_info.npy")
if not os.path.exists(meta_info_path):
    meta = loadmat('./ILSVRC2012_devkit_t12/data/meta.mat')
    meta = meta.get('synsets')
    meta = meta.reshape(1860)
    meta_id = [[i[0].item(), i[1].item()] for i in meta]
    meta_info = np.array(meta_id[:1000])
    np.save(meta_info_path, meta_info, allow_pickle=True)
else:
    meta_info = np.load(meta_info_path, allow_pickle=True)


class MiniImageNetFolder(ImageFolder):
    """
    Generator miniImageNet Dataset. This is a subclass of ImageFolder <- DataFolder
    Overwrite method find_class() and make_dataset() to let the label match ILSVRC2012-ID

    -----------------------------------------------------------------------------------------
    Parameters:
      root: root directory of image dataset, should have structure of
        root/dog/xxx.png
        root/dog/xxy.png
        root/dog/[...]/xxz.png

        root/cat/123.png
        root/cat/nsdf3.png
        root/cat/[...]/asd932_.pn

      phase: the validation phase of model, takes value from {"train", "validation", "test"}
    """

    def __init__(self, root, phase="train", transformer=None):

        self.meta_info = meta_info
        self.phase = phase
        super(MiniImageNetFolder, self).__init__(root=root, transform=transformer)

    def find_classes(self, directory):
        dic = {}
        names = np.unique(np.array(pd.read_csv('./split_csv/miniImageNet/' + self.phase + '.csv')['label']))
        for i in self.meta_info:
            if i[1] in names:
                dic[i[1]] = int(i[0])
        return list(names), dic

其中meta_info存储了ILSVRC2012中所有类和label的对应关系。

3.2 BatchSampler实现episode采样

在少样本学习中经常采用的一种方法是Episode Training。该方法将每次训练划分为K个不同的任务每个任务代表了对一个类别的少样本学习包含数个support样本和query样本。support样本用于模型的训练query样本用于评估模型的性能。通常使用K-way N-shot来表示episode训练的模式N就是support样本的数量。如果我们要向实现一个5-way 5-shot的episode训练假设query样本数量为1那么每次需要从训练集中采样5×(5+1)=30个样本。

采样策略可以通过BatchSampler来实现

class PrototypicalBatchSampler(object):
    """
    Adopted from
    https://github.com/orobix/Prototypical-Networks-for-Few-shot-Learning-PyTorch/blob/master/src/prototypical_batch_sampler.py

    Yield a batch of indexes at each iteration.
    Indexes are calculated by keeping in account 'classes_per_it' and 'num_samples',
    Each iteration the batch indexes will refer to  'num_support' + 'num_query' samples
    for 'classes_per_it' random classes.

    __len__ returns the number of episodes per epoch (same as 'self.iterations').

    ----------------------------------------------------------------------------------------------
    Parameters:
        labels: ndarray, all labels for current dataset
        classes_per_episode: int, number of classes in one episode
        sample_per_class: int, numer of sample in one class
        iterations: int, number of episodes in one epoch
    """

    def __init__(self, labels, classes_per_episode, sample_per_class, iterations, dataset_name="miniImageNet_train"):
        """
        Initialize the PrototypicalBatchSampler object
        Args:
        - labels: an iterable containing all the labels for the current dataset
        samples indexes will be inferred from this iterable.
        - classes_per_it: number of random classes for each iteration
        - num_samples: number of samples for each iteration for each class (support + query)
        - iterations: number of iterations (episodes) per epoch
        """
        super(PrototypicalBatchSampler, self).__init__()
        self.labels = labels
        self.classes_per_it = classes_per_episode
        self.sample_per_class = sample_per_class
        self.iterations = iterations
        self.dataset_name = dataset_name

        # 该函数是去除数组中的重复数字并进行排序之后输出
        self.classes, self.counts = np.unique(self.labels, return_counts=True)
        self.classes = torch.LongTensor(self.classes)

        # create a matrix, indexes, of dim: classes X max(elements per class)
        # fill it with nans
        # for every class c, fill the relative row with the indices samples belonging to c
        # in numel_per_class we store the number of samples for each class/row
        indexes_path = os.path.join(os.getcwd() + '\episode_idx', self.dataset_name + '_indexes.npy')
        numel_per_class_path = os.path.join(os.getcwd() + '\episode_idx', self.dataset_name + '_numel_per_class.npy')
        if not os.path.exists(indexes_path) and not os.path.exists(numel_per_class_path):
            print("Creat dataset indexes")
            self.idxs = range(len(self.labels))
            self.indexes = np.empty((len(self.classes), max(self.counts)), dtype=int) * np.nan
            self.indexes = torch.Tensor(self.indexes)
            self.numel_per_class = torch.zeros_like(self.classes)
            for idx, label in enumerate(self.labels):
                label_idx = np.argwhere(self.classes == label).item()
                # 即np.where(condition),只有条件 (condition)没有x和y则输出满足条件 (即非0) 元素的坐标 (等价于numpy.nonzero)。
                # 这里的坐标以tuple的形式给出通常原数组有多少维输出的tuple中就包含几个数组分别对应符合条件元素的各维坐标。
                # 这里会返回第label_idx行中是nan的坐标, 格式((),)所以取[0][0] 获得第一个为nan的坐标
                # 给indexes对应class对应sample赋予idx (在label数组中的idx)
                self.indexes[label_idx, np.where(np.isnan(self.indexes[label_idx]))[0][0]] = idx
                self.numel_per_class[label_idx] += 1
            save_path = os.path.join(os.getcwd(), 'episode_idx')
            if not os.path.exists(save_path):
                os.makedirs(save_path)
            np.save(os.path.join(save_path, self.dataset_name) + "_indexes.npy", self.indexes)
            np.save(os.path.join(save_path, self.dataset_name) + "_numel_per_class.npy", self.numel_per_class)
        else:
            print("Read Dataset indexes.")
            self.indexes = torch.tensor(np.load(indexes_path))
            self.numel_per_class = torch.tensor(np.load(numel_per_class_path))

    def __iter__(self):
        """
        yield a batch (episode) of indexes
        """
        spc = self.sample_per_class
        cpi = self.classes_per_it

        for it in range(self.iterations):
            batch_size = spc * cpi
            batch = torch.LongTensor(batch_size)
            # 随机选取c个类
            c_idxs = torch.randperm(len(self.classes))[:cpi]
            for i, c in enumerate(self.classes[c_idxs]):
                # 从第i个class到第i+1个class在batch中的slice
                s = slice(i * spc, (i + 1) * spc)
                # FIXME when torch.argwhere will exists
                # 找到第i个类的label_idx
                label_idx = torch.arange(len(self.classes)).long()[self.classes == c].item()
                # 在第label_idx类中随机选择spc个样本
                sample_idxs = torch.randperm(self.numel_per_class[label_idx])[:spc]
                # 这些样本的索引写如batch
                batch[s] = self.indexes[label_idx][sample_idxs]
            # 随机打乱batch
            batch = batch[torch.randperm(len(batch))]
            yield batch

    def __len__(self):
        """
        returns the number of iterations (episodes, batches) per epoch
        """
        return self.iterations

这样在每个训练的epoch中我们采样iterations次每次采样的batch包含了K-way N-shot的样本。则模型总共进行episode训练的次数为epochs×iterations。

3.3 batch可视化

trans = Compose([ToTensor()])
dataset = MiniImageNetFolder(root='F:/processed_images/train/', phase="train", transformer=trans)
dataloader = DataLoader(dataset=dataset, batch_sampler=PrototypicalBatchSampler(dataset.targets, 5, 5, 10))


def visual_batch(dataloader, dataset_name):
    """
    Visualize images.
    :param x: Tensor, with shape of [batch_size, 3, h, w]
    :param y: Tensor, with shape of [batch_size, 1]
    :return:
    """
    x, y = next(iter(dataloader))
    plt.figure(figsize=(12, 12))
    for i in range(x.shape[0]):
        plt.subplot(5, 5, i + 1)
        idx = y[i].item() - 1
        plt.title(meta_info[idx, 1])
        plt.imshow(x[i].permute(1, 2, 0))
        plt.axis('off')

    if not os.path.exists(os.path.join(os.getcwd(), 'imgs')):
        os.makedirs(os.path.join(os.getcwd(), 'imgs'))
    plt.savefig('./imgs/visual_batch_' + dataset_name + '.png')

visual_batch(dataloader, "miniImageNet_train")