Dataset
Introduction
To adapt to its internal dataloader API, Intel® Neural Compressor implements some built-in datasets.
A dataset is a container that holds all the data the dataloader can use; its items can be fetched by index or produced through an iterator. To implement a specific dataset, inherit from the Dataset class and implement either the __iter__ method or the __getitem__ method. When implementing __getitem__, implementing __len__ as well is recommended.
Users can use Neural Compressor built-in dataset objects as well as register their own datasets.
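The two dataset styles described above can be sketched in plain Python. This is a minimal illustration of the protocol only: the class names SquaresDataset and CountdownDataset are hypothetical, and the classes do not inherit from the Neural Compressor Dataset class so that the sketch runs without the library installed.

```python
class SquaresDataset:
    """Index-style dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Recommended alongside __getitem__ so callers know the sample count.
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index


class CountdownDataset:
    """Iterable-style dataset: yields (value, label) pairs in order."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        for value in range(self.start, 0, -1):
            yield value, 0  # constant placeholder label


indexed = SquaresDataset(4)
print(len(indexed), indexed[3])
print(list(CountdownDataset(3)))
```

The index-style class suits data that supports random access; the iterable-style class suits streams whose length is not known in advance.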
Supported Framework Dataset Matrix
TensorFlow
Dataset | Parameters | Comments | Usage |
---|---|---|---|
MNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. | In yaml file: dataset: MNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
FashionMNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. | In yaml file: dataset: FashionMNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR10(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR10: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR100(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR100: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True) |
ImageRecord(root, transform, filter) | root (str): Root directory of dataset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: root/validation-000-of-100 root/validation-001-of-100 ... root/validation-099-of-100 The file name needs to follow this pattern: '*-*-of-*' |
In yaml file: dataset: ImageRecord: root: /path/to/root In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImageRecord'] (root=root, transform=transform, filter=None) |
ImageFolder(root, transform, filter) | root (str): Root directory of dataset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: root/class_1/xxx.png root/class_1/xxy.png root/class_1/xxz.png ... root/class_n/123.png root/class_n/nsdf3.png root/class_n/asd932_.png Please put images of different categories into different folders. |
In yaml file: dataset: ImageFolder: root: /path/to/root In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None) |
ImagenetRaw(data_path, image_list, transform, filter) | data_path (str): Root directory of dataset image_list (str): data file, record image_names and their labels transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: data_path/img1.jpg data_path/img2.jpg ... data_path/imgx.jpg The dataset reads the name and label of each image from the image_list file. If image_list is set to None, it reads from data_path/val_map.txt automatically. |
In yaml file: dataset: ImagenetRaw: data_path: /path/to/image image_list: /path/to/label In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None) |
COCORecord(root, num_cores, transform, filter) | root (str): Root directory of dataset num_cores (int, default=28):The number of input Datasets to interleave from in parallel transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Root is a full path to tfrecord file, which contains the file name. Please use Resize transform when batch_size > 1 |
In yaml file: dataset: COCORecord: root: /path/to/tfrecord num_cores: 28 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['COCORecord'] (root, num_cores=28, transform=transform, filter=None) |
COCORaw(root, img_dir, anno_dir, transform, filter) | root (str): Root directory of dataset img_dir (str, default='val2017'): image file directory anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: /root/img_dir/1.jpg /root/img_dir/2.jpg ... /root/img_dir/n.jpg /root/anno_dir Please use Resize transform when batch_size > 1 |
In yaml file: dataset: COCORaw: root: /path/to/root img_dir: /path/to/image anno_dir: /path/to/annotation In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None) If anno_dir is not set, the dataset will use default label map |
COCONpy(root, npy_dir, anno_dir) | root (str): Root directory of dataset npy_dir (str, default='val2017'): npy file directory anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory |
Please arrange data in this way: /root/npy_dir/1.jpg.npy /root/npy_dir/2.jpg.npy ... /root/npy_dir/n.jpg.npy /root/anno_dir Please use Resize transform when batch_size > 1 |
In yaml file: dataset: COCONpy: root: /path/to/root npy_dir: /path/to/npy anno_dir: /path/to/annotation In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['COCONpy'] (root, npy_dir, anno_dir) If anno_dir is not set, the dataset will use the default label map. |
dummy(shape, low, high, dtype, label, transform, filter) | shape (list or tuple): shape of the total samples; the first dimension should be the sample count of the dataset. To create tensors of multiple shapes, pass a list of tuples; one tensor is created for each tuple in the list. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported dtypes are 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. label (bool, default=True): whether to return 0 as the label. transform (transform object, default=None): the dummy dataset does not need a transform; if one is passed, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset constructs a dataset from a specified shape; the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy: shape: [3, 224, 224, 3] low: 0.0 high: 127.0 dtype: float32 label: True In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None) |
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) | input_shape (list or tuple): sample shape of one or more input tensors. A list represents the sample shape of the dataset, e.g. an image size should be represented as (224, 224, 3); a tuple containing multiple lists represents multiple input tensors. label_shape (list or tuple): sample shape of one or more label tensors. A list represents the sample shape of the label, e.g. a label size should be represented as (1,); a tuple containing multiple lists represents multiple label tensors. In yaml usage, (1,) is the default value. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported dtypes are 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if one is passed, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset constructs a dataset from a specified shape; the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy_v2: input_shape: [224, 224, 3] label_shape: [1] low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None) |
style_transfer(content_folder, style_folder, crop_ratio, resize_shape, image_format, transform, filter) | content_folder (str):Root directory of content images style_folder (str):Root directory of style images crop_ratio (float, default=0.1):cropped ratio to each side resize_shape (tuple, default=(256, 256)):target size of image image_format (str, default='jpg'): target image format transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Dataset used for the style transfer task. It is constructed from two image folders: a content image folder and a style image folder. | In yaml file: dataset: style_transfer: content_folder: /path/to/content_folder style_folder: /path/to/style_folder crop_ratio: 0.1 resize_shape: [256, 256] image_format: 'jpg' In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['style_transfer'] (content_folder, style_folder, crop_ratio, resize_shape, image_format, transform=transform, filter=None) |
TFRecordDataset(root, transform, filter) | root (str): filename of dataset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Root is a full path to tfrecord file, which contains the file name. | In yaml file: dataset: TFRecordDataset: root: /path/to/tfrecord In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['TFRecordDataset'] (root, transform=transform) |
bert(root, label_file, task, model_type, transform, filter) | root (str): path of dataset label_file (str): path of label file task (str, default='squad'): task type of model model_type (str, default='bert'): model type, support 'bert'. transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset supports tfrecord data; please refer to the Guide to create a tfrecord file first. | In yaml file: dataset: bert: root: /path/to/root label_file: /path/to/label_file task: squad model_type: bert In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['bert'] (root, label_file, transform=transform) |
sparse_dummy_v2(dense_shape, label_shape, sparse_ratio, low, high, dtype, transform, filter) | dense_shape (list or tuple): sample shape of one or more sparse tensors. A tuple represents the sample shape of the dataset, e.g. an image size should be represented as (224, 224, 3); a tuple containing multiple lists represents multiple input tensors. label_shape (list or tuple): sample shape of one or more label tensors. A list represents the sample shape of the label, e.g. a label size should be represented as (1,); a tuple containing multiple lists represents multiple label tensors. In yaml usage, (1,) is the default value. sparse_ratio (float, default=0.5): the ratio of sparsity, in the range [0, 1]. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported dtypes are 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if one is passed, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset constructs a dataset from a specified shape; the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: sparse_dummy_v2: dense_shape: [224, 224, 3] label_shape: [1] sparse_ratio: 0.5 low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['sparse_dummy_v2'] (dense_shape, label_shape, sparse_ratio, low, high, dtype, transform=None, filter=None) |
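The root/class_x/image layout required by the ImageFolder row in the table above can be illustrated with a small directory scanner. This is a sketch under assumptions, not the library's implementation: scan_image_folder and build_demo_tree are hypothetical helpers, and assigning integer labels by sorted folder name is an assumed convention.

```python
import tempfile
from pathlib import Path


def scan_image_folder(root):
    """Return (samples, class_names) for a root/class_x/image layout."""
    root = Path(root)
    # Sort for a deterministic class-name -> label mapping (assumed convention).
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label, name in enumerate(class_names):
        for image_path in sorted((root / name).iterdir()):
            if image_path.is_file():
                samples.append((image_path, label))
    return samples, class_names


def build_demo_tree(root):
    """Create a tiny placeholder tree; the files are empty, not real images."""
    layout = {"class_1": ["xxx.png", "xxy.png"], "class_n": ["123.png"]}
    for name, files in layout.items():
        folder = Path(root) / name
        folder.mkdir(parents=True)
        for file_name in files:
            (folder / file_name).touch()


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    samples, class_names = scan_image_folder(tmp)
    demo_labels = [(path.name, label) for path, label in samples]

print(class_names)
print(demo_labels)
```

Each sub-folder of root becomes one class, which is why images of different categories need to live in different folders.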
PyTorch
Dataset | Parameters | Comments | Usage |
---|---|---|---|
MNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. | In yaml file: dataset: MNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
FashionMNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. | In yaml file: dataset: FashionMNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR10(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR10: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR100(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR100: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True) |
ImageFolder(root, transform, filter) | root (str): Root directory of dataset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: root/class_1/xxx.png root/class_1/xxy.png root/class_1/xxz.png ... root/class_n/123.png root/class_n/nsdf3.png root/class_n/asd932_.png Please put images of different categories into different folders. |
In yaml file: dataset: ImageFolder: root: /path/to/root In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None) |
ImagenetRaw(data_path, image_list, transform, filter) | data_path (str): Root directory of dataset image_list (str): data file, record image_names and their labels transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: data_path/img1.jpg data_path/img2.jpg ... data_path/imgx.jpg The dataset reads the name and label of each image from the image_list file. If image_list is set to None, it reads from data_path/val_map.txt automatically. |
In yaml file: dataset: ImagenetRaw: data_path: /path/to/image image_list: /path/to/label In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None) |
COCORaw(root, img_dir, anno_dir, transform, filter) | root (str): Root directory of dataset img_dir (str, default='val2017'): image file directory anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: /root/img_dir/1.jpg /root/img_dir/2.jpg ... /root/img_dir/n.jpg /root/anno_dir Please use Resize transform when batch_size>1 |
In yaml file: dataset: COCORaw: root: /path/to/root img_dir: /path/to/image anno_dir: /path/to/annotation In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None) If anno_dir is not set, the dataset will use default label map |
dummy(shape, low, high, dtype, label, transform, filter) | shape (list or tuple): shape of the total samples; the first dimension should be the sample count of the dataset. To create tensors of multiple shapes, pass a list of tuples; one tensor is created for each tuple in the list. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported dtypes are 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. label (bool, default=True): whether to return 0 as the label. transform (transform object, default=None): the dummy dataset does not need a transform; if one is passed, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset constructs a dataset from a specified shape; the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy: shape: [3, 224, 224, 3] low: 0.0 high: 127.0 dtype: float32 label: True In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None) |
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) | input_shape (list or tuple): sample shape of one or more input tensors. A list represents the sample shape of the dataset, e.g. an image size should be represented as (224, 224, 3); a tuple containing multiple lists represents multiple input tensors. label_shape (list or tuple): sample shape of one or more label tensors. A list represents the sample shape of the label, e.g. a label size should be represented as (1,); a tuple containing multiple lists represents multiple label tensors. In yaml usage, (1,) is the default value. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported dtypes are 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if one is passed, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset constructs a dataset from a specified shape; the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy_v2: input_shape: [224, 224, 3] label_shape: [1] low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None) |
bert(dataset, task, model_type, transform, filter) | dataset (list): list of data task (str): the task of the model, support "classifier", "squad" model_type (str, default='bert'): model type, support 'distilbert', 'bert', 'xlnet', 'xlm' transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is constructed from the Bert TensorDataset and is not fully configurable from a yaml config. The original repo link is: https://github.com/huggingface/transformers. To use this dataset, add it before you initialize your DataLoader. | In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['bert'] (dataset, task, model_type, transform=transform, filter=None) Yaml configuration is not supported yet. |
sparse_dummy_v2(dense_shape, label_shape, sparse_ratio, low, high, dtype, transform, filter) | dense_shape (list or tuple): sample shape of one or more sparse tensors. A tuple represents the sample shape of the dataset, e.g. an image size should be represented as (224, 224, 3); a tuple containing multiple lists represents multiple input tensors. label_shape (list or tuple): sample shape of one or more label tensors. A list represents the sample shape of the label, e.g. a label size should be represented as (1,); a tuple containing multiple lists represents multiple label tensors. In yaml usage, (1,) is the default value. sparse_ratio (float, default=0.5): the ratio of sparsity, in the range [0, 1]. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, all tensors use the same low value. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): dtype of each tensor. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported dtypes are 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if one is passed, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset constructs a dataset from a specified shape; the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: sparse_dummy_v2: dense_shape: [224, 224, 3] label_shape: [1] sparse_ratio: 0.5 low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['sparse_dummy_v2'] (dense_shape, label_shape, sparse_ratio, low, high, dtype, transform=None, filter=None) |
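The ImagenetRaw row in the table above says the dataset reads each image's name and label from the image_list file. The table does not specify the file format, so the sketch below assumes one whitespace-separated "image_name label" pair per line; parse_image_list and the label values are hypothetical and only illustrate the idea.

```python
def parse_image_list(text):
    """Parse lines of 'image_name label' into (name, int_label) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, label = line.split()
        entries.append((name, int(label)))
    return entries


# Made-up labels for the file names used in the table's layout example.
demo_list = "img1.jpg 65\nimg2.jpg 970\n\nimgx.jpg 230\n"
entries = parse_image_list(demo_list)
print(entries)
```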
MXNet
Dataset | Parameters | Comments | Usage |
---|---|---|---|
MNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. | In yaml file: dataset: MNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
FashionMNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. | In yaml file: dataset: FashionMNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR10(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR10: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR100(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR100: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True) |
ImageFolder(root, transform, filter) | root (str): Root directory of dataset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: root/class_1/xxx.png root/class_1/xxy.png root/class_1/xxz.png ... root/class_n/123.png root/class_n/nsdf3.png root/class_n/asd932_.png Please put images of different categories into different folders. |
In yaml file: dataset: ImageFolder: root: /path/to/root In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None) |
ImagenetRaw(data_path, image_list, transform, filter) | data_path (str): Root directory of dataset image_list (str): data file, record image_names and their labels transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: data_path/img1.jpg data_path/img2.jpg ... data_path/imgx.jpg dataset will read name and label of each image from image_list file, if user set image_list to None, it will read from data_path/val_map.txt automatically. |
In yaml file: dataset: ImagenetRaw: data_path: /path/to/image image_list: /path/to/label In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None) |
COCORaw(root, img_dir, anno_dir, transform, filter) | root (str): Root directory of dataset img_dir (str, default='val2017'): image file directory anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: /root/img_dir/1.jpg /root/img_dir/2.jpg ... /root/img_dir/n.jpg /root/anno_dir Please use Resize transform when batch_size > 1 |
In yaml file: dataset: COCORaw: root: /path/to/root img_dir: /path/to/image anno_dir: /path/to/annotation In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None) If anno_dir is not set, the dataset will use default label map |
dummy(shape, low, high, dtype, label, transform, filter) | shape (list or tuple): shape of the whole dataset; the first dimension is the sample count. To create multiple tensors, pass a list of tuples; one tensor of the given size is created for each tuple. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is applied to all tensors. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): data type of the tensors. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. label (bool, default=True): whether to return 0 as label. transform (transform object, default=None): the dummy dataset does not need a transform; if transform is not None, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is to construct a dataset from a specific shape, the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy: shape: [3, 224, 224, 3] low: 0.0 high: 127.0 dtype: float32 label: True In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None) |
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) | input_shape (list or tuple): shape of a single input sample. A list describes one input tensor, e.g. an image is represented as (224, 224, 3); a tuple containing multiple lists creates multiple input tensors. label_shape (list or tuple): shape of a single label. A list describes one label tensor, e.g. a label is represented as (1,); a tuple containing multiple lists creates multiple label tensors. In yaml usage, the default value is (1,). low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is applied to all tensors. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): data type of the tensors. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if transform is not None, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is to construct a dataset from a specific shape, the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy_v2: input_shape: [224, 224, 3] label_shape: [1] low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None) |
sparse_dummy_v2(dense_shape, label_shape, sparse_ratio, low, high, dtype, transform, filter) | dense_shape (list or tuple): shape of a single sample, used to create one or more sparse tensors. A list describes one input tensor, e.g. an image is represented as (224, 224, 3); a tuple containing multiple lists creates multiple input tensors. label_shape (list or tuple): shape of a single label. A list describes one label tensor, e.g. a label is represented as (1,); a tuple containing multiple lists creates multiple label tensors. In yaml usage, the default value is (1,). sparse_ratio (float, default=0.5): the ratio of sparsity, in the range [0, 1]. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is applied to all tensors. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): data type of the tensors. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if transform is not None, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is to construct a dataset from a specific shape, the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: sparse_dummy_v2: dense_shape: [224, 224, 3] label_shape: [1] sparse_ratio: 0.5 low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['sparse_dummy_v2'] (dense_shape, label_shape, sparse_ratio, low, high, dtype, transform=None, filter=None) |
ONNXRT¶
Dataset | Parameters | Comments | Usage |
---|---|---|---|
MNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/MNIST/, otherwise user should put mnist.npz under root/MNIST/ manually. | In yaml file: dataset: MNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['MNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
FashionMNIST(root, train, transform, filter, download) | root (str): Root directory of dataset train(bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/FashionMNIST/, otherwise user should put train-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and t10k-images-idx3-ubyte.gz under root/FashionMNIST/ manually. | In yaml file: dataset: FashionMNIST: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['FashionMNIST'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR10(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR10: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR10'] (root=root, train=False, transform=transform, filter=None, download=True) |
CIFAR100(root, train, transform, filter, download) | root (str): Root directory of dataset train (bool, default=False): If True, creates dataset from train subset, otherwise from validation subset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions download (bool, default=True): If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. |
If download is True, it will download dataset to root/ and extract it automatically, otherwise user can download file from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz manually to root/ and extract it. | In yaml file: dataset: CIFAR100: root: /path/to/root train: False download: True (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['CIFAR100'] (root=root, train=False, transform=transform, filter=None, download=True) |
ImageFolder(root, transform, filter) | root (str): Root directory of dataset transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: root/class_1/xxx.png root/class_1/xxy.png root/class_1/xxz.png ... root/class_n/123.png root/class_n/nsdf3.png root/class_n/asd932_.png Please put images of different categories into different folders. |
In yaml file: dataset: ImageFolder: root: /path/to/root In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImageFolder'] (root=root,transform=transform, filter=None) |
ImagenetRaw(data_path, image_list, transform, filter) | data_path (str): Root directory of dataset image_list (str): data file, record image_names and their labels transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: data_path/img1.jpg data_path/img2.jpg ... data_path/imgx.jpg dataset will read name and label of each image from image_list file, if user set image_list to None, it will read from data_path/val_map.txt automatically. |
In yaml file: dataset: ImagenetRaw: data_path: /path/to/image image_list: /path/to/label In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['ImagenetRaw'] (data_path, image_list, transform=transform, filter=None) |
COCORaw(root, img_dir, anno_dir, transform, filter) | root (str): Root directory of dataset img_dir (str, default='val2017'): image file directory anno_dir (str, default='annotations/instances_val2017.json'): annotation file directory transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Please arrange data in this way: /root/img_dir/1.jpg /root/img_dir/2.jpg ... /root/img_dir/n.jpg /root/anno_dir Please use Resize transform when batch_size > 1 |
In yaml file: dataset: COCORaw: root: /path/to/root img_dir: /path/to/image anno_dir: /path/to/annotation In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['COCORaw'] (root, img_dir, anno_dir, transform=transform, filter=None) If anno_dir is not set, the dataset will use default label map |
dummy(shape, low, high, dtype, label, transform, filter) | shape (list or tuple): shape of the whole dataset; the first dimension is the sample count. To create multiple tensors, pass a list of tuples; one tensor of the given size is created for each tuple. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is applied to all tensors. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): data type of the tensors. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. label (bool, default=True): whether to return 0 as label. transform (transform object, default=None): the dummy dataset does not need a transform; if transform is not None, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is to construct a dataset from a specific shape, the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy: shape: [3, 224, 224, 3] low: 0.0 high: 127.0 dtype: float32 label: True In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy'] (shape, low, high, dtype, label, transform=None, filter=None) |
dummy_v2(input_shape, label_shape, low, high, dtype, transform, filter) | input_shape (list or tuple): shape of a single input sample. A list describes one input tensor, e.g. an image is represented as (224, 224, 3); a tuple containing multiple lists creates multiple input tensors. label_shape (list or tuple): shape of a single label. A list describes one label tensor, e.g. a label is represented as (1,); a tuple containing multiple lists creates multiple label tensors. In yaml usage, the default value is (1,). low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is applied to all tensors. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): data type of the tensors. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if transform is not None, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is to construct a dataset from a specific shape, the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: dummy_v2: input_shape: [224, 224, 3] label_shape: [1] low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['dummy_v2'] (input_shape, label_shape, low, high, dtype, transform=None, filter=None) |
GLUE(data_dir, model_name_or_path, max_seq_length, do_lower_case, task, model_type, dynamic_length, evaluate, transform, filter) | data_dir (str): The input data dir model_name_or_path (str): Path to pre-trained student model or shortcut name max_seq_length (int, default=128): The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. do_lower_case (bool, default=True): Whether or not to lowercase the input. task (str, default='mrpc'): The name of the task to fine-tune. Choices include mrpc, qqp, qnli, rte, sts-b, cola, mnli, wnli. model_type (str, default='bert'): model type, support 'distilbert', 'bert', 'mobilebert', 'roberta'. dynamic_length (bool, default=False): Whether to use fixed sequence length. evaluate (bool, default=True): Whether to do evaluation or training. transform (transform object, default=None): transform to process input data filter (Filter objects, default=None): filter out examples according to specific conditions |
Refer to this example on how to prepare dataset | In yaml file: dataset: bert: data_dir: /path/to/data/ model_name_or_path: bert-base-uncased (transform and filter are not set in the range of dataset) In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['bert'] (data_dir='/path/to/data/', model_name_or_path='bert-base-uncased', max_seq_length=128, task='mrpc', model_type='bert', dynamic_length=True, transform=None, filter=None) |
sparse_dummy_v2(dense_shape, label_shape, sparse_ratio, low, high, dtype, transform, filter) | dense_shape (list or tuple): shape of a single sample, used to create one or more sparse tensors. A list describes one input tensor, e.g. an image is represented as (224, 224, 3); a tuple containing multiple lists creates multiple input tensors. label_shape (list or tuple): shape of a single label. A list describes one label tensor, e.g. a label is represented as (1,); a tuple containing multiple lists creates multiple label tensors. In yaml usage, the default value is (1,). sparse_ratio (float, default=0.5): the ratio of sparsity, in the range [0, 1]. low (list or float, default=-128.): scales the tensor value range from [0, 1] to [0, low], or to [low, 0] if low < 0. If a float, the same low value is applied to all tensors. high (list or float, default=127.): offset added to every tensor element. If a list, its length should match the shape list. dtype (list or str, default='float32'): data type of the tensors. If a list, its length should match the shape list; if a str, all tensors use the same dtype. Supported values: 'float32', 'float16', 'uint8', 'int8', 'int32', 'int64', 'bool'. transform (transform object, default=None): the dummy dataset does not need a transform; if transform is not None, it is ignored. filter (Filter objects, default=None): filter out examples according to specific conditions |
This dataset is to construct a dataset from a specific shape, the value range is calculated from: low * stand_normal(0, 1) + high. | In yaml file: dataset: sparse_dummy_v2: dense_shape: [224, 224, 3] label_shape: [1] sparse_ratio: 0.5 low: 0.0 high: 127.0 dtype: float32 In user code: from neural_compressor.data import Datasets datasets = Datasets(framework) dataset = datasets['sparse_dummy_v2'] (dense_shape, label_shape, sparse_ratio, low, high, dtype, transform=None, filter=None) |
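The ImageFolder rows above expect one sub-folder per class under root. As a rough illustration of that layout only, the sketch below walks such a tree and pairs each file with an integer label. It is not the library's implementation: list_image_folder and build_demo_tree are hypothetical helpers, and assigning labels by sorted folder name is an assumption of this sketch.

```python
import os
import tempfile


def list_image_folder(root):
    """Return (classes, samples) for a root/class_x/image layout.

    Hypothetical helper: labels follow the sorted order of the class
    folder names, which is an assumption made for this illustration.
    """
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, cls in enumerate(classes):
        cls_dir = os.path.join(root, cls)
        for name in sorted(os.listdir(cls_dir)):
            samples.append((os.path.join(cls_dir, name), label))
    return classes, samples


def build_demo_tree():
    """Create a throwaway tree with empty placeholder image files."""
    root = tempfile.mkdtemp()
    layout = {"class_1": ["xxx.png", "xxy.png"], "class_n": ["123.png"]}
    for cls, names in layout.items():
        os.makedirs(os.path.join(root, cls))
        for name in names:
            open(os.path.join(root, cls, name), "w").close()
    return root
```

Calling list_image_folder(build_demo_tree()) yields the two class names and three (path, label) pairs, which is the kind of index a folder-based dataset needs before it can serve samples by position.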
Get started with Dataset API¶
Configure dataloader in a yaml file¶
quantization:
  approach: post_training_static_quant
  calibration:
    dataloader:
      dataset:
        COCORaw:
          root: /path/to/calibration/dataset
      filter:
        LabelBalance:
          size: 1
      transform:
        Resize:
          size: 300

evaluation:
  accuracy:
    metric:
      ...
    dataloader:
      batch_size: 16
      dataset:
        COCORaw:
          root: /path/to/evaluation/dataset
      transform:
        Resize:
          size: 300
  performance:
    dataloader:
      batch_size: 16
      dataset:
        dummy_v2:
          input_shape: [224, 224, 3]
User-specific dataset¶
Users can register their own datasets as follows:
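class Dataset(object):
    def __init__(self, args):
        # init code here; as a placeholder, args is treated as a
        # sequence of (data, label) pairs
        self.samples = list(args)

    def __getitem__(self, idx):
        # use idx to get data and label
        data, label = self.samples[idx]
        return data, label

    def __len__(self):
        # number of samples in the dataset
        return len(self.samples)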
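A dataset can also be iterable-style, exposing __iter__ instead of __getitem__ and __len__. The following is a minimal sketch of that variant; IterableDataset, num_samples and sample_len are hypothetical names, and the generated values are placeholders standing in for real data loading.

```python
class IterableDataset(object):
    """Hypothetical iterable-style dataset yielding (data, label) pairs."""

    def __init__(self, num_samples, sample_len):
        self.num_samples = num_samples
        self.sample_len = sample_len

    def __iter__(self):
        for i in range(self.num_samples):
            data = [float(i)] * self.sample_len  # placeholder features
            label = i % 2                        # placeholder label
            yield data, label


# Consume the dataset once to materialise its samples.
pairs = list(IterableDataset(num_samples=4, sample_len=3))
```

An iterable-style dataset suits streamed data whose length is unknown up front; an index-based dataset is the better fit when samples need to be fetched by position.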
After defining the dataset class, pass it to the quantizer:
from neural_compressor.experimental import Quantization, common

dataset = Dataset(args)  # instance of the user-defined dataset class
quantizer = Quantization(yaml_file)
# optional DataLoader args such as batch_size and collate_fn can also be passed
quantizer.calib_dataloader = common.DataLoader(dataset)
quantizer.model = graph
quantizer.eval_func = eval_func
q_model = quantizer.fit()
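To make the role of __getitem__ and __len__ concrete, here is a minimal sketch of how an index-based dataset can be grouped into batches. It is not Neural Compressor's DataLoader implementation; ToyDataset and iter_batches are hypothetical names used only for illustration.

```python
class ToyDataset(object):
    """Hypothetical index-based dataset with placeholder samples."""

    def __init__(self, size):
        self.items = [([float(i), float(i) + 0.5], i % 2) for i in range(size)]

    def __getitem__(self, idx):
        return self.items[idx]

    def __len__(self):
        return len(self.items)


def iter_batches(dataset, batch_size):
    """Yield (data, labels) lists of up to batch_size samples each."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, stop)]
        data = [sample[0] for sample in samples]
        labels = [sample[1] for sample in samples]
        yield data, labels


# Five samples with batch_size=2 give two full batches and one partial batch.
batches = list(iter_batches(ToyDataset(5), batch_size=2))
```

The last batch is smaller than batch_size because the sketch keeps the remainder rather than dropping it; a real dataloader may make that behaviour configurable.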
Examples¶
Refer to this example to learn how to define a customised dataset.
Refer to this HelloWorld example to learn how to configure a built-in dataset.