probly.datasets.torch

Collection of dataset classes for loading data from different datasets.

Classes

Benthic(root[, transform, first_order])

Implementation of the Benthic dataset.

CIFAR10H(root[, transform, download])

A Dataset class for the CIFAR10H dataset introduced in [PBGR19].

DCICDataset(root[, transform, first_order])

A Dataset base class for the DCICDatasets introduced in [SGZ+22].

ImageNetReaL(root[, transform])

A Dataset class for the ImageNet ReaL dataset introduced in [BHenaffK+20].

Plankton(root[, transform, first_order])

Implementation of the Plankton dataset.

QualityMRI(root[, transform, first_order])

Implementation of the QualityMRI dataset.

Treeversity1(root[, transform, first_order])

Implementation of the Treeversity#1 dataset.

Treeversity6(root[, transform, first_order])

Implementation of the Treeversity#6 dataset.

class probly.datasets.torch.Benthic(root, transform=None, *, first_order=True)[source]

Bases: DCICDataset

Implementation of the Benthic dataset.

The dataset can be found at https://zenodo.org/records/7180818.

Parameters:
  • root (Path | str)

  • transform (Callable[..., Any] | None)

  • first_order (bool)

class probly.datasets.torch.CIFAR10H(root, transform=None, *, download=False)[source]

Bases: CIFAR10

A Dataset class for the CIFAR10H dataset introduced in [PBGR19].

The dataset can be found at https://github.com/jcpeterson/cifar-10h.

Parameters:
  • root (str)

  • transform (Callable[..., Any] | None)

  • download (bool)

counts

Tensor containing the per-class human annotation counts for each instance.

Type:

torch.Tensor

targets

Tensor of size (n_instances, n_classes), first-order distribution.

Type:

torch.Tensor
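The relation between counts and targets can be illustrated with a small sketch. It assumes, which the text above does not confirm, that each row of targets is the matching row of counts normalized to sum to one. The helper name counts_to_distribution and the toy counts are hypothetical, and plain Python lists stand in for torch.Tensor so the snippet runs without extra dependencies.

```python
def counts_to_distribution(counts):
    """Normalize per-class annotation counts into a probability vector.

    Hypothetical helper: assumes a first-order distribution is the
    row-wise normalization of the counts.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize a row with zero total count")
    return [c / total for c in counts]


# Toy data: 2 instances, 3 classes (made-up counts, not taken from CIFAR10H).
toy_counts = [[8, 2, 0], [1, 1, 2]]
toy_targets = [counts_to_distribution(row) for row in toy_counts]
```

Each row of toy_targets sums to one, matching the (n_instances, n_classes) layout described for targets.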

download()
Return type:

None

extra_repr()
Return type:

str

base_folder = 'cifar-10-batches-py'
filename = 'cifar-10-python.tar.gz'
meta = {'filename': 'batches.meta', 'key': 'label_names', 'md5': '5ff9c542aee3614f3951f8cda6e48888'}
test_list = [['test_batch', '40351d587109b95175f43aff81a1287e']]
tgz_md5 = 'c58f30108f718f92721af3b95e74349a'
train_list = [['data_batch_1', 'c99cafc152244af753f735de768cd75f'], ['data_batch_2', 'd4bba439e000b95fd0a9bffe97cbabec'], ['data_batch_3', '54ebc095f3ab1f0389bbae665268c751'], ['data_batch_4', '634d18415352ddfa80567beed471001a'], ['data_batch_5', '482c414d41f54cd18b22e5b47cb7c3cb']]
url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'

class probly.datasets.torch.DCICDataset(root, transform=None, *, first_order=True)[source]

Bases: Dataset

A Dataset base class for the DCICDatasets introduced in [SGZ+22].

These datasets can be found at https://zenodo.org/records/7180818.

Parameters:
  • root (Path | str)

  • transform (Callable[..., Any] | None)

  • first_order (bool)

root

Root directory of the dataset.

Type:

str

transform

Transform to apply to the data.

Type:

Callable

image_labels

Dictionary of image labels grouped by image.

Type:

dict

image_paths

List of image paths.

Type:

list

label_mappings

Mapping of labels to indices.

Type:

dict

num_classes

Number of classes.

Type:

int

data

List of images.

Type:

list

targets

List of labels.

Type:

list
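How these attributes might relate to one another can be sketched with toy data. Everything below is an assumption for illustration: the annotation format, the class names, and the reading that first_order=True yields a distribution over classes per image while first_order=False yields a single label. None of this is confirmed by the reference above.

```python
from collections import Counter, defaultdict

# Hypothetical raw annotations: one (image path, label) pair per vote.
annotations = [
    ("img_0.png", "coral"),
    ("img_0.png", "coral"),
    ("img_0.png", "sponge"),
    ("img_1.png", "sponge"),
]

# Group labels by image (cf. image_labels).
image_labels = defaultdict(list)
for path, label in annotations:
    image_labels[path].append(label)

image_paths = sorted(image_labels)  # cf. image_paths
label_names = sorted({label for _, label in annotations})
label_mappings = {name: idx for idx, name in enumerate(label_names)}  # cf. label_mappings
num_classes = len(label_mappings)  # cf. num_classes


def soft_target(votes):
    """Return the empirical label distribution of one image's votes."""
    counts = Counter(label_mappings[v] for v in votes)
    return [counts[i] / len(votes) for i in range(num_classes)]


def hard_target(votes):
    """Return the index of the most frequent label (majority vote)."""
    return Counter(label_mappings[v] for v in votes).most_common(1)[0][0]


# Assumed meaning of first_order=True (soft) versus first_order=False (hard).
soft_targets = [soft_target(image_labels[p]) for p in image_paths]
hard_targets = [hard_target(image_labels[p]) for p in image_paths]
```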

class probly.datasets.torch.ImageNetReaL(root, transform=None)[source]

Bases: ImageNet

A Dataset class for the ImageNet ReaL dataset introduced in [BHenaffK+20].

This dataset is a re-labeled version of the ImageNet validation set in which each image can belong to multiple classes, resulting in a distribution over classes. The ImageNet dataset must be downloaded from https://www.image-net.org; the first-order labels can be downloaded from https://github.com/google-research/reassessed-imagenet.

Parameters:
  • root (str | Path)

  • transform (Callable[..., Any] | None)

dists

List of distributions over target classes.

Type:

list
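One way a set of valid classes per image could be turned into a distribution is sketched below. The uniform weighting, the handling of images without any label, and the helper name labels_to_distribution are assumptions made for illustration; the library may compute dists differently.

```python
def labels_to_distribution(label_ids, num_classes):
    """Spread probability mass uniformly over the given class indices.

    Assumption for illustration only. An empty label list yields an
    all-zero vector.
    """
    dist = [0.0] * num_classes
    for idx in label_ids:
        dist[idx] = 1.0 / len(label_ids)
    return dist


# Toy example with 4 classes; ImageNet itself has 1000.
toy_labels = [[0], [1, 3], []]
toy_dists = [labels_to_distribution(ids, 4) for ids in toy_labels]
```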

static make_dataset(directory, class_to_idx, extensions=None, is_valid_file=None, allow_empty=False)

Generates a list of samples of the form (path_to_sample, class).

This can be overridden to e.g. read files from a compressed zip file instead of from the disk.

Parameters:
  • directory (str) – root dataset directory, corresponding to self.root.

  • class_to_idx (Dict[str, int]) – Dictionary mapping class name to class index.

  • extensions (optional) – A list of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.

  • is_valid_file (optional) – A function that takes the path of a file and checks whether the file is valid (used to check for corrupt files). extensions and is_valid_file should not both be passed. Defaults to None.

  • allow_empty (bool, optional) – If True, empty folders are considered to be valid classes. An error is raised on empty folders if False (default).

Raises:
  • ValueError – In case class_to_idx is empty.

  • ValueError – In case extensions and is_valid_file are both None or both not None.

  • FileNotFoundError – In case no valid file was found for any class.

Returns:

Samples of the form (path_to_sample, class).

Return type:

List[Tuple[str, int]]
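The behavior described above can be approximated with a short standalone sketch. It is a simplified stand-in, not the inherited implementation: it supports only the extensions filter, and the function name make_dataset_sketch, the class names, and the file names are hypothetical.

```python
import os
import tempfile


def make_dataset_sketch(directory, class_to_idx, extensions):
    """Collect (path_to_sample, class_index) pairs, one per matching file.

    Simplified stand-in for illustration; it does not mirror every
    check of the real method.
    """
    if not class_to_idx:
        raise ValueError("class_to_idx must not be empty")
    samples = []
    for class_name in sorted(class_to_idx):
        class_dir = os.path.join(directory, class_name)
        if not os.path.isdir(class_dir):
            continue
        # Walk the class folder in a deterministic order.
        for root, _, fnames in sorted(os.walk(class_dir)):
            for fname in sorted(fnames):
                if fname.lower().endswith(tuple(extensions)):
                    samples.append((os.path.join(root, fname), class_to_idx[class_name]))
    return samples


# Build a throwaway directory tree with two hypothetical classes.
with tempfile.TemporaryDirectory() as tmp:
    for cls, fname in [("class_x", "a.png"), ("class_x", "notes.txt"), ("class_y", "b.png")]:
        os.makedirs(os.path.join(tmp, cls), exist_ok=True)
        open(os.path.join(tmp, cls, fname), "w").close()
    samples = make_dataset_sketch(tmp, {"class_x": 0, "class_y": 1}, (".png",))
    found = [(os.path.basename(path), idx) for path, idx in samples]
```

Here notes.txt is skipped because its extension is not in the allowed list, so found holds one entry per .png file.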

extra_repr()
Return type:

str

find_classes(directory)

Find the class folders in a dataset structured as follows:

directory/
├── class_x
│   ├── xxx.ext
│   ├── xxy.ext
│   └── ...
│       └── xxz.ext
└── class_y
    ├── 123.ext
    ├── nsdf3.ext
    └── ...
    └── asd932_.ext

This method can be overridden to only consider a subset of classes, or to adapt to a different dataset directory structure.

Parameters:

directory (str) – Root directory path, corresponding to self.root

Raises:

FileNotFoundError – If directory has no class folders.

Returns:

List of all classes and dictionary mapping each class to an index.

Return type:

(Tuple[List[str], Dict[str, int]])
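A minimal sketch of the described behavior follows, assuming classes are the sorted names of the immediate subdirectories and indices follow that order. The function name find_classes_sketch and the folder names are hypothetical; the inherited method may differ in its details.

```python
import os
import tempfile


def find_classes_sketch(directory):
    """Return (sorted class names, class-to-index mapping) for `directory`."""
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    if not classes:
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    return classes, class_to_idx


# Create two empty class folders in a temporary directory and scan it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("class_y", "class_x"):
        os.mkdir(os.path.join(tmp, name))
    classes, class_to_idx = find_classes_sketch(tmp)
```

Overriding the method in a subclass to return a filtered list would restrict the dataset to a subset of classes, as the description above suggests.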

parse_archives()
Return type:

None

property split_folder: str

class probly.datasets.torch.Plankton(root, transform=None, *, first_order=True)[source]

Bases: DCICDataset

Implementation of the Plankton dataset.

The dataset can be found at https://zenodo.org/records/7180818.

Parameters:
  • root (Path | str)

  • transform (Callable[..., Any] | None)

  • first_order (bool)

class probly.datasets.torch.QualityMRI(root, transform=None, *, first_order=True)[source]

Bases: DCICDataset

Implementation of the QualityMRI dataset.

The dataset can be found at https://zenodo.org/records/7180818.

Parameters:
  • root (Path | str)

  • transform (Callable[..., Any] | None)

  • first_order (bool)

class probly.datasets.torch.Treeversity1(root, transform=None, *, first_order=True)[source]

Bases: DCICDataset

Implementation of the Treeversity#1 dataset.

The dataset can be found at https://zenodo.org/records/7180818.

Parameters:
  • root (Path | str)

  • transform (Callable[..., Any] | None)

  • first_order (bool)

class probly.datasets.torch.Treeversity6(root, transform=None, *, first_order=True)[source]

Bases: DCICDataset

Implementation of the Treeversity#6 dataset.

The dataset can be found at https://zenodo.org/records/7180818.

Parameters:
  • root (Path | str)

  • transform (Callable[..., Any] | None)

  • first_order (bool)