Datasets

The Dataset is the most basic class and implements the loading of your dataset's elements. You can either load your data lazily, i.e. only at the moment a sample is actually needed, or preload all samples up front and cache them.

Datasets can be indexed by integers and return single samples.

To implement a custom dataset, you should derive from AbstractDataset.
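As a minimal, delira-independent sketch of that design decision (the class and parameter names here are illustrative, not the actual delira API), the two loading strategies differ only in where `load_fn` is applied:

```python
class ToyDataset:
    """Illustrates lazy vs. cached loading.

    lazy=True:  _make_dataset-style setup keeps only the paths;
                load_fn runs per sample inside __getitem__.
    lazy=False: load_fn runs once for every path up front,
                so __getitem__ is a plain list lookup.
    """

    def __init__(self, paths, load_fn, lazy=True):
        self._load_fn = load_fn
        self._lazy = lazy
        # lazy: store paths only; cached: load everything now
        self.data = list(paths) if lazy else [load_fn(p) for p in paths]

    def __getitem__(self, index):
        item = self.data[index]
        return self._load_fn(item) if self._lazy else item

    def __len__(self):
        return len(self.data)


# str.upper stands in for a real image/volume loader
lazy_ds = ToyDataset(["a", "b"], str.upper, lazy=True)
cached_ds = ToyDataset(["a", "b"], str.upper, lazy=False)
```

Lazy loading keeps memory usage low at the cost of per-access I/O; caching pays the full loading cost once and then serves samples from RAM.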

AbstractDataset

class AbstractDataset(data_path: str, load_fn: Callable)[source]

Bases: object

Base Class for Dataset

abstract _make_dataset(path: str)[source]

Create dataset

Parameters

path (str) – path to data samples

Returns

data: List of sample paths if lazy; List of samples if not

Return type

list

get_sample_from_index(index)[source]

Returns the data sample for a given index, without performing any loading even if loading would be necessary. This implements the base case and can be subclassed for index mappings. The actual loading behaviour (lazy or cached) should be implemented in __getitem__

See also

ConcatDataset.get_sample_from_index, BaseLazyDataset.__getitem__, BaseCacheDataset.__getitem__

Parameters

index (int) – index corresponding to targeted sample

Returns

sample corresponding to given index

Return type

Any

get_subset(indices)[source]

Returns a Subset of the current dataset based on given indices

Parameters

indices (iterable) – valid indices to extract subset from current dataset

Returns

the subset

Return type

BlankDataset
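The subset mechanism can be sketched without delira (the class name below is illustrative, not the library's actual BlankDataset): a subset simply wraps the parent dataset and remaps indices, without copying or reloading the underlying samples.

```python
class ToySubset:
    """get_subset pattern: wrap a dataset and remap indices."""

    def __init__(self, dataset, indices):
        self._dataset = dataset
        self._indices = list(indices)

    def __getitem__(self, index):
        # index into the subset -> index into the parent dataset
        return self._dataset[self._indices[index]]

    def __len__(self):
        return len(self._indices)


# any indexable object works as the parent dataset here
sub = ToySubset(["a", "b", "c", "d"], [2, 0])
```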

train_test_split(*args, **kwargs)[source]

split dataset into train and test data

Parameters
  • *args – positional arguments of train_test_split

  • **kwargs – keyword arguments of train_test_split

Returns

  • BlankDataset – train dataset

  • BlankDataset – test dataset

See also

sklearn.model_selection.train_test_split
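Under stated assumptions, the split can be pictured as an index partition, which is what sklearn.model_selection.train_test_split produces before the two index lists are turned into sub-datasets (the helper below is a toy, not delira's implementation):

```python
import random


def train_test_split_indices(n_samples, test_size=0.25, seed=0):
    """Toy index split: shuffle all indices, then cut off a test chunk."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    n_test = int(round(n_samples * test_size))
    # train indices, test indices
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(8, test_size=0.25)
```

Each index list could then be passed to get_subset to obtain the train and test datasets.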

BaseLazyDataset

class BaseLazyDataset(data_path: Union[str, list], load_fn: Callable, **load_kwargs)[source]

Bases: delira.data_loading.dataset.AbstractDataset

Dataset to load data in a lazy way

_make_dataset(path: Union[str, list])[source]

Helper function to make a dataset containing the paths to all images in a certain directory

Parameters

path (str or list) – path to data samples

Returns

list of sample paths

Return type

list

Raises

AssertionError – if path is not a valid directory

get_sample_from_index(index)

Returns the data sample for a given index, without performing any loading even if loading would be necessary. This implements the base case and can be subclassed for index mappings. The actual loading behaviour (lazy or cached) should be implemented in __getitem__

See also

ConcatDataset.get_sample_from_index, BaseLazyDataset.__getitem__, BaseCacheDataset.__getitem__

Parameters

index (int) – index corresponding to targeted sample

Returns

sample corresponding to given index

Return type

Any

get_subset(indices)

Returns a Subset of the current dataset based on given indices

Parameters

indices (iterable) – valid indices to extract subset from current dataset

Returns

the subset

Return type

BlankDataset

train_test_split(*args, **kwargs)

split dataset into train and test data

Parameters
  • *args – positional arguments of train_test_split

  • **kwargs – keyword arguments of train_test_split

Returns

  • BlankDataset – train dataset

  • BlankDataset – test dataset

See also

sklearn.model_selection.train_test_split

BaseCacheDataset

class BaseCacheDataset(data_path: Union[str, list], load_fn: Callable, **load_kwargs)[source]

Bases: delira.data_loading.dataset.AbstractDataset

Dataset to preload and cache data

Notes

data needs to fit completely into RAM!

_make_dataset(path: Union[str, list])[source]

Helper function to make a dataset containing all samples in a certain directory

Parameters

path (str or list) – if data_path is a string, _sample_fn is called for all items inside the specified directory; if data_path is a list, _sample_fn is called for each element of the list

Returns

list of items which were returned by _sample_fn (typically dicts)

Return type

list

Raises

AssertionError – if path is not a list and is not a valid directory

get_sample_from_index(index)

Returns the data sample for a given index, without performing any loading even if loading would be necessary. This implements the base case and can be subclassed for index mappings. The actual loading behaviour (lazy or cached) should be implemented in __getitem__

See also

ConcatDataset.get_sample_from_index, BaseLazyDataset.__getitem__, BaseCacheDataset.__getitem__

Parameters

index (int) – index corresponding to targeted sample

Returns

sample corresponding to given index

Return type

Any

get_subset(indices)

Returns a Subset of the current dataset based on given indices

Parameters

indices (iterable) – valid indices to extract subset from current dataset

Returns

the subset

Return type

BlankDataset

train_test_split(*args, **kwargs)

split dataset into train and test data

Parameters
  • *args – positional arguments of train_test_split

  • **kwargs – keyword arguments of train_test_split

Returns

  • BlankDataset – train dataset

  • BlankDataset – test dataset

See also

sklearn.model_selection.train_test_split

BaseExtendCacheDataset

class BaseExtendCacheDataset(data_path: Union[str, list], load_fn: Callable, **load_kwargs)[source]

Bases: delira.data_loading.dataset.BaseCacheDataset

Dataset to preload and cache data. Function to load sample is expected to return an iterable which can contain multiple samples

Notes

data needs to fit completely into RAM!
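The difference from BaseCacheDataset can be sketched as follows (a toy helper, not delira's actual implementation): because the loading function may yield several samples per path, results are extended into the dataset list rather than appended one-to-one.

```python
def make_extend_dataset(paths, load_fn):
    """BaseExtendCacheDataset pattern: load_fn returns an iterable of
    samples per path, so the dataset list grows by extend, not append."""
    data = []
    for p in paths:
        data.extend(load_fn(p))  # one path -> possibly many samples
    return data


# e.g. one "volume" per path, yielding two slices each (illustrative)
data = make_extend_dataset(
    ["vol1", "vol2"],
    lambda p: [p + "_slice0", p + "_slice1"],
)
```

This is useful when a single file on disk (e.g. a 3D volume) should contribute multiple training samples (e.g. 2D slices).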

_make_dataset(path: Union[str, list])[source]

Helper function to make a dataset containing all samples in a certain directory

Parameters

path (str or iterable) – if data_path is a string, _sample_fn is called for all items inside the specified directory; if data_path is a list, _sample_fn is called for each element of the list

Returns

list of items which were returned by _sample_fn (typically dicts)

Return type

list

Raises

AssertionError – if path is not a list and is not a valid directory

get_sample_from_index(index)

Returns the data sample for a given index, without performing any loading even if loading would be necessary. This implements the base case and can be subclassed for index mappings. The actual loading behaviour (lazy or cached) should be implemented in __getitem__

See also

ConcatDataset.get_sample_from_index, BaseLazyDataset.__getitem__, BaseCacheDataset.__getitem__

Parameters

index (int) – index corresponding to targeted sample

Returns

sample corresponding to given index

Return type

Any

get_subset(indices)

Returns a Subset of the current dataset based on given indices

Parameters

indices (iterable) – valid indices to extract subset from current dataset

Returns

the subset

Return type

BlankDataset

train_test_split(*args, **kwargs)

split dataset into train and test data

Parameters
  • *args – positional arguments of train_test_split

  • **kwargs – keyword arguments of train_test_split

Returns

  • BlankDataset – train dataset

  • BlankDataset – test dataset

See also

sklearn.model_selection.train_test_split

ConcatDataset

class ConcatDataset(*datasets)[source]

Bases: delira.data_loading.dataset.AbstractDataset

abstract _make_dataset(path: str)

Create dataset

Parameters

path (str) – path to data samples

Returns

data: List of sample paths if lazy; List of samples if not

Return type

list

get_sample_from_index(index)[source]

Returns the data sample for a given index, without performing any loading even if loading would be necessary. This method implements the mapping of a global index to the subindices of the individual datasets. The actual loading behaviour (lazy or cached) should be implemented in __getitem__

See also

AbstractDataset.get_sample_from_index, BaseLazyDataset.__getitem__, BaseCacheDataset.__getitem__

Parameters

index (int) – index corresponding to targeted sample

Returns

sample corresponding to given index

Return type

Any
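The index mapping described above can be sketched with cumulative lengths (a toy class with illustrative names, not delira's implementation): a global index is located via binary search over the running totals, then shifted into the matching dataset's local range.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyConcat:
    """ConcatDataset pattern: map a global index to (dataset, local index)."""

    def __init__(self, *datasets):
        self.datasets = datasets
        # running totals of dataset lengths, e.g. [3, 5] for lengths 3 and 2
        self.cum_lengths = list(accumulate(len(d) for d in datasets))

    def get_sample_from_index(self, index):
        # find the first dataset whose cumulative length exceeds index
        ds_idx = bisect_right(self.cum_lengths, index)
        offset = self.cum_lengths[ds_idx - 1] if ds_idx else 0
        return self.datasets[ds_idx][index - offset]

    def __len__(self):
        return self.cum_lengths[-1]


cat = ToyConcat([0, 1, 2], [10, 11])
```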

get_subset(indices)

Returns a Subset of the current dataset based on given indices

Parameters

indices (iterable) – valid indices to extract subset from current dataset

Returns

the subset

Return type

BlankDataset

train_test_split(*args, **kwargs)

split dataset into train and test data

Parameters
  • *args – positional arguments of train_test_split

  • **kwargs – keyword arguments of train_test_split

Returns

  • BlankDataset – train dataset

  • BlankDataset – test dataset

See also

sklearn.model_selection.train_test_split

BlankDataset

Nii3DLazyDataset

Nii3DCacheDataset

TorchvisionClassificationDataset

class TorchvisionClassificationDataset(dataset, root='/tmp/', train=True, download=True, img_shape=(28, 28), one_hot=False, **kwargs)[source]

Bases: delira.data_loading.dataset.AbstractDataset

Wrapper for torchvision classification datasets to provide consistent API

_make_dataset(dataset, **kwargs)[source]

Create the actual dataset

Parameters
  • dataset (str) – Defines the dataset to use; must be one of [‘mnist’, ‘emnist’, ‘fashion_mnist’, ‘cifar10’, ‘cifar100’]

  • **kwargs – Additional keyword arguments passed to the torchvision dataset class for initialization

Returns

actual Dataset

Return type

torchvision.Dataset

Raises

KeyError – Dataset string does not specify a valid dataset

get_sample_from_index(index)

Returns the data sample for a given index, without performing any loading even if loading would be necessary. This implements the base case and can be subclassed for index mappings. The actual loading behaviour (lazy or cached) should be implemented in __getitem__

See also

ConcatDataset.get_sample_from_index, BaseLazyDataset.__getitem__, BaseCacheDataset.__getitem__

Parameters

index (int) – index corresponding to targeted sample

Returns

sample corresponding to given index

Return type

Any

get_subset(indices)

Returns a Subset of the current dataset based on given indices

Parameters

indices (iterable) – valid indices to extract subset from current dataset

Returns

the subset

Return type

BlankDataset

train_test_split(*args, **kwargs)

split dataset into train and test data

Parameters
  • *args – positional arguments of train_test_split

  • **kwargs – keyword arguments of train_test_split

Returns

  • BlankDataset – train dataset

  • BlankDataset – test dataset

See also

sklearn.model_selection.train_test_split