audtorch


Deep learning with PyTorch and audio.

Have a look at the installation and usage instructions as a starting point.

If you are interested in PyTorch and audio, you should also check out the efforts to integrate audio more directly into PyTorch, such as torchaudio.

Installation

audtorch supports Python 3.6 or higher. To install it, run (preferably in a virtual environment):

pip install audtorch

Usage

audtorch automates the data iteration process for deep neural network training using PyTorch. It provides a set of feature extraction transforms that can be applied on-the-fly on the CPU.

The following example creates a data set of speech samples that are cut to a fixed length of 10240 samples. In addition, they are augmented on the fly during data loading by a transform that adds samples from another data set:

>>> import sounddevice as sd
>>> from audtorch import datasets, transforms
>>> noise = datasets.WhiteNoise(duration=10240, sampling_rate=16000)
>>> augment = transforms.Compose([transforms.RandomCrop(10240),
...                               transforms.RandomAdditiveMix(noise)])
>>> data = datasets.LibriSpeech(root='~/LibriSpeech', sets='dev-clean',
...                             download=True, transform=augment)
>>> signal, label = data[8]
>>> sd.play(signal.transpose(), data.sampling_rate)

Besides data sets and transforms, the package provides standard evaluation metrics, samplers, and the necessary collate functions for training deep neural networks for audio tasks.
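For example, the data set from the snippet above can be wrapped in a standard torch.utils.data.DataLoader for training; since all signals are cropped to the same length, the default collation works (a minimal sketch, batch size and number of workers are arbitrary choices):

>>> import torch
>>> loader = torch.utils.data.DataLoader(data, batch_size=16,
...                                      shuffle=True, num_workers=4)
>>> for signals, labels in loader:
...     pass  # feed `signals` to your model here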

Contributing

Everyone is invited to contribute to this project. Feel free to create a pull request. If you find errors, omissions, inconsistencies or other things that need improvement, please create an issue.

Development Installation

Instead of pip-installing the latest release from PyPI, you should get the newest development version from Github:

git clone https://github.com/audeering/audtorch/
cd audtorch
# Create virtual environment, e.g.
# virtualenv --python=python3 _env
# source _env/bin/activate
python setup.py develop

This way, your installation always stays up-to-date, even if you pull new changes from the Github repository.

If you prefer, you can also replace the last command with:

pip install -r requirements.txt

Pull requests

When creating a new pull request, please consider the following points:

  • Focus on a single topic as it is easier to review short pull requests

  • Ensure your code is readable and PEP 8 compatible

  • Provide a test for proposed new functionality

  • Add a docstring, see the Writing Documentation remarks below

  • Choose meaningful commit messages

Writing Documentation

The API documentation of audtorch is built automatically from the docstrings of its classes and functions.

Docstrings are written in reStructuredText, as indicated by the r at their beginning, and follow the Google docstring convention with the following additions:

  • Start argument descriptions in lower case and end the last sentence without punctuation.

  • If the argument is optional, its default value has to be indicated.

  • Descriptions of attributes also start in lower case and end without punctuation.

  • Attributes that can influence the behavior of the class should be described by the word controls.

  • Attributes that are supposed to be read only and provide only information should be described by the word holds.

  • Have a special section for class attributes.

  • Python variables should be set in single backticks in the description of the docstring, e.g. `True`. Only for some explicit statements, like a list of variables, it might look better to write them as code, e.g. `'mean'`.

The most important part of the docstring is the first line, which holds a short summary of the functionality. It should not be longer than one line, is written in the imperative mood, and ends with a period. It is also considered good practice to include a usage example.
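As an illustration, a docstring following these conventions could look like the following (the function itself is a hypothetical example, not part of audtorch):

def duration(signal, sampling_rate=16000):
    r"""Compute duration of a signal in seconds.

    The duration is the number of samples divided by `sampling_rate`.

    Args:
        signal (numpy.ndarray): audio signal
        sampling_rate (int, optional): sampling rate in Hz. Default: `16000`

    Example:
        >>> import numpy as np
        >>> duration(np.zeros((1, 16000)))
        1.0

    """
    return signal.shape[-1] / sampling_rate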

reStructuredText allows for easy inclusion of math in LaTeX syntax that will be dynamically rendered in the browser.

After you are happy with your docstring, you have to include it in the main documentation under the docs/ folder in the appropriate API file. For example, energy() is part of the utils module, so the corresponding file in the documentation is docs/api-utils.rst, where it is included.

Building Documentation

If you make changes to the documentation, you can re-create the HTML pages using Sphinx. You can install it and a few other necessary packages with:

pip install -r docs/requirements.txt

To create the HTML pages, use:

sphinx-build docs/ build/sphinx/html/ -b html

The generated files will be available in the directory build/sphinx/html/.

It is also possible to automatically check if all links are still valid:

sphinx-build docs/ build/sphinx/html/ -b linkcheck

Running Tests

You’ll need pytest and a few dependencies for that. They can be installed with:

pip install -r tests/requirements.txt

To execute the tests, simply run:

pytest

Creating a New Release

New releases are made using the following steps:

  1. Update CHANGELOG.rst

  2. Commit those changes as “Release X.Y.Z”

  3. Create an (annotated) tag with git tag -a X.Y.Z

  4. Push the commit and the tag to Github

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Version 0.6.4 (2020-11-02)

  • Fixed: link to documentation on Github pages in Python package

Version 0.6.3 (2020-10-30)

  • Added: use copy-button Sphinx plugin

  • Added: links to usage and installation to README

  • Changed: use sphinx-audeering-theme

  • Changed: update all documentation links to Github pages

Version 0.6.2 (2020-10-30)

  • Fixed: install missing pandoc for publishing documentation

Version 0.6.1 (2020-10-30)

  • Fixed: only install doc dependency for automatic release

Version 0.6.0 (2020-10-30)

  • Added: code coverage

  • Added: automatic publishing using Github Actions

  • Changed: use Github Actions for testing

  • Changed: host documentation as Github pages

  • Fixed: use newest librosa version

Version 0.5.2 (2020-03-03)

  • Fixed: disable automatic execution of notebook

Version 0.5.1 (2020-03-03)

  • Fixed: execute jupyter notebook on readthedocs

  • Fixed: release date of 5.0.0 in CHANGELOG

Version 0.5.0 (2020-03-03)

  • Added: RandomConvolutionalMix transform

  • Added: EmoDB data set

  • Added: introduction tutorial

  • Added: Python 3.8 support

  • Added: column_end + column_start to CsvDataset and PandasDataset

  • Added: random convolutional mix transform

  • Changed: default filename column in data sets is now file

  • Changed: force keyword only arguments

  • Fixed: stft functional example

  • Fixed: import of librosa

  • Removed: Python 3.5 support

Version 0.4.2 (2019-11-04)

  • Fixed: critical bug of missing files in wheel package (#60)

Version 0.4.1 (2019-10-25)

  • Fixed: default axis values for Masking transforms (#59)

Version 0.4.0 (2019-10-21)

  • Added: masking transforms in time and frequency domain

Version 0.3.2 (2019-10-04)

  • Fixed: long description in setup.cfg

Version 0.3.1 (2019-10-04)

  • Changed: define package in setup.cfg

Version 0.3.0 (2019-09-13)

  • Added: datasets.SpeechCommands (#49)

  • Removed: LogSpectrogram (#52)

Version 0.2.1 (2019-08-01)

  • Changed: Remove os.system call for moving files (#43)

  • Fixed: Remove broken logos from issue templates (#31)

  • Fixed: Wrong Spectrogram output shape in documentation (#40)

  • Fixed: Broken data set loading for relative paths (#33)

Version 0.2.0 (2019-06-28)

  • Added: Standardize, Log (#29)

  • Changed: Switch to Keep a Changelog format (#34)

  • Deprecated: LogSpectrogram (#29)

  • Fixed: normalize axis (#28)

Version 0.1.1 (2019-05-23)

  • Fixed: Broken API documentation on readthedocs

Version 0.1.0 (2019-05-22)

  • Added: Public release

Introduction

In this tutorial, we will see how one can use audtorch to speed up the development of audio-based deep learning applications.

Preliminaries

  • PyTorch already has an interface for data sets, aptly called Dataset

  • It then wraps this interface with a DataLoader that allows us to efficiently loop through the data in parallel and takes care of shuffling as well

  • All we need to do is implement the Dataset interface to get the model inputs and labels (a minimal sketch follows below)

  • However, it is not easy for beginners to see how one can go from a bunch of files on their hard drive to the features that will be used as input to a machine learning model

  • Thankfully, audtorch is there to take care of all that for you :-)

Before you start, you might want to familiarize yourselves with PyTorch’s data pipeline
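To make this a bit more concrete, here is a minimal sketch of a custom Dataset implementation (the file list and label scheme are placeholders; audtorch.datasets.load is used for reading the audio):

import torch
import audtorch

class MyAudioDataset(torch.utils.data.Dataset):
    """Toy data set mapping a list of audio files to labels."""

    def __init__(self, files, labels, transform=None):
        self.files = files
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # load the raw audio signal from disk
        signal, sampling_rate = audtorch.datasets.load(self.files[index])
        if self.transform is not None:
            signal = self.transform(signal)
        return signal, self.labels[index]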

Data loading

We are going to start with loading the necessary data.

audtorch offers a growing collection of data sets. Normally, using this interface requires one to have that particular data set on their hard drive. Some of them even support downloading from their original source.

We will be using the Berlin Database of Emotional Speech (EmoDB) for this tutorial. For convenience, we have included two of its files in a sub-directory. We recommend getting the whole database from its original website.

[ ]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import audtorch
import IPython.display as ipd
[ ]:
dataset = audtorch.datasets.EmoDB(
    root='data/emodb'
)
print(dataset)
[ ]:
x, y = dataset[0]
print(x.shape)
print(y)
[ ]:
ipd.Audio(x, rate=dataset.sampling_rate)

That’s it really. Up to this point, audtorch does not add much to PyTorch’s data API, which is already quite advanced anyway.

Feature extraction

Feature extraction is the first important benefit of using audtorch.

audtorch collects an ever-growing set of feature transformation and data pre-processing utilities. That way you don’t need to worry too much about getting your data pipeline ready and can quickly start with the cool modelling part.

A typical kind of feature used in the audio domain is spectral features. Audio signals are analyzed with respect to their frequency content using the Fourier transform.

Moreover, since that content changes over time, we normally use a short-time Fourier transform. This leads to a so-called spectrogram, which is nothing more than an image representation of the frequency content of a signal over time.

We assume that the reader is already familiar with this terminology. What’s important to point out is that audtorch is designed to allow for easy usage of those features in a typical PyTorch workflow. Below, we see an example of how a feature extraction transform is defined:

[ ]:
spec = audtorch.transforms.Spectrogram(
    window_size=int(0.025 * dataset.sampling_rate),
    hop_size=int(0.010 * dataset.sampling_rate)
)
print(spec)

By plotting the spectrogram, we see what frequency content our signal has over time.

[ ]:
spectrogram = spec(x)
plt.imshow(spectrogram.squeeze())
plt.gca().invert_yaxis()

The above image looks mostly empty. That’s because most of the content has very low power and is dominated by a few frequencies where most of the signal’s power is concentrated.

It is typical to compute the logarithm of the spectrogram to reveal more information. That squashes the input and reveals previously hidden structure in other frequency bands. Incidentally, this squashing reduces the dynamic range of the resulting image, which makes our input more suitable for deep neural network training.

audtorch provides a nice wrapper function for numpy’s log to simplify things.

[ ]:
lg = audtorch.transforms.Log()
print(lg)
[ ]:
log_spectrogram = lg(spectrogram)
[ ]:
plt.imshow(log_spectrogram.squeeze())
plt.gca().invert_yaxis()

This image shows that there is a lot more going on in our signal than we previously thought.

In general, we recommend always starting with a preliminary data analysis before you jump into modelling, to ensure you have a proper understanding of your problem.

audtorch is here to help you with that, and another useful feature is that it allows you to stack multiple transforms in a Compose transform. Below, we stack together the spectrogram and the log transforms to form a single object.

[ ]:
t = audtorch.transforms.Compose(
    [
        audtorch.transforms.Spectrogram(
            window_size=int(0.025 * 16000),
            hop_size=int(0.010 * 16000)
        ),
        audtorch.transforms.Log()
    ]
)
print(t)
[ ]:
plt.imshow(t(x).squeeze())
plt.gca().invert_yaxis()

This stacking can continue ad infinitum, as seen below with the Standardize transform.

Make sure to always stay up to date with all the transforms offered by audtorch!

[ ]:
t = audtorch.transforms.Compose(
    [
        audtorch.transforms.Spectrogram(
            window_size=int(0.025 * 16000),
            hop_size=int(0.010 * 16000)
        ),
        audtorch.transforms.Log(),
        audtorch.transforms.Standardize()
    ]
)
print(t)
[ ]:
plt.imshow(t(x).squeeze())
plt.gca().invert_yaxis()

Data augmentation

One of the most crucial aspects of recent deep learning successes is arguably data augmentation. Roughly, this means increasing the sampling of your input space by creating slightly different copies of the original input without changing the label.

In the image domain, people use a variety of transforms, such as:

  • Adding noise

  • Cropping

  • Rotating

  • Etc.

Things are not so easy in the audio domain. Rotation, for example, does not make any sense for spectrogram features, since the two axes are not interchangeable. In general, the community seems to use the following transforms:

  • Noise

  • Time/frequency masking

  • Pitch shifting

  • Etc.

An important feature of audtorch is making these transformations very easy to use in practice. In the following example, we will be using RandomAdditiveMix. This transform allows you to randomly mix audio samples with a noise data set of your choice (e.g. a large audio data set like AudioSet).

In this example, we will use a built-in data set, WhiteNoise, which simply creates a random white noise signal every time it is called.

[ ]:
random_mix = audtorch.transforms.RandomAdditiveMix(
    dataset=audtorch.datasets.WhiteNoise(sampling_rate=dataset.sampling_rate)
)
print(random_mix)

You can hear that this transform modifies the audio signal itself by adding this “static” TV noise to our original signal. Obviously though, the emotion of the speaker remains the same. This is a very practical way to augment your training set without changing the labels.

[ ]:
import IPython.display as ipd
ipd.Audio(random_mix(x), rate=dataset.sampling_rate)

Stacking data augmentation and feature extraction

What is really important is that audtorch allows us to do simultaneous data augmentation and feature extraction on-the-fly.

This is very useful in the typical case where we run the same training samples multiple times through the network (i.e. when we train for multiple epochs), and would like to slightly change the input every time. All we have to do is stack our data augmentation transforms on top of our feature extraction ones.

[ ]:
t = audtorch.transforms.Compose(
    [
        audtorch.transforms.RandomAdditiveMix(
            dataset=audtorch.datasets.WhiteNoise(sampling_rate=dataset.sampling_rate),
            expand_method='multiple'
        ),
        audtorch.transforms.Spectrogram(
            window_size=int(0.025 * dataset.sampling_rate),
            hop_size=int(0.010 * dataset.sampling_rate)
        ),
        audtorch.transforms.Log(),
        audtorch.transforms.Standardize()
    ]
)
print(t)

We can clearly see how this spectrogram seems noisier than the one we had before. Hopefully, this will be enough to make our classifier generalize better!

[ ]:
plt.imshow(t(x).squeeze())
plt.gca().invert_yaxis()

audtorch.collate

Collate functions manipulate and merge a list of samples to form a mini-batch, see torch.utils.data.DataLoader. An example use case is batching variable-length sequences, which requires padding each sample to the maximum length in the batch.

Collation

class audtorch.collate.Collation

Abstract interface for collation classes.

All other collation classes should subclass it. All subclasses should override __call__, which executes the actual collate function.

Seq2Seq

class audtorch.collate.Seq2Seq(sequence_dimensions, *, batch_first=None, pad_values=[0, 0], sort_sequences=True)

Pads mini-batches to longest contained sequence for seq2seq-purposes.

This class pads features and targets to the longest sequence in the batch. Before padding, length information is extracted from them.

Note

The tensors can be sorted in descending order of the features’ lengths by enabling sort_sequences. This anticipates the requirements of torch.nn.utils.rnn.pack_padded_sequence(), which is used by recurrent layers.

  • sequence_dimensions holds dimension of sequence in features and targets

  • batch_first controls output shape of features and targets

  • pad_values controls values to pad features (targets) with

  • sort_sequences controls if sequences are sorted in descending order of features’ lengths

Parameters
  • sequence_dimensions (list of ints) – indices representing dimension of sequence in feature and target tensors. Position 0 represents sequence dimension of features, position 1 represents sequence dimension of targets. Negative indexing is permitted

  • batch_first (bool or None, optional) – determines output shape of collate function. If None, original shape of features and targets is kept with dimension of batch size prepended. See Shape for more information. Default: None

  • pad_values (list, optional) – values to pad shorter sequences with. Position 0 represents value of features, position 1 represents value of targets. Default: [0, 0]

  • sort_sequences (bool, optional) – option whether to sort sequences in descending order of features’ lengths. Default: True

Shape:
  • Input: \((*, S, *)\), where \(*\) can be any number of further dimensions except \(N\) which is the batch size, and where \(S\) is the sequence dimension.

  • Output:

    • features:

      • \((N, *, S, *)\) if batch_first is None, i.e. the original input shape with \(N\) prepended which is the batch size

      • \((N, S, *, *)\) if batch_first is True

      • \((S, N, *, *)\) if batch_first is False

    • feats_lengths: \((N,)\)

    • targets: analogous to features

    • tgt_lengths: analogous to feats_lengths

Example

>>> import torch
>>> # data format: FS = (feature dimension, sequence dimension)
>>> batch = [[torch.zeros(161, 108), torch.zeros(10)],
...          [torch.zeros(161, 223), torch.zeros(12)]]
>>> collate_fn = Seq2Seq([-1, -1], batch_first=None)
>>> features = collate_fn(batch)[0]
>>> list(features.shape)
[2, 161, 223]
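A Seq2Seq instance can also be passed directly as collate_fn to a torch.utils.data.DataLoader. A small sketch reusing the batch from above (a plain list serves as the data set here):

>>> import torch
>>> dataset = [[torch.zeros(161, 108), torch.zeros(10)],
...            [torch.zeros(161, 223), torch.zeros(12)]]
>>> loader = torch.utils.data.DataLoader(dataset, batch_size=2,
...                                      collate_fn=Seq2Seq([-1, -1]))
>>> batch = next(iter(loader))
>>> list(batch[0].shape)
[2, 161, 223]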

audtorch.datasets

Audio data sets.

AudioSet

class audtorch.datasets.AudioSet(*args: Any, **kwargs: Any)

A large-scale dataset of manually annotated audio events.

Open and publicly available data set of audio events from Google: https://research.google.com/audioset/

License: CC BY 4.0

The categories corresponding to an audio signal are returned as a list, starting with those included in the top hierarchy of the AudioSet ontology, followed by those from the second hierarchy and then all other categories in a random order.

The signals to be returned can be limited by excluding or including only certain categories. This is achieved by first including only the desired categories, estimating all their parent categories, and then applying the exclusion.

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • targets controls the corresponding targets

  • sampling_rate holds the sampling rate of the returned data

  • original_sampling_rate holds the sampling rate of the audio files of the data set

Parameters
  • root (str) – root directory of dataset

  • csv_file (str, optional) – name of a CSV file located in root. Can be one of balanced_train_segments.csv, unbalanced_train_segments.csv, eval_segments.csv. Default: balanced_train_segments.csv

  • include (list of str, optional) – list of categories to include. If None all categories are included. Default: None

  • exclude (list of str, optional) – list of categories to exclude. If None no category is excluded. Default: None

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

AudioSet ontology categories of the two top hierarchies:

Human sounds            Animal                   Music
|-Human voice           |-Domestic animals, pets |-Musical instrument
|-Whistling             |-Livestock, farm        |-Music genre
|-Respiratory sounds    | animals, working       |-Musical concepts
|-Human locomotion      | animals                |-Music role
|-Digestive             \-Wild animals           \-Music mood
|-Hands
|-Heart sounds,         Sounds of things         Natural sounds
| heartbeat             |-Vehicle                |-Wind
|-Otoacoustic emission  |-Engine                 |-Thunderstorm
\-Human group actions   |-Domestic sounds,       |-Water
                        | home sounds            \-Fire
Source-ambiguous sounds |-Bell
|-Generic impact sounds |-Alarm                  Channel, environment
|-Surface contact       |-Mechanisms             and background
|-Deformable shell      |-Tools                  |-Acoustic environment
|-Onomatopoeia          |-Explosion              |-Noise
|-Silence               |-Wood                   \-Sound reproduction
\-Other sourceless      |-Glass
                        |-Liquid
                        |-Miscellaneous sources
                        \-Specific impact sounds

Warning

Some of the recordings in AudioSet were captured with mono and others with stereo input. The user must be careful to handle this, e.g. by using a transform to adjust the number of channels.

Example

>>> import sounddevice as sd
>>> data = AudioSet(root='/data/AudioSet', include=['Thunderstorm'])
>>> print(data)
Dataset AudioSet
    Number of data points: 73
    Root Location: /data/AudioSet
    Sampling Rate: 16000Hz
    CSV file: balanced_train_segments.csv
    Included categories: ['Thunderstorm']
>>> signal, target = data[4]
>>> target
['Natural sounds', 'Thunderstorm', 'Water', 'Rain', 'Thunder']
>>> sd.play(signal.transpose(), data.sampling_rate)

EmoDB

class audtorch.datasets.EmoDB(*args: Any, **kwargs: Any)

EmoDB data set.

Open and publicly available data set of acted emotions: http://www.emodb.bilderbar.info/navi.html

EmoDB is a small audio data set collected in an anechoic chamber at the Berlin Institute of Technology. It contains 5 male and 5 female speakers, consists of 10 unique sentences, and is annotated for 6 emotions plus a neutral state. The spoken language is German.

Parameters
  • root – root directory of dataset

  • transform – function/transform applied on the signal

  • target_transform – function/transform applied on the target

Note

  • When using the EmoDB data set in your research, please cite the following publication: [BPR+05].

Example

>>> import sounddevice as sd
>>> data = EmoDB('/data/emodb')
>>> print(data)
Dataset EmoDB
    Number of data points: 465
    Root Location: /data/emodb
    Sampling Rate: 16000Hz
    Labels: emotion
>>> signal, target = data[0]
>>> target
'A'
>>> sd.play(signal.transpose(), data.sampling_rate)

LibriSpeech

class audtorch.datasets.LibriSpeech(*args: Any, **kwargs: Any)

LibriSpeech speech data set.

Open and publicly available data set of voices from OpenSLR: http://www.openslr.org/12/

License: CC BY 4.0.

LibriSpeech contains several hundred hours of English speech with corresponding transcriptions in capital letters without punctuation.

It is split into different subsets according to the WER level achieved when performing speech recognition on the speakers. The subsets are: train-clean-100, train-clean-360, train-other-500, dev-clean, dev-other, test-clean, test-other.

  • root holds the data set’s location

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • labels controls the corresponding labels

  • sampling_rate holds the sampling rate of data set

In addition, the following class attributes are available

  • all_sets holds the names of the different pre-defined sets

  • urls holds the download links of the different sets

Parameters
  • root (str) – root directory of data set

  • sets (str or list, optional) – desired sets of LibriSpeech. Mutually exclusive with dataframe. Default: None

  • dataframe (pandas.DataFrame, optional) – pandas data frame containing columns audio_path (relative to root) and transcription. It can be used to pre-select files based on meta information, e.g. sequence length. Mutually exclusive with sets. Default: None

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

  • download (bool, optional) – download data set to root directory if not present. Default: False

Example

>>> import sounddevice as sd
>>> data = LibriSpeech(root='/data/LibriSpeech', sets='dev-clean')
>>> print(data)
Dataset LibriSpeech
    Number of data points: 2703
    Root Location: /data/LibriSpeech
    Sampling Rate: 16000Hz
    Sets: dev-clean
>>> signal, label = data[8]
>>> label
AS FOR ETCHINGS THEY ARE OF TWO KINDS BRITISH AND FOREIGN
>>> sd.play(signal.transpose(), data.sampling_rate)

MozillaCommonVoice

class audtorch.datasets.MozillaCommonVoice(*args: Any, **kwargs: Any)

Mozilla Common Voice speech data set.

Open and publicly available data set of voices from Mozilla: https://voice.mozilla.org/en/datasets

License: CC-0 (public domain)

Mozilla Common Voice includes the labels text, up_votes, down_votes, age, gender, accent, duration. You can select one of those labels, which is then returned as a string by the data set as target, or you can specify a list of labels, in which case the data set returns a dictionary containing those labels. The default label that is returned is text.

  • root holds the data set’s location

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • targets controls the corresponding targets

  • sampling_rate holds the sampling rate of the returned data

  • original_sampling_rate holds the sampling rate of the audio files of the data set

In addition, the following class attribute is available

  • url holds the download link of the data set

Parameters
  • root (str) – root directory of data set, where the CSV files are located, e.g. /data/MozillaCommonVoice/cv_corpus_v1

  • csv_file (str, optional) – name of a CSV file from the root folder. No absolute path is possible. You are most probably interested in cv-valid-train.csv, cv-valid-dev.csv, and cv-valid-test.csv. Default: cv-valid-train.csv.

  • label_type (str or list of str, optional) – one of text, up_votes, down_votes, age, gender, accent, duration. Or a list of any combination of those. Default: text

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

  • download (bool, optional) – download data set if not present. Default: False

Note

The Mozilla Common Voice data set is constantly growing. If you choose to download it, it will always grab the latest version. If you require reproducibility of your results, make sure to store a safe snapshot of the version you used.

Example

>>> import sounddevice as sd
>>> data = MozillaCommonVoice('/data/MozillaCommonVoice/cv_corpus_v1')
>>> print(data)
Dataset MozillaCommonVoice
    Number of data points: 195776
    Root Location: /data/MozillaCommonVoice/cv_corpus_v1
    Sampling Rate: 48000Hz
    Labels: text
    CSV file: cv-valid-train.csv
>>> signal, target = data[0]
>>> target
'learn to recognize omens and follow them the old king had said'
>>> sd.play(signal.transpose(), data.sampling_rate)

SpeechCommands

class audtorch.datasets.SpeechCommands(*args: Any, **kwargs: Any)

Data set of spoken words designed for keyword spotting tasks.

Speech Commands V2 publicly available from Google: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

License: CC BY 4.0

Parameters
  • root (str) – root directory of data set, where the CSV files are located, e.g. /data/speech_commands_v0.02

  • train (bool, optional) – if True, return the training split; if False, return the test split. Default: False

  • download (bool, optional) – download the data set to root if it is not already available. Default: False

  • include (str, or list of str, optional) – commands to include as ‘recognised’ words. Options: “10cmd”, “full”. A custom dataset can be defined using a list of command words. For example, [“stop”,”go”]. Words that are not in the “include” list are treated as unknown words. Default: ‘10cmd’

  • silence (bool, optional) – include a ‘silence’ class composed of background noise (Note: use RandomCrop when training). Default: True

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

Example

>>> import sounddevice as sd
>>> data = SpeechCommands(root='/data/speech_commands_v0.02')
>>> print(data)
Dataset SpeechCommands
    Number of data points: 97524
    Root Location: /data/speech_commands_v0.02
    Sampling Rate: 16000Hz
>>> signal, target = data[4]
>>> target
'right'
>>> sd.play(signal.transpose(), data.sampling_rate)

VoxCeleb1

class audtorch.datasets.VoxCeleb1(*args: Any, **kwargs: Any)

VoxCeleb1 data set.

Open and publicly available data set of voices from University of Oxford: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html

VoxCeleb1 is a large audio-visual data set consisting of short clips of human speech extracted from YouTube interviews with celebrities. It is free for commercial and research purposes.

Licence: CC BY-SA 4.0

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • targets controls the corresponding targets

  • sampling_rate holds the sampling rate of data set

In addition, the following class attributes are available:

  • url holds its URL

Parameters
  • root (str) – root directory of dataset

  • partition (str, optional) – name of the data partition to use. Choose one of train, dev, test or None. If None is given, then the whole data set will be returned. Default: train

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

Note

  • This data set will work only if the identification file is downloaded as is from the official homepage. Please open it in your browser and copy-paste its contents into a file on your computer.

  • To download the data set go to http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ and fill in the form to request a password. Get the Audio Files that the owners provide.

  • When using the VoxCeleb1 data set in your research, please cite the following publication: [NCZ17].

Example

>>> import sounddevice as sd
>>> data = VoxCeleb1('/data/voxceleb1')
>>> print(data)
Dataset VoxCeleb1
    Number of data points: 138361
    Root Location: /data/voxceleb1
    Sampling Rate: 16000Hz
    Labels: speaker ID
>>> signal, target = data[0]
>>> target
'id10003'
>>> sd.play(signal.transpose(), data.sampling_rate)

WhiteNoise

class audtorch.datasets.WhiteNoise(*args: Any, **kwargs: Any)

White noise data set.

The white noise is generated by numpy.random.standard_normal.

  • duration controls the duration of the noise signal

  • sampling_rate holds the sampling rate of the returned data

  • mean controls the mean of the underlying distribution

  • stdev controls the standard deviation of the underlying distribution

  • transform controls the input transform

  • target_transform controls the target transform

As white noise does not really have a sampling rate, you can use the following attribute to change it instead of resampling:

  • original_sampling_rate controls the sampling rate of the data set
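A small sketch of this (assuming the attribute is simply set on an instance):

>>> data = WhiteNoise(duration=1)
>>> # switch the data set to 16 kHz without resampling
>>> data.original_sampling_rate = 16000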

Parameters
  • duration (float) – duration of the noise signal in seconds

  • sampling_rate (int, optional) – sampling rate in Hz. Default: 44100

  • mean (float, optional) – mean of underlying distribution. Default: 0

  • stdev (float, optional) – standard deviation of underlying distribution. Default: 1

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

Note

Even though WhiteNoise has an infinite number of entries, its length is 1, as repeated calls with the same index return different signals.

Example

>>> import sounddevice as sd
>>> data = WhiteNoise(duration=1, sampling_rate=44100)
>>> print(data)
Dataset WhiteNoise
    Number of data points: Inf
    Signal length: 1s
    Sampling Rate: 44100Hz
    Label (str): noise type
>>> signal, target = data[0]
>>> target
'white noise'
>>> sd.play(signal.transpose(), data.sampling_rate)

Base

This section contains a mix of generic data sets that are useful for a wide variety of cases and can be used as base classes for other data sets.

AudioDataset

class audtorch.datasets.AudioDataset(*args: Any, **kwargs: Any)

Basic audio signal data set.

This data set can be used if you have a list of files and a list of corresponding targets.

In addition, this class is a great starting point to inherit from if you wish to build your own data set.

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • targets controls the corresponding targets

  • duration controls audio duration for every file in seconds

  • offset controls audio offset for every file in seconds

  • sampling_rate holds the sampling rate of the returned data

  • original_sampling_rate holds the sampling rate of the audio files of the data set

Parameters
  • files (list) – list of files

  • targets (list) – list of targets

  • sampling_rate (int) – sampling rate in Hz of the data set

  • root (str, optional) – root directory of dataset. Default: None

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

Example

>>> data = AudioDataset(files=['speech.wav', 'noise.wav'],
...                     targets=['speech', 'noise'],
...                     sampling_rate=8000,
...                     root='/data')
>>> print(data)
Dataset AudioDataset
    Number of data points: 2
    Root Location: /data
    Sampling Rate: 8000Hz
>>> signal, target = data[0]
>>> target
'speech'
extra_repr()

Set the extra representation of the data set.

To print customized extra information, you should reimplement this method in your own data set. Both single-line and multi-line strings are acceptable.

The extra information will be shown after the sampling rate entry.
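As mentioned above, AudioDataset is a convenient base class for custom data sets. A minimal sketch of a subclass (the folder layout and label derivation are hypothetical) could look like this:

import glob
import os

from audtorch import datasets


class FolderDataset(datasets.AudioDataset):
    r"""Data set of WAV files in a folder, labeled by sub-folder name."""

    def __init__(self, root, sampling_rate, *, transform=None):
        files = sorted(glob.glob(os.path.join(root, '*', '*.wav')))
        # use the name of the containing sub-folder as label
        targets = [os.path.basename(os.path.dirname(f)) for f in files]
        super().__init__(files=files, targets=targets,
                         sampling_rate=sampling_rate, transform=transform)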

PandasDataset

class audtorch.datasets.PandasDataset(*args: Any, **kwargs: Any)

Data set from pandas.DataFrame.

Create a data set by accessing the file locations and corresponding labels through a pandas.DataFrame.

You have to specify which labels of the data set you want as target by the names of the corresponding columns in the data frame. If you select a single column, the label is returned directly in its corresponding data type; alternatively, you can specify a list of columns and the data set will return a dictionary containing the labels.

The filenames of the corresponding audio files have to be specified with absolute path. If they are relative to a folder, you have to use the root argument to specify that folder.

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • targets controls the corresponding targets

  • sampling_rate holds the sampling rate of the returned data

  • original_sampling_rate holds the sampling rate of the audio files of the data set

  • column_labels holds the name of the label columns

Parameters
  • df (pandas.DataFrame) – data frame with filenames and labels

  • sampling_rate (int) – sampling rate in Hz of the data set

  • root (str, optional) – root directory added before the files listed in the CSV file. Default: None

  • column_labels (str or list of str, optional) – name of data frame column(s) containing the desired labels. Default: label

  • column_filename (str, optional) – name of column holding the file names. Default: file

  • column_start (str, optional) – name of column holding start of audio in the corresponding file in seconds. Default: None

  • column_end (str, optional) – name of column holding end of audio in the corresponding file in seconds. Default: None

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

Example

>>> data = PandasDataset(root='/data',
...                      df=dataset_dataframe,
...                      sampling_rate=44100,
...                      column_labels='age')
>>> print(data)
Dataset AudioDataset
    Number of data points: 120
    Root Location: /data
    Sampling Rate: 44100Hz
    Label: age
>>> signal, target = data[0]
>>> target
31

CsvDataset

class audtorch.datasets.CsvDataset(*args: Any, **kwargs: Any)

Data set from CSV files.

Create a data set by reading the file locations and corresponding labels from a CSV file.

You have to specify which labels you want as the target of the data set by the names of the corresponding columns in the CSV file. If you select a single column, the target is returned directly in its corresponding data type; alternatively, you can specify a list of columns and the data set will return a dictionary containing the targets.

The filenames of the corresponding audio files have to be specified with absolute path. If they are relative to a folder, you have to use the root argument to specify that folder.

  • transform controls the input transform

  • target_transform controls the target transform

  • files controls the audio files of the data set

  • targets controls the corresponding targets

  • sampling_rate holds the sampling rate of the returned data

  • original_sampling_rate holds the sampling rate of the audio files of the data set

  • csv_file holds the path to the used CSV file

Parameters
  • csv_file (str) – CSV file with filenames and labels

  • sampling_rate (int) – sampling rate in Hz of the data set

  • root (str, optional) – root directory added before the files listed in the CSV file. Default: None

  • column_labels (str or list of str, optional) – name of CSV column(s) containing the desired labels. Default: label

  • column_filename (str, optional) – name of CSV column holding the file names. Default: file

  • column_start (str, optional) – name of column holding start of audio in the corresponding file in seconds. Default: None

  • column_end (str, optional) – name of column holding end of audio in the corresponding file in seconds. Default: None

  • sep (str, optional) – CSV delimiter. Default: ,

  • transform (callable, optional) – function/transform applied on the signal. Default: None

  • target_transform (callable, optional) – function/transform applied on the target. Default: None

Example

>>> data = CsvDataset(csv_file='/data/train.csv',
...                   sampling_rate=44100,
...                   column_labels='age')
>>> print(data)
Dataset AudioDataset
    Number of data points: 120
    Sampling Rate: 44100Hz
    Label: age
    CSV file: /data/train.csv
>>> signal, target = data[0]
>>> target
31

AudioConcatDataset

class audtorch.datasets.AudioConcatDataset(*args: Any, **kwargs: Any)

Concatenation data set of multiple audio data sets.

This data set checks that all audio data sets are compatible with respect to the sampling rate which they are processed with.

  • sampling_rate holds the consistent sampling rate of the concatenated data set

  • datasets holds a list of all audio data sets

  • cumulative_sizes holds a list of sizes accumulated over all audio data sets, i.e. [len(data1), len(data1) + len(data2), …]

Parameters

datasets (list of audtorch.AudioDataset) – Audio data sets with property sampling_rate.

Example

>>> import sounddevice as sd
>>> from audtorch.datasets import LibriSpeech
>>> dev_clean = LibriSpeech(root='/data/LibriSpeech', sets='dev-clean')
>>> dev_other = LibriSpeech(root='/data/LibriSpeech', sets='dev-other')
>>> data = AudioConcatDataset([dev_clean, dev_other])
>>> print(data)
Data set AudioConcatDataset
Number of data points: 5567
Sampling Rate: 16000Hz

data sets      data points  extra
-----------  -------------  ---------------
LibriSpeech           2703  Sets: dev-clean
LibriSpeech           2864  Sets: dev-other
>>> signal, label = data[8]
>>> label
AS FOR ETCHINGS THEY ARE OF TWO KINDS BRITISH AND FOREIGN
>>> sd.play(signal.transpose(), data.sampling_rate)
extra_repr()

Set the extra representation of the data set.

To print customized extra information, you should reimplement this method in your own data set. Both single-line and multi-line strings are acceptable.

The extra information will be shown after the sampling rate entry.

Mixture

This section contains data sets that are primarily used for mixing different data sets.

SpeechNoiseMix

class audtorch.datasets.SpeechNoiseMix(*args: Any, **kwargs: Any)

Mix speech and noise with speech as target.

Add noise to each speech sample from the provided data by a mix transform. Return the mix as input and the speech signal as corresponding target. In addition, allow randomly replacing some of the mixes by noise as input and silence as target. This helps to train a speech enhancement algorithm to deal with input signals that contain only background noise [RPS18].

  • speech_dataset controls the speech data set

  • mix_transform controls the transform that adds noise

  • transform controls the transform applied on the mix

  • target_transform controls the transform applied on the target clean speech

  • joint_transform controls the transform applied jointly on the mixture and the target clean speech

  • percentage_silence controls the amount of noise-silent data augmentation

Parameters
  • speech_dataset (Dataset) – speech data set

  • mix_transform (callable) – function/transform that can augment a signal with noise

  • transform (callable, optional) – function/transform applied on the speech-noise-mixture (input) only. Default: None

  • target_transform (callable, optional) – function/transform applied on the speech (target) only. Default: None

  • joint_transform (callable, optional) – function/transform applied on the mixture (input) and speech (target) simultaneously. If the transform includes randomization, it is applied with the same random parameters during both calls

  • percentage_silence (float, optional) – value between 0 and 1, which controls the percentage of randomly inserted noise input, silent target pairs. Default: 0

Examples

>>> import sounddevice as sd
>>> from audtorch import datasets, transforms
>>> noise = datasets.WhiteNoise(duration=10, sampling_rate=48000)
>>> mix = transforms.RandomAdditiveMix(noise)
>>> normalize = transforms.Normalize()
>>> speech = datasets.MozillaCommonVoice(root='/data/MozillaCommonVoice/cv_corpus_v1')
>>> data = SpeechNoiseMix(speech, mix, transform=normalize)
>>> noisy, clean = data[0]
>>> sd.play(noisy.transpose(), data.sampling_rate)

Utils

Utility functions for handling audio data sets.

load

audtorch.datasets.load(filename, *, duration=None, offset=0)

Load audio file.

If an error occurs during loading because the file could not be found, is empty, or has the wrong format, an empty signal is returned and a warning is shown.

Parameters
  • filename (str or int or file-like object) – file name of input audio file

  • duration (float, optional) – return only a specified duration in seconds. Default: None

  • offset (float, optional) – start reading at offset in seconds. Default: 0

Returns

  • numpy.ndarray: two-dimensional array with shape (channels, samples)

  • int: sample rate of the audio file

Return type

tuple

Example

>>> signal, sampling_rate = load('speech.wav')

download_url

audtorch.datasets.download_url(url, root, *, filename=None, md5=None)

Download a file from a URL to a specified directory.

Parameters
  • url (str) – URL to download file from

  • root (str) – directory to place downloaded file in

  • filename (str, optional) – name to save the file under. If None, use basename of URL. Default: None

  • md5 (str, optional) – MD5 checksum of the download. If None, do not check. Default: None

Returns

path to downloaded file

Return type

str

download_url_list

audtorch.datasets.download_url_list(urls, root, *, num_workers=0)

Download files from a list of URLs to a specified directory.

Parameters
  • urls (list of str or dict) – either list of URLs or dictionary with URLs as keys and with either filenames or tuples of filename and MD5 checksum as values. Uses basename of URL if filename is None. Performs no check if MD5 checksum is None

  • root (str) – directory to place downloaded files in

  • num_workers (int, optional) – number of worker threads (0 = len(urls)). Default: 0
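For example, the dictionary form could be used like this (URLs, file names, and the checksum below are placeholders):

>>> urls = {
...     'https://example.com/a.wav': 'a.wav',
...     'https://example.com/b.wav': ('b.wav', '9e107d9d372bb6826bd81d3542a419d6'),
... }
>>> download_url_list(urls, '/tmp/downloads', num_workers=2)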

extract_archive

audtorch.datasets.extract_archive(filename, *, out_path=None, remove_finished=False)

Extract archive.

Currently tar.gz and tar archives are supported.

Parameters
  • filename (str) – path to archive

  • out_path (str, optional) – extract archive in this folder. Default: folder where archive is located in

  • remove_finished (bool, optional) – if True remove archive after extraction. Default: False

sampling_rate_after_transform

audtorch.datasets.sampling_rate_after_transform(dataset)

Sampling rate of data set after all transforms are applied.

A change of sampling rate by a transform is only recognized if that transform has the attribute output_sampling_rate.

Parameters

dataset (torch.utils.data.Dataset) – data set with sampling_rate attribute or property

Returns

sampling rate in Hz after all transforms are applied

Return type

int

Example

>>> from audtorch import datasets, transforms
>>> t = transforms.Resample(input_sampling_rate=16000,
...                         output_sampling_rate=8000)
>>> data = datasets.WhiteNoise(sampling_rate=16000, transform=t)
>>> sampling_rate_after_transform(data)
8000

ensure_same_sampling_rate

audtorch.datasets.ensure_same_sampling_rate(datasets)

Raise error if the provided data sets differ in sampling rate.

All data sets that are checked need to have a sampling_rate attribute or property.

Parameters

datasets (list of torch.utils.data.Dataset) – list of at least two audio data sets.

ensure_df_columns_contain

audtorch.datasets.ensure_df_columns_contain(df, labels)

Raise error if the list of labels is not in the dataframe columns.

Parameters
  • df (pandas.dataframe) – data frame

  • labels (list of str) – labels to be expected in df.columns

Example

>>> import pandas as pd
>>> df = pd.DataFrame(data=[(1, 2)], columns=['a', 'b'])
>>> ensure_df_columns_contain(df, ['a', 'c'])
Traceback (most recent call last):
RuntimeError: Dataframe contains only these columns: 'a, b'

ensure_df_not_empty

audtorch.datasets.ensure_df_not_empty(df, labels=None)

Raise error if dataframe is empty.

Parameters
  • df (pandas.dataframe) – data frame

  • labels (list of str, optional) – list of labels used to shrink data set. Default: None

Example

>>> import pandas as pd
>>> df = pd.DataFrame()
>>> ensure_df_not_empty(df)
Traceback (most recent call last):
RuntimeError: No valid data points found in data set

files_and_labels_from_df

audtorch.datasets.files_and_labels_from_df(df, *, column_labels=None, column_filename='filename')

Extract list of files and labels from dataframe columns.

Parameters
  • df (pandas.DataFrame) – data frame with filenames and labels

  • column_labels (str or list of str, optional) – name of data frame column(s) containing the desired labels. Default: None

  • column_filename (str, optional) – name of column holding the file names. Default: filename

Returns

  • list of str: list of files

  • list of str or list of dicts: list of labels

Return type

tuple

Example

>>> import os
>>> import pandas as pd
>>> df = pd.DataFrame(data=[('speech.wav', 'speech')],
...                   columns=['filename', 'label'])
>>> files, labels = files_and_labels_from_df(df, column_labels='label')
>>> os.path.relpath(files[0]), labels[0]
('speech.wav', 'speech')

defined_split

audtorch.datasets.defined_split(dataset, split_func)

Split data set into desired non-overlapping subsets.

Parameters
  • dataset (torch.utils.data.Dataset) – data set to be split

  • split_func (func) – function mapping from data set index to subset id, \(f(\text{index}) = \text{subset\_id}\). The target domain of subset ids does not need to cover the complete range [0, 1, …, (num_subsets - 1)]

Returns

desired subsets according to split_func

Return type

(list of Subsets)

Example

>>> import numpy as np
>>> import torch
>>> from torch.utils.data import TensorDataset
>>> from audtorch.samplers import buckets_of_even_size
>>> data = TensorDataset(torch.randn(100))
>>> lengths = np.random.randint(0, 1000, (100,))
>>> split_func = buckets_of_even_size(lengths, 5)
>>> subsets = defined_split(data, split_func)
>>> [len(subset) for subset in subsets]
[20, 20, 20, 20, 20]

audtorch.metrics

EnergyConservingLoss

class audtorch.metrics.EnergyConservingLoss(*args: Any, **kwargs: Any)

Energy conserving loss.

A two-term loss that enforces energy conservation after [RPS18].

The loss can be described as:

\[\ell(x, y, m) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = |x_n - y_n| + |b_n - \hat{b_n}|, \]

where \(N\) is the batch size. If reduction is not 'none', then:

\[\ell(x, y, m) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases} \]

\(x\) is the input signal (estimated target), \(y\) the target signal, \(m\) the mixture signal, \(b\) the background signal given by \(b = m - y\), and \(\hat{b}\) the estimated background signal given by \(\hat{b} = m - x\).

Parameters

reduction (string, optional) – specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the sum of the output will be divided by the number of elements in the output, ‘sum’: the output will be summed.

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions

  • Target: \((N, *)\), same shape as the input

  • Mixture: \((N, *)\), same shape as the input

  • Output: scalar. If reduction is 'none', then \((N, *)\), same shape as the input

Examples

>>> import torch
>>> _ = torch.manual_seed(0)
>>> loss = EnergyConservingLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5)
>>> mixture = torch.randn(3, 5)
>>> loss(input, target, mixture)
tensor(2.1352, grad_fn=<AddBackward0>)

PearsonR

class audtorch.metrics.PearsonR(*, reduction='mean', batch_first=True)

Computes Pearson Correlation Coefficient.

Pearson Correlation Coefficient (also known as Linear Correlation Coefficient or Pearson’s \(\rho\)) is computed as:

\[\rho = \frac {E[(X-\mu_X)(Y-\mu_Y)]} {\sigma_X\sigma_Y}\]

If inputs are vectors, computes Pearson’s \(\rho\) between the two of them. If inputs are multi-dimensional arrays, computes Pearson’s \(\rho\) along the first or last input dimension according to the batch_first argument, returns a torch.Tensor as output, and optionally reduces it according to the reduction argument.

Parameters
  • reduction (string, optional) – specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the sum of the output will be divided by the number of elements in the output, ‘sum’: the output will be summed. Default: ‘mean’

  • batch_first (bool, optional) – controls if batch dimension is first. Default: True

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions

  • Target: \((N, *)\), same shape as the input

  • Output: scalar. If reduction is 'none', then \((N, 1)\)

Example

>>> import torch
>>> _ = torch.manual_seed(0)
>>> metric = PearsonR()
>>> input = torch.rand(3, 5)
>>> target = torch.rand(3, 5)
>>> metric(input, target)
tensor(0.1220)

ConcordanceCC

class audtorch.metrics.ConcordanceCC(*, reduction='mean', batch_first=True)

Computes Concordance Correlation Coefficient (CCC).

CCC is computed as:

\[\rho_c = \frac {2\rho\sigma_X\sigma_Y} {\sigma_X\sigma_X + \sigma_Y\sigma_Y + (\mu_X - \mu_Y)^2}\]

where \(\rho\) is Pearson Correlation Coefficient, \(\sigma_X\), \(\sigma_Y\) are the standard deviation, and \(\mu_X\), \(\mu_Y\) the mean values of \(X\) and \(Y\) accordingly.

If inputs are vectors, computes CCC between the two of them. If inputs are multi-dimensional arrays, computes CCC along the first or last input dimension according to the batch_first argument, returns a torch.Tensor as output, and optionally reduces it according to the reduction argument.

Parameters
  • reduction (string, optional) – specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied, ‘mean’: the sum of the output will be divided by the number of elements in the output, ‘sum’: the output will be summed. Default: ‘mean’

  • batch_first (bool, optional) – controls if batch dimension is first. Default: True

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions

  • Target: \((N, *)\), same shape as the input

  • Output: scalar. If reduction is 'none', then \((N, 1)\)

Example

>>> import torch
>>> _ = torch.manual_seed(0)
>>> metric = ConcordanceCC()
>>> input = torch.rand(3, 5)
>>> target = torch.rand(3, 5)
>>> metric(input, target)
tensor(0.0014)

audtorch.metrics.functional

The goal of the metrics functionals is to provide functions that work independently of the dimensions of the input signal and can easily be used to create additional metrics and losses.
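For instance, one could build a simple correlation-based loss on top of these functionals (a sketch, not part of audtorch):

import torch
from audtorch.metrics.functional import pearsonr


class PearsonLoss(torch.nn.Module):
    r"""Loss that rewards high Pearson correlation between input and target."""

    def forward(self, input, target):
        # pearsonr returns one coefficient per sequence in the mini-batch
        rho = pearsonr(input, target, batch_first=True)
        return 1 - rho.mean()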

pearsonr

audtorch.metrics.functional.pearsonr(x, y, batch_first=True)

Computes Pearson Correlation Coefficient across rows.

Pearson Correlation Coefficient (also known as Linear Correlation Coefficient or Pearson’s \(\rho\)) is computed as:

\[\rho = \frac {E[(X-\mu_X)(Y-\mu_Y)]} {\sigma_X\sigma_Y}\]

If the inputs are matrices, we assume that we are given a mini-batch of sequences, and the correlation coefficient is computed for each sequence independently and returned as a vector. If batch_first is True, we assume that every row represents a sequence in the mini-batch; otherwise we assume that the batch information is in the columns.

Warning

We do not account for the multi-dimensional case. This function has been tested only for the 2D case, either in batch_first==True or in batch_first==False mode. In the multi-dimensional case, it is possible that the values returned will be meaningless.

Parameters
  • x (torch.Tensor) – input tensor

  • y (torch.Tensor) – target tensor

  • batch_first (bool, optional) – controls if batch dimension is first. Default: True

Returns

correlation coefficient between x and y

Return type

torch.Tensor

Note

\(\sigma_X\) is computed using PyTorch builtin Tensor.std(), which by default uses Bessel correction:

\[\sigma_X=\sqrt{\frac{1}{N-1}\sum_{i=1}^N({x_i}-\bar{x})^2}\]

We therefore account for this correction in the computation of the covariance by multiplying it with \(\frac{1}{N-1}\).

Shape:
  • Input: \((N, M)\) for correlation between matrices, or \((M)\) for correlation between vectors

  • Target: \((N, M)\) or \((M)\). Must be identical to input

  • Output: \((N, 1)\) for correlation between matrices, or \((1)\) for correlation between vectors

Examples

>>> import torch
>>> _ = torch.manual_seed(0)
>>> input = torch.rand(3, 5)
>>> target = torch.rand(3, 5)
>>> output = pearsonr(input, target)
>>> print('Pearson Correlation between input and target is {0}'.format(output[:, 0]))
Pearson Correlation between input and target is tensor([ 0.2991, -0.8471,  0.9138])

concordance_cc

audtorch.metrics.functional.concordance_cc(x, y, batch_first=True)

Computes Concordance Correlation Coefficient across rows.

Concordance Correlation Coefficient is computed as:

\[\rho_c = \frac {2\rho\sigma_X\sigma_Y} {\sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2}\]

where \(\rho\) is the Pearson Correlation Coefficient, \(\sigma_X\), \(\sigma_Y\) are the standard deviations, and \(\mu_X\), \(\mu_Y\) the mean values of \(X\) and \(Y\), respectively.

If the inputs are matrices, we assume that we are given a mini-batch of sequences, and the concordance correlation coefficient is computed for each sequence independently and returned as a vector. If batch_first is True, we assume that every row represents a sequence in the mini-batch, otherwise we assume that the batch information is in the columns.

Warning

We do not account for the multi-dimensional case. This function has been tested only for the 2D case, either in batch_first==True or in batch_first==False mode. In the multi-dimensional case, it is possible that the values returned will be meaningless.

Note

\(\sigma_X\) is computed using the PyTorch built-in Tensor.std(), which by default uses Bessel's correction:

\[\sigma_X=\displaystyle\sqrt{\frac{1}{N-1}\sum_{i=1}^N({x_i}-\bar{x})^2}\]

We therefore account for this correction in the computation of the concordance correlation coefficient by scaling every product of standard deviations with \(\frac{N-1}{N}\). This is equivalent to multiplying only \((\mu_X - \mu_Y)^2\) with \(\frac{N}{N-1}\); we choose that option for numerical stability.

Parameters
  • x (torch.Tensor) – input tensor

  • y (torch.Tensor) – target tensor

  • batch_first (bool, optional) – controls if batch dimension is first. Default: True

Returns

concordance correlation coefficient between x and y

Return type

torch.Tensor

Shape:
  • Input: \((N, M)\) for correlation between matrices, or \((M)\) for correlation between vectors

  • Target: \((N, M)\) or \((M)\). Must be identical to input

  • Output: \((N, 1)\) for correlation between matrices, or \((1)\) for correlation between vectors

Examples

>>> import torch
>>> _ = torch.manual_seed(0)
>>> input = torch.rand(3, 5)
>>> target = torch.rand(3, 5)
>>> output = concordance_cc(input, target)
>>> print('Concordance Correlation between input and target is {0}'.format(output[:, 0]))
Concordance Correlation between input and target is tensor([ 0.2605, -0.7862,  0.5298])

audtorch.samplers

BucketSampler

class audtorch.samplers.BucketSampler(concat_dataset, batch_sizes, num_batches=None, permuted_order=False, shuffle_each_bucket=True, drop_last=False)

Creates batches from ordered data sets.

This sampler iterates over the data sets of concat_dataset and samples sequentially from them. All samples of a batch deliberately originate from the same data set; only when the current data set is exhausted is the next data set sampled from. In other words, samples from different buckets are never mixed.

In each epoch num_batches batches of size batch_sizes are extracted from each data set. If the requested number of batches cannot be extracted from a data set, only its available batches are queued. By default, the data sets (and thus their batches) are iterated over in increasing order of their data set id.

Note

The information in batch_sizes and num_batches refers to the data sets at the same index, independently of permuted_order.

Simple Use Case: “Train on data with increasing sequence length”

bucket_id:    [0,    1,    2,    … end  ]
batch_sizes:  [32,   16,   8,    … 2    ]
num_batches:  [None, None, None, … None ]

Result: “Extract all batches (None) from all data sets, all of different batch size, and queue them in increasing order of their data set id”

  • batch_sizes controls batch size for each data set

  • num_batches controls number of batches to extract from each data set

  • permuted_order controls whether the order in which the data sets are iterated over is permuted, or specifies a custom iteration order

  • shuffle_each_bucket controls if each data set is shuffled

  • drop_last controls whether to drop last samples of a bucket which cannot form a mini-batch

Parameters
  • concat_dataset (torch.utils.data.ConcatDataset) – ordered concatenated data set

  • batch_sizes (list) – batch sizes per data set. Permissible values are unsigned integers

  • num_batches (list or None, optional) – number of batches per data set. Permissible values are non-negative integers and None. If None, then as many batches are extracted as data set provides. Default: None

  • permuted_order (bool or list, optional) – controls whether to permute the order of data set ids in which the respective data sets’ batches are queued. If True, the data set ids are shuffled; if False, they are not. Alternatively, a custom list of permuted data set ids can be specified. Default: False

  • shuffle_each_bucket (bool, optional) – controls whether the samples in each data set are shuffled. Setting this to True is recommended. Default: True

  • drop_last (bool, optional) – controls whether the last samples of a bucket which cannot form a mini-batch should be dropped. Default: False

Example

>>> import numpy as np
>>> import torch
>>> from torch.utils.data import (TensorDataset, ConcatDataset)
>>> from audtorch.datasets.utils import defined_split
>>> from audtorch.samplers import (BucketSampler, buckets_of_even_size)
>>> data = TensorDataset(torch.randn(100))
>>> lengths = np.random.randint(0, 890, (100,))
>>> split_func = buckets_of_even_size(lengths, num_buckets=3)
>>> subsets = defined_split(data, split_func)
>>> concat_dataset = ConcatDataset(subsets)
>>> batch_sampler = BucketSampler(concat_dataset, 3 * [16])
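The resulting sampler can then be passed to a standard torch.utils.data.DataLoader through its batch_sampler argument. A minimal sketch, continuing the example above (the training step itself is omitted):

from torch.utils.data import DataLoader

# batch_sampler yields lists of indices, one bucket-wise batch at a time
loader = DataLoader(concat_dataset, batch_sampler=batch_sampler)
for batch in loader:
    pass  # training step goes here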

buckets_by_boundaries

audtorch.samplers.buckets_by_boundaries(key_values, bucket_boundaries)

Split samples into buckets based on key values using bucket boundaries.

Note

A sample is sorted into bucket \(i\) if its key value satisfies:

\(b_{i-1} \leq \text{key value} < b_i\), where \(b_i\) is the bucket boundary at index \(i\)

Parameters
  • key_values (list) – contains key values, e.g. sequence lengths

  • bucket_boundaries (list) – contains the boundaries of the buckets in ascending order. The list should contain neither a lower nor an upper boundary, e.g. not numpy.iinfo.min or numpy.iinfo.max.

Returns

Key function to use for splitting: \(f(\text{item}) = \text{bucket\_id}\)

Return type

func

Example

>>> lengths = [288, 258, 156, 99, 47, 13]
>>> boundaries = [80, 150]
>>> split_func = buckets_by_boundaries(lengths, boundaries)
>>> [split_func(i) for i in range(len(lengths))]
[2, 2, 2, 1, 0, 0]

buckets_of_even_size

audtorch.samplers.buckets_of_even_size(key_values, num_buckets, *, reverse=False)

Split samples into buckets of even size based on key values.

The samples are sorted by increasing (or, with reverse, decreasing) key value. If the number of samples cannot be distributed evenly across the buckets, the first buckets are each filled up with one remainder.

Parameters
  • key_values (list) – contains key values, e.g. sequence lengths

  • num_buckets (int) – number of buckets to form. Permitted are positive integers

  • reverse (bool, optional) – if True, then sort in descending order. Default: False

Returns

Key function to use for splitting: \(f(\text{item}) = \text{bucket\_id}\)

Return type

func

Example

>>> lengths = [288, 258, 156, 47, 112, 99, 13]
>>> num_buckets = 4
>>> split_func = buckets_of_even_size(lengths, num_buckets)
>>> [split_func(i) for i in range(len(lengths))]
[3, 2, 2, 0, 1, 1, 0]

audtorch.transforms

The transforms can be provided to audtorch.datasets as an argument and are applied to the data before it is returned.

Note

All of the transforms currently work only with numpy.ndarray inputs, not torch.Tensor.

Compose

class audtorch.transforms.Compose(transforms, *, fix_randomization=False)

Compose several transforms together.

Parameters
  • transforms (list of object) – list of transforms to compose

  • fix_randomization (bool, optional) – controls randomization of underlying transforms. Default: False

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Compose([Crop(-1), Pad(1)])
>>> print(t)
Compose(
    Crop(idx=-1, axis=-1)
    Pad(padding=1, value=0, axis=-1)
)
>>> t(a)
array([[0, 2, 0],
       [0, 4, 0]])

Crop

class audtorch.transforms.Crop(idx, *, axis=-1)

Crop along an axis.

  • idx controls the index for cropping

  • axis controls axis of cropping

Parameters
  • idx (int or tuple) – first (and last) index to return

  • axis (int, optional) – axis along which to crop. Default: -1

Note

Indexing from the end with -1, -2, … is allowed. But you cannot use -1 in the second part of the tuple to specify the last entry. Instead you have to write (-2, signal.shape[axis]) to get the last two entries of axis, or simply -1 if you only want to get the last entry.

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to crop and \(N_\text{out}\) is the output length, which is \(1\) for an integer as idx and \(\text{idx[1]} - \text{idx[0]}\) for a tuple with positive entries as idx. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Crop(1, axis=1)
>>> print(t)
Crop(idx=1, axis=1)
>>> t(a)
array([[2],
       [4]])
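As described in the note above, selecting the last entries of an axis with a tuple requires the explicit upper bound signal.shape[axis] instead of -1. A short sketch of that case:

import numpy as np
from audtorch.transforms import Crop

a = np.array([[1, 2, 3], [4, 5, 6]])
# last two entries along the last axis: (-2, a.shape[-1]), not (-2, -1)
t = Crop((-2, a.shape[-1]))
t(a)  # array([[2, 3], [5, 6]])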

RandomCrop

class audtorch.transforms.RandomCrop(size, *, method='pad', axis=-1, fix_randomization=False)

Random crop of specified width along an axis.

If the signal is shorter than the desired size, it is expanded first by one of these methods:

  • 'pad' expand the signal by adding trailing zeros

  • 'replicate' first replicate the signal so that it matches or exceeds the specified size

  • size controls the size of output signal

  • method holds expansion method

  • axis controls axis of cropping

  • fix_randomization controls the randomness

Parameters
  • size (int) – desired output width in samples

  • method (str, optional) – expansion method. Default: pad

  • axis (int, optional) – axis along which to crop. Default: -1

  • fix_randomization (bool, optional) – fix random selection between different calls of transform. Default: False

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to crop and \(N_\text{out}\) is the output length as given by size. \(*\) can be any additional number of dimensions.

Example

>>> random.seed(0)
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> t = RandomCrop(2)
>>> print(t)
RandomCrop(size=2, method=pad, axis=-1)
>>> t(a)
array([[2, 3],
       [6, 7]])
static random_index(input_size, output_size)

Random index for crop.

Parameters
  • input_size (int) – input signal size

  • output_size (int) – expected output size

Returns

random index for cropping

Return type

tuple

Pad

class audtorch.transforms.Pad(padding, *, value=0, axis=-1)

Pad along an axis.

If padding is an integer it pads equally on the left and right of the signal. If padding is a tuple with two entries it uses the first for the left side and the second for the right side.

  • padding controls the padding to be applied

  • value controls the value used for padding

  • axis controls the axis of padding

Parameters
  • padding (int or tuple) – padding to apply on the left and right

  • value (float, optional) – value to pad with. Default: 0

  • axis (int, optional) – axis along which to pad. Default: -1

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to pad and \(N_\text{out} = N_\text{in} + \sum \text{padding}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Pad((0, 1))
>>> print(t)
Pad(padding=(0, 1), value=0, axis=-1)
>>> t(a)
array([[1, 2, 0],
       [3, 4, 0]])

RandomPad

class audtorch.transforms.RandomPad(padding, *, value=0, axis=-1, fix_randomization=False)

Random pad along an axis.

It splits the padding value randomly between the left and right of the signal along the specified axis.

  • padding controls the size of padding to be applied

  • value controls the value used for padding

  • axis controls the axis of padding

  • fix_randomization controls the randomness

Parameters
  • padding (int) – padding to apply randomly split on the left and right

  • value (float, optional) – value to pad with. Default: 0

  • axis (int, optional) – axis along which to pad. Default: -1

  • fix_randomization (bool, optional) – fix random selection between different calls of transform. Default: False

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to pad and \(N_\text{out} = N_\text{in} + \sum \text{padding}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> random.seed(0)
>>> a = np.array([[1, 2], [3, 4]])
>>> t = RandomPad(1)
>>> print(t)
RandomPad(padding=1, value=0, axis=-1)
>>> t(a)
array([[0, 1, 2],
       [0, 3, 4]])
static random_split(number)

Split a number randomly into two numbers which sum up to the original number.

Parameters

number (int) – input number to be split

Returns

randomly split numbers

Return type

tuple
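A short usage sketch of this helper; that exactly two values are returned, and their order (left, right), is assumed from the description above:

from audtorch.transforms import RandomPad

left, right = RandomPad.random_split(4)
assert left + right == 4  # the two parts always sum to the input number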

Replicate

class audtorch.transforms.Replicate(repetitions, *, axis=-1)

Replicate along an axis.

  • repetitions controls number of signal replications

  • axis controls the axis of replication

Parameters
  • repetitions (int or tuple) – number of times to replicate signal

  • axis (int, optional) – axis along which to replicate. Default: -1

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to replicate and \(N_\text{out} = N_\text{in} \cdot \text{repetitions}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2, 3]])
>>> t = Replicate(3)
>>> print(t)
Replicate(repetitions=3, axis=-1)
>>> t(a)
array([[1, 2, 3, 1, 2, 3, 1, 2, 3]])

RandomReplicate

class audtorch.transforms.RandomReplicate(*, max_repetitions=100, axis=-1, fix_randomization=False)

Replicate by a random number of times along an axis.

  • repetitions holds number of times to replicate signal

  • axis controls the axis of replication

  • fix_randomization controls the randomness

Parameters
  • max_repetitions (int, optional) – controls the maximum number of times a signal is allowed to be replicated. Default: 100

  • axis (int, optional) – axis along which to replicate. Default: -1

  • fix_randomization (bool, optional) – fix random selection between different calls of transform. Default: False

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to replicate and \(N_\text{out} = N_\text{in} \cdot \text{repetitions}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> random.seed(0)
>>> a = np.array([1, 2, 3])
>>> t = RandomReplicate(max_repetitions=3)
>>> print(t)
RandomReplicate(max_repetitions=3, repetitions=None, axis=-1)
>>> t(a)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])

Expand

class audtorch.transforms.Expand(size, *, method='pad', axis=-1)

Expand signal.

Ensures that the signal matches the desired output size by padding or replicating it.

  • size controls the size of output signal

  • method controls whether to replicate signal or pad it

  • axis controls axis of expansion

The expansion is done by one of these methods:

  • 'pad' expand the signal by adding trailing zeros

  • 'replicate' replicate the signal to match the specified size. If result exceeds specified size after replication, the signal will then be cropped

Parameters
  • size (int) – desired length of output signal in samples

  • method (str, optional) – expansion method. Default: pad

  • axis (int, optional) – axis along which to expand. Default: -1

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to expand and \(N_\text{out}\) is the output length as given by size. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2, 3]])
>>> t = Expand(6)
>>> print(t)
Expand(size=6, method=pad, axis=-1)
>>> t(a)
array([[1, 2, 3, 0, 0, 0]])

RandomMask

class audtorch.transforms.RandomMask(coverage, max_width, value, axis)

Randomly masks signal along axis.

The signal is masked by multiple blocks (i.e. runs of consecutive units) whose sizes are uniformly sampled given an upper limit on the block size. The algorithm for a single block is as follows:

  1. \(\text{width} \sim U[0, {\text{maximum\_width}}]\)

  2. \(\text{start} \sim U[0, {\text{signal\_size}} - \text{width})\)

The number of blocks is approximated by the specified coverage of the masking and the average size of a block.

  • coverage controls how large the proportion of masking is relative to the signal size

  • max_width controls the maximum size of a masked block

  • value controls the value to mask the signal with

  • axis controls the axis to mask the signal along

Parameters
  • coverage (float) – proportion of signal to mask

  • max_width (int) – maximum block size. The unit depends on the signal and axis. See MaskSpectrogramTime and MaskSpectrogramFrequency

  • value (float) – mask value

  • axis (int) – axis to mask signal along

Example

>>> a = torch.empty((1, 4, 10)).uniform_(1, 2)
>>> t = RandomMask(0.1, max_width=1, value=0, axis=2)
>>> print(t)
RandomMask(coverage=0.1, max_width=1, value=0, axis=2)
>>> len((t(a) == 0).nonzero())  # number of 0 elements
4
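The relation to the functional interface (see audtorch.transforms.functional.mask below) can be sketched as follows. The block-count formula, coverage times the signal size divided by the average block width max_width / 2, is only a plausible reading of the description above, not a documented guarantee:

import torch
from audtorch.transforms.functional import mask

signal = torch.empty((1, 4, 10)).uniform_(1, 2)
coverage, max_width, axis = 0.1, 2, 2

# approximate the number of blocks from the requested coverage
signal_size = signal.shape[axis]
num_blocks = max(1, round(coverage * signal_size / (max_width / 2)))
masked = mask(signal, num_blocks, max_width, value=0.0, axis=axis)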

MaskSpectrogramTime

class audtorch.transforms.MaskSpectrogramTime(coverage, *, max_width=11, value=0)

Randomly masks spectrogram along time axis.

See RandomMask for more details.

Note

The time axis is derived from Spectrogram’s output shape.

Parameters
  • coverage (float) – proportion of signal to mask

  • max_width (int) – maximum block size in number of samples. The default value corresponds to a time span of 0.1 seconds of a signal with sr=16000 and stft-specifications of window_size=320 and hop_size=160. Default: 11

  • value (float) – mask value

Example

>>> from librosa.display import specshow  
>>> import matplotlib.pyplot as plt  
>>> a = torch.empty(65000).uniform_(-1, 1)
>>> t = Compose([Spectrogram(320, 160), MaskSpectrogramTime(0.1)])
>>> magnitude = t(a).squeeze().numpy()
>>> specshow(np.log10(np.abs(magnitude) + 1e-4)) 
>>> plt.show()  
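The arithmetic behind the default max_width=11 mentioned above, under the stated settings of sr=16000 and hop_size=160:

# 0.1 s at 16 kHz spans 0.1 * 16000 = 1600 samples;
# with hop_size=160 this corresponds to 1600 / 160 = 10 spectrogram frames,
# so a block of up to max_width=11 frames covers roughly 0.1 s.
frames = int(0.1 * 16000 / 160)  # 10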

MaskSpectrogramFrequency

class audtorch.transforms.MaskSpectrogramFrequency(coverage, *, max_width=8, value=0)

Randomly masks spectrogram along frequency axis.

See RandomMask for more details.

Note

The frequency axis is derived from Spectrogram’s output shape.

Parameters
  • coverage (float) – proportion of signal to mask

  • max_width (int, optional) – maximum block size in number of frequency bins. The default value corresponds to approximately 5% of all frequency bins with stft-specifications of window_size=320 and hop_size=160. Default: 8

  • value (float) – mask value

Example

>>> from librosa.display import specshow  
>>> import matplotlib.pyplot as plt  
>>> a = torch.empty(65000).uniform_(-1, 1)
>>> t = Compose([Spectrogram(320, 160), MaskSpectrogramFrequency(0.1)])
>>> magnitude = t(a).squeeze().numpy()
>>> specshow(np.log10(np.abs(magnitude) + 1e-4)) 
>>> plt.show()  

Downmix

class audtorch.transforms.Downmix(channels, *, method='mean', axis=-2)

Downmix to the provided number of channels.

The downmix is done by one of these methods:

  • 'mean' replace last desired channel by mean across itself and all remaining channels

  • 'crop' drop all remaining channels

  • channels controls the number of desired channels

  • method controls downmixing method

  • axis controls axis of downmix

Parameters
  • channels (int) – number of desired channels

  • method (str, optional) – downmix method. Default: ‘mean’

  • axis (int, optional) – axis to downmix. Default: -2

Shape:
  • Input: \((*, C_\text{in}, *)\)

  • Output: \((*, C_\text{out}, *)\), where \(C_\text{in}\) is the number of input channels and \(C_\text{out}\) is the number of output channels as given by channels. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Downmix(1, axis=0)
>>> print(t)
Downmix(channels=1, method=mean, axis=0)
>>> t(a)
array([[2, 3]])

Upmix

class audtorch.transforms.Upmix(channels, *, method='mean', axis=-2)

Upmix to the provided number of channels.

The upmix is achieved by adding the same signal in the additional channels. This signal is calculated by one of the following methods:

  • 'mean' mean across all input channels

  • 'zero' zeros

  • 'repeat' last input channel

  • channels controls the number of desired channels

  • method controls upmixing method

  • axis controls axis of upmix

Parameters
  • channels (int) – number of desired channels

  • method (str, optional) – upmix method. Default: ‘mean’

  • axis (int, optional) – axis to upmix. Default: -2

Shape:
  • Input: \((*, C_\text{in}, *)\)

  • Output: \((*, C_\text{out}, *)\), where \(C_\text{in}\) is the number of input channels and \(C_\text{out}\) is the number of output channels as given by channels. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Upmix(3, axis=0)
>>> print(t)
Upmix(channels=3, method=mean, axis=0)
>>> t(a)
array([[1., 2.],
       [3., 4.],
       [2., 3.]])

Remix

class audtorch.transforms.Remix(channels, *, method='mean', axis=-2)

Remix to the provided number of channels.

The remix is achieved by repeating the mean of all other channels or by replacing the last desired channel by the mean across all channels.

It is internally achieved by running Upmix or Downmix with method mean.

  • channels controls the number of desired channels

  • axis controls axis of remix

Parameters
  • channels (int) – number of desired channels

  • axis (int, optional) – axis to remix. Default: -2

Shape:
  • Input: \((*, C_\text{in}, *)\)

  • Output: \((*, C_\text{out}, *)\), where \(C_\text{in}\) is the number of input channels and \(C_\text{out}\) is the number of output channels as given by channels. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Remix(3, axis=0)
>>> print(t)
Remix(channels=3, axis=0)
>>> t(a)
array([[1., 2.],
       [3., 4.],
       [2., 3.]])

Normalize

class audtorch.transforms.Normalize(*, axis=-1)

Normalize signal.

Ensure the maximum of the absolute value of the signal is 1.

  • axis controls axis for normalization

Parameters

axis (int, optional) – axis for normalization. Default: -1

Shape:
  • Input: \((*)\)

  • Output: \((*)\), where \(*\) can be any number of dimensions.

Example

>>> a = np.array([1, 2, 3, 4])
>>> t = Normalize()
>>> print(t)
Normalize(axis=-1)
>>> t(a)
array([0.25, 0.5 , 0.75, 1.  ])

Standardize

class audtorch.transforms.Standardize(*, mean=True, std=True, axis=-1)

Standardize signal.

Ensure the signal has a mean value of 0 and a variance of 1.

  • mean controls whether mean centering will be applied

  • std controls whether standard deviation normalization will be applied

  • axis controls axis for standardization

Parameters
  • mean (bool, optional) – apply mean centering. Default: True

  • std (bool, optional) – normalize by standard deviation. Default: True

  • axis (int, optional) – standardize only along the given axis. Default: -1

Shape:
  • Input: \((*)\)

  • Output: \((*)\), where \(*\) can be any number of dimensions.

Example

>>> a = np.array([1, 2, 3, 4])
>>> t = Standardize()
>>> print(t)
Standardize(axis=-1, mean=True, std=True)
>>> t(a)
array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])

Resample

class audtorch.transforms.Resample(input_sampling_rate, output_sampling_rate, *, method='kaiser_best', axis=-1)

Resample to new sampling rate.

The signal is resampled by one of the following methods.

  • 'kaiser_best' as implemented by resampy

  • 'kaiser_fast' as implemented by resampy

  • 'scipy' uses scipy for resampling

  • input_sampling_rate controls input sample rate in Hz

  • output_sampling_rate controls output sample rate in Hz

  • method controls the resample method

  • axis controls axis for resampling

Parameters
  • input_sampling_rate (int) – input sample rate in Hz

  • output_sampling_rate (int) – output sample rate in Hz

  • method (str, optional) – resample method. Default: kaiser_best

  • axis (int, optional) – axis for resampling. Default: -1

Note

If the default method kaiser_best is too slow for your purposes, you should try scipy instead. scipy is the fastest method, but might crash for very long signals.

Shape:
  • Input: \((*)\)

  • Output: \((*)\), where \(*\) can be any number of dimensions.

Example

>>> a = np.array([1, 2, 3, 4])
>>> t = Resample(4, 2)
>>> print(t)
Resample(input_sampling_rate=4, output_sampling_rate=2, method=kaiser_best, axis=-1)
>>> t(a)
array([0, 2])

Spectrogram

class audtorch.transforms.Spectrogram(window_size, hop_size, *, fft_size=None, window='hann', axis=-1)

Spectrogram of an audio signal.

The spectrogram is calculated by librosa and its magnitude is returned as a real-valued matrix.

  • window_size controls FFT window size in samples

  • hop_size controls STFT window hop size in samples

  • fft_size controls number of frequency bins in STFT

  • window controls window function of spectrogram computation

  • axis controls axis of spectrogram computation

  • phase holds the phase of the spectrogram

Parameters
  • window_size (int) – size of STFT window in samples

  • hop_size (int) – size of STFT window hop in samples

  • fft_size (int, optional) – number of frequency bins in STFT. If None, then it defaults to window_size. Default: None

  • window (str, tuple, number, function, or numpy.ndarray, optional) – type of STFT window. Default: hann

  • axis (int, optional) – axis of STFT calculation. Default: -1

Shape:
  • Input: \((*, N_\text{in}, *)\)

  • Output: \((*, N_f, N_t, *)\), where \(N_\text{in}\) is the number of input samples and \(N_f = {\text{window\_size} \over 2} + 1\) is the number of output samples along the frequency axis of the spectrogram, and \(N_t = \lceil {1 \over \text{hop\_size}} (N_\text{in} + {\text{window\_size} \over 2}) \rceil\) is the number of output samples along the time axis of the spectrogram. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([1., 2., 3., 4.])
>>> t = Spectrogram(2, 2)
>>> print(t)
Spectrogram(window_size=2, hop_size=2, axis=-1)
>>> t(a)
array([[1., 3., 3.],
       [1., 3., 3.]])
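The output shape can be checked against the formula above. A small sketch, assuming the formula holds as stated for a one-dimensional input:

import numpy as np
from audtorch.transforms import Spectrogram

window_size, hop_size = 320, 160
signal = np.random.uniform(-1, 1, 16000)
magnitude = Spectrogram(window_size, hop_size)(signal)

n_f = window_size // 2 + 1                                       # 161
n_t = int(np.ceil((len(signal) + window_size // 2) / hop_size))  # 101
assert magnitude.shape == (n_f, n_t)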

Log

class audtorch.transforms.Log(*, magnitude_boost=1e-07)

Logarithmic transform of an input signal.

  • magnitude_boost controls the non-negative value added to the magnitude of the signal before applying the logarithm

Parameters

magnitude_boost (float, optional) – positive value added to the magnitude of the signal before applying the logarithm. Default: 1e-7

Shape:
  • Input: \((*)\)

  • Output: \((*)\), where \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([1., 2., 3., 4.])
>>> spect = Spectrogram(window_size=2, hop_size=2)
>>> t = Log()
>>> print(t)
Log(magnitude_boost=1e-07)
>>> np.set_printoptions(precision=5)
>>> t(spect(a))
array([[1.00000e-07, 1.09861e+00, 1.09861e+00],
       [1.00000e-07, 1.09861e+00, 1.09861e+00]])

RandomAdditiveMix

class audtorch.transforms.RandomAdditiveMix(dataset, *, ratios=[0, 15, 30], normalize=False, expand_method='pad', crop_method='random', percentage_silence=0, time_axis=-1, channel_axis=-2, fix_randomization=False)

Mix two signals additively by a randomly picked ratio.

Randomly pick a signal from an augmentation data set and mix it with the actual signal by a signal-to-noise ratio in dB randomly selected from a list of possible ratios.

The signal from the augmentation data set is expanded, cropped, or has its number of channels adjusted by a downmix or upmix using Remix if necessary.

The signal can be expanded by:

  • 'multiple' loading multiple files from the augmentation data set and concatenating them along the time axis

  • 'pad' expand the signal by adding trailing zeros

  • 'replicate' replicate the signal to match the specified size. If result exceeds specified size after replication, the signal will then be cropped

The signal can be cropped by:

  • 'start' crop signal from the beginning of the file all the way to the necessary length

  • 'random' starts at a random offset from the beginning of the file

  • dataset controls the data set used for augmentation

  • ratio controls the ratio in dB between mixed signals

  • ratios controls the ratios to be randomly picked from

  • normalize controls if the mixed signal is normalized

  • expand_method controls if the signal from the augmentation data set is automatically expanded according to an expansion rule. Default: pad

  • crop_method controls how the signal is cropped. This is only relevant if the augmentation signal is longer than the input one, or if expand_method is set to multiple. Default: random

  • percentage_silence controls the percentage of the input data that will be mixed with silence. Should be between 0 and 1. Default: 0

  • time_axis controls time axis for automatic signal adjustment

  • channel_axis controls channel axis for automatic signal adjustment

  • fix_randomization controls the randomness of the ratio selection

Note

fix_randomization covers only the selection of the ratio. The selection of a signal from the augmentation data set and its signal length adjustment will always be random.

Parameters
  • dataset (torch.utils.data.Dataset) – data set for augmentation

  • ratios (list of int, optional) – mix ratios in dB to randomly pick from (e.g. SNRs). Default: [0, 15, 30]

  • normalize (bool, optional) – normalize mixture. Default: False

  • expand_method (str, optional) – controls how the length of the signal from the augmentation data set is adjusted to the original signal. Default: pad

  • crop_method (str, optional) – controls the crop transform that will be called on the mix signal if it is longer than the input signal. Default: random

  • percentage_silence (float, optional) – controls the percentage of input data that should be augmented with silence. Default: 0

  • time_axis (int, optional) – length axis of both data sets. Default: -1

  • channel_axis (int, optional) – channels axis of both data sets. Default: -2

  • fix_randomization (bool, optional) – freeze random selection between different calls of transform. Default: False

Shape:
  • Input: \((*, C, N, *)\)

  • Output: \((*, C, N, *)\), where \(C\) is the number of channels and \(N\) is the number of samples. They don’t have to be placed in the order shown here, but the order is preserved during transformation. \(*\) can be any additional number of dimensions.

Example

>>> from audtorch import datasets
>>> np.random.seed(0)
>>> a = np.array([[1, 2], [3, 4]])
>>> noise = datasets.WhiteNoise(duration=1, sampling_rate=2)
>>> t = RandomAdditiveMix(noise, ratios=[3], expand_method='pad')
>>> print(t)
RandomAdditiveMix(dataset=WhiteNoise, ratios=[3], ratio=None, percentage_silence=0, expand_method=pad, crop_method=random, time_axis=-1, channel_axis=-2)
>>> np.set_printoptions(precision=8)
>>> t(a)
array([[3.67392992, 2.60655362],
       [5.67392992, 4.60655362]])

RandomConvolutionalMix

class audtorch.transforms.RandomConvolutionalMix(dataset, *, normalize=False, axis=-1)

Convolve the signal with an augmentation data set.

Randomly pick an impulse response from an augmentation data set and convolve it with the signal. The impulse responses have to be one-dimensional.

  • dataset controls the data set used for augmentation

  • normalize controls normalization of the convolved signal

  • axis controls axis of convolution

Parameters
  • dataset (torch.utils.data.Dataset) – data set for augmentation

  • normalize (bool, optional) – normalize mixture. Default: False

  • axis (int, optional) – axis of convolution. Default: -1

Shape:
  • Input: \((*, N, *)\)

  • Output: \((*, N + M - 1, *)\), where \(N\) is the number of samples of the signal and \(M\) the number of samples of the impulse response. \(*\) can be any additional number of dimensions.

Example

>>> from audtorch import datasets
>>> np.random.seed(0)
>>> a = np.array([[1, 2], [3, 4]])
>>> noise = datasets.WhiteNoise(duration=1, sampling_rate=2, transform=np.squeeze)
>>> t = RandomConvolutionalMix(noise, normalize=True)
>>> print(t)
RandomConvolutionalMix(dataset=WhiteNoise, axis=-1, normalize=True)
>>> np.set_printoptions(precision=8)
>>> t(a)
array([[0.21365151, 0.47576767, 0.09692931],
       [0.64095452, 1.        , 0.19385863]])

audtorch.transforms.functional

The goal of the transform functionals is to provide functions that work independently of the dimensions of the input signal and can easily be used to create the actual transforms.

Note

All of the transforms currently work only with numpy.ndarray inputs, not torch.Tensor.
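As an illustration of how a transform class can be built on top of one of these functionals, here is a minimal sketch of a crop-like transform wrapping audtorch.transforms.functional.crop; the class FixedCrop itself is not part of the package:

import numpy as np
from audtorch.transforms.functional import crop

class FixedCrop(object):
    """Illustrative transform that always crops the same index range."""
    def __init__(self, idx, axis=-1):
        self.idx = idx
        self.axis = axis

    def __call__(self, signal):
        return crop(signal, self.idx, axis=self.axis)

t = FixedCrop((0, 2))
t(np.array([1, 2, 3, 4]))  # array([1, 2])

An instance of such a class can then be passed to a data set via its transform argument like any built-in transform.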

crop

audtorch.transforms.functional.crop(signal, idx, *, axis=-1)

Crop signal along an axis.

Parameters
  • signal (numpy.ndarray) – audio signal

  • idx (int or tuple) – first (and last) index to return

  • axis (int, optional) – axis along which to crop. Default: -1

Note

Indexing from the end with -1, -2, … is allowed. But you cannot use -1 in the second part of the tuple to specify the last entry. Instead you have to write (-2, signal.shape[axis]) to get the last two entries of axis, or simply -1 if you only want to get the last entry.

Returns

cropped signal

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> crop(a, 1)
array([[2],
       [4]])

pad

audtorch.transforms.functional.pad(signal, padding, *, value=0, axis=-1)

Pad signal along an axis.

If padding is an integer it pads equally on the left and right of the signal. If padding is a tuple with two entries it uses the first for the left side and the second for the right side.

Parameters
  • signal (numpy.ndarray) – audio signal

  • padding (int or tuple) – padding to apply on the left and right

  • value (float, optional) – value to pad with. Default: 0

  • axis (int, optional) – axis along which to pad. Default: -1

Returns

padded signal

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> pad(a, (0, 1))
array([[1, 2, 0],
       [3, 4, 0]])

replicate

audtorch.transforms.functional.replicate(signal, repetitions, *, axis=-1)

Replicate signal along an axis.

Parameters
  • signal (numpy.ndarray) – audio signal

  • repetitions (int) – number of times to replicate signal

  • axis (int, optional) – axis along which to replicate. Default: -1

Returns

replicated signal

Return type

numpy.ndarray

Example

>>> a = np.array([1, 2, 3])
>>> replicate(a, 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])

downmix

audtorch.transforms.functional.downmix(signal, channels, *, method='mean', axis=-2)

Downmix signal to the provided number of channels.

The downmix is done by one of these methods:

  • 'mean' replace last desired channel by mean across itself and all remaining channels

  • 'crop' drop all remaining channels

Parameters
  • signal (numpy.ndarray) – audio signal

  • channels (int) – number of desired channels

  • method (str, optional) – downmix method. Default: ‘mean’

  • axis (int, optional) – axis to downmix. Default: -2

Returns

reshaped signal

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> downmix(a, 1)
array([[2, 3]])

upmix

audtorch.transforms.functional.upmix(signal, channels, *, method='mean', axis=-2)

Upmix signal to the provided number of channels.

The upmix is achieved by adding the same signal in the additional channels. This signal is calculated by one of the following methods:

  • 'mean' mean across all input channels

  • 'zero' zeros

  • 'repeat' last input channel

Parameters
  • signal (numpy.ndarray) – audio signal

  • channels (int) – number of desired channels

  • method (str, optional) – upmix method. Default: ‘mean’

  • axis (int, optional) – axis to upmix. Default: -2

Returns

reshaped signal

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> upmix(a, 3)
array([[1., 2.],
       [3., 4.],
       [2., 3.]])

additive_mix

audtorch.transforms.functional.additive_mix(signal1, signal2, ratio)

Mix two signals additively by given ratio.

If the power of one of the signals is below 1e-7, the signals are added without adjusting the signal-to-noise ratio.

Parameters
  • signal1 (numpy.ndarray) – audio signal

  • signal2 (numpy.ndarray) – audio signal

  • ratio (int) – ratio in dB of the second signal compared to the first one

Returns

mixture

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> additive_mix(a, a, -10 * np.log10(0.5 ** 2))
array([[1.5, 3. ],
       [4.5, 6. ]])
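The ratio in the example above is chosen so that the second signal is added at half amplitude. For two equal-power signals the scale factor on the second signal follows \(10^{-\text{ratio}/20}\); this relation is inferred from the example and the SNR interpretation above, not stated explicitly by the API:

import numpy as np

ratio = -10 * np.log10(0.5 ** 2)  # ~6.02 dB, as in the example above
scale = 10 ** (-ratio / 20)       # 0.5: the factor applied to the second signal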

mask

audtorch.transforms.functional.mask(signal, num_blocks, max_width, *, value=0.0, axis=-1)

Randomly mask signal along axis.

Parameters
  • signal (torch.Tensor) – audio signal

  • num_blocks (int) – number of mask blocks

  • max_width (int) – maximum size of block

  • value (float, optional) – mask value. Default: 0.

  • axis (int, optional) – axis along which to mask. Default: -1

Returns

masked signal

Return type

torch.Tensor

normalize

audtorch.transforms.functional.normalize(signal, *, axis=None)

Normalize signal.

Ensure the maximum of the absolute value of the signal is 1.

Note

The signal will never be divided by a number smaller than 1e-7, meaning that signals which are nearly silent are only slightly amplified.

Parameters
  • signal (numpy.ndarray) – audio signal

  • axis (int, optional) – normalize only along the given axis. Default: None

Returns

normalized signal

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> normalize(a)
array([[0.25, 0.5 ],
       [0.75, 1.  ]])

standardize

audtorch.transforms.functional.standardize(signal, *, mean=True, std=True, axis=None)

Standardize signal.

Ensure the signal has a mean value of 0 and a variance of 1.

Note

The signal will never be divided by a variance smaller than 1e-7.

Parameters
  • signal (numpy.ndarray) – audio signal

  • mean (bool, optional) – apply mean centering. Default: True

  • std (bool, optional) – normalize by standard deviation. Default: True

  • axis (int, optional) – standardize only along the given axis. Default: None

Returns

standardized signal

Return type

numpy.ndarray

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> standardize(a)
array([[-1.34164079, -0.4472136 ],
       [ 0.4472136 ,  1.34164079]])

stft

audtorch.transforms.functional.stft(signal, window_size, hop_size, *, fft_size=None, window='hann', axis=-1)

Short-time Fourier transform.

The Short-time Fourier transform (STFT) is calculated using librosa. It returns an array with the same shape as the input array, except that the axis chosen for the STFT calculation is replaced by the two new axes of the spectrogram (frequency and time).

If fft_size is None, the FFT size is set identical to window_size.

Parameters
  • signal (numpy.ndarray) – audio signal

  • window_size (int) – size of STFT window in samples

  • hop_size (int) – size of STFT window hop in samples

  • window (str, tuple, number, function, or numpy.ndarray, optional) – type of STFT window. Default: hann

  • axis (int, optional) – axis of STFT calculation. Default: -1

Returns

complex spectrogram with the shape of its last two dimensions as (window_size/2 + 1, np.ceil((len(signal) + window_size/2) / hop_size))

Return type

numpy.ndarray

Example

>>> a = np.array([1., 2., 3., 4.])
>>> stft(a, 2, 1)
array([[ 1.+0.j,  2.+0.j,  3.+0.j,  4.+0.j,  3.+0.j],
       [-1.+0.j, -2.+0.j, -3.+0.j, -4.+0.j, -3.+0.j]])

istft

audtorch.transforms.functional.istft(spectrogram, window_size, hop_size, *, window='hann', axis=-2)

Inverse Short-time Fourier transform.

The inverse Short-time Fourier transform (iSTFT) is calculated by using librosa. It handles multi-dimensional inputs, but assumes that the two spectrogram axis are beside each other, starting with the axis corresponding to frequency bins. The returned audio signal has one dimension less than the spectrogram.

Parameters
  • spectrogram (numpy.ndarray) – complex spectrogram

  • window_size (int) – size of STFT window in samples

  • hop_size (int) – size of STFT window hop in samples

  • window (str, tuple, number, function, or numpy.ndarray, optional) – type of STFT window. Default: hann

  • axis (int, optional) – axis of frequency bins of the spectrogram. Time bins are expected at axis + 1. Default: -2

Returns

signal with shape (number_of_time_bins * hop_size - window_size/2)

Return type

numpy.ndarray

Example

>>> a = np.array([1., 2., 3., 4.])
>>> D = stft(a, 4, 1)
>>> istft(D, 4, 1)
array([1., 2., 3., 4.])

audtorch.utils

Utility functions.

flatten_list

audtorch.utils.flatten_list(nested_list)

Flatten an arbitrarily nested list.

Implemented without recursion to avoid stack overflows. Returns a new list; the original list is unchanged.

Parameters

nested_list (list) – nested list

Returns

flattened list

Return type

list

Example

>>> flatten_list([1, 2, 3, [4], [], [[[[[[[[[5]]]]]]]]]])
[1, 2, 3, 4, 5]
>>> flatten_list([[1, 2], 3])
[1, 2, 3]
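One possible iterative (stack-based) implementation, shown only to illustrate the non-recursive approach mentioned above; it is not necessarily the code used in the package:

def flatten_list_sketch(nested_list):
    """Flatten a nested list without recursion, using an explicit stack."""
    result = []
    stack = [list(nested_list)]  # work on copies so the input stays unchanged
    while stack:
        current = stack[-1]
        if not current:
            stack.pop()
            continue
        item = current.pop(0)
        if isinstance(item, list):
            stack.append(list(item))
        else:
            result.append(item)
    return result

assert flatten_list_sketch([1, 2, [3, [4]], []]) == [1, 2, 3, 4]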

to_tuple

audtorch.utils.to_tuple(input, *, tuple_len=2)

Convert to tuple of given length.

This utility function is used to convert single-value arguments to tuples of appropriate length, e.g. for multi-dimensional inputs where each dimension requires the same value. If the argument is already an iterable it is returned as a tuple if its length matches the desired tuple length. Otherwise a ValueError is raised.

Parameters
  • input (non-iterable or iterable) – argument to be converted to tuple

  • tuple_len (int) – required length of argument tuple. Default: 2

Returns

tuple of desired length

Return type

tuple

Example

>>> to_tuple(2)
(2, 2)

energy

audtorch.utils.energy(signal)

Energy of input signal.

\[E = \sum_n |x_n|^2 \]
Parameters

signal (numpy.ndarray) – signal

Returns

energy of signal

Return type

float

Example

>>> a = np.array([[2, 2]])
>>> energy(a)
8

power

audtorch.utils.power(signal)

Power of input signal.

\[P = {1 \over N} \sum_n |x_n|^2 \]
Parameters

signal (numpy.ndarray) – signal

Returns

power of signal

Return type

float

Example

>>> a = np.array([[2, 2]])
>>> power(a)
4.0

run_worker_threads

audtorch.utils.run_worker_threads(num_workers, task_fun, params, *, progress_bar=False)

Run parallel tasks using worker threads.

Parameters
  • num_workers (int) – number of worker threads

  • task_fun (Callable) – task function with one or more parameters, e.g. x, y, z, and optionally returning a value

  • params (list of tuples) – list of tuples holding parameters for each task, e.g. [(x1, y1, z1), (x2, y2, z2), …]

  • progress_bar (bool) – show a progress bar. Default: False

Returns

result values in order of params

Return type

list

Example

>>> power = lambda x, n: x ** n
>>> params = [(2, n) for n in range(10)]
>>> run_worker_threads(3, power, params)
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

