audtorch.transforms

Transforms can be passed to audtorch.datasets as an argument; they are applied to the data before it is returned.

Note

Currently, all of the transforms accept only numpy.array inputs, not torch.Tensor.

Compose

class audtorch.transforms.Compose(transforms, *, fix_randomization=False)

Compose several transforms together.

Parameters:
  • transforms (list of object) – list of transforms to compose
  • fix_randomization (bool, optional) – controls randomization of underlying transforms. Default: False

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Compose([Crop(-1), Pad(1)])
>>> print(t)
Compose(
    Crop(idx=-1, axis=-1)
    Pad(padding=1, value=0, axis=-1)
)
>>> t(a)
array([[0, 2, 0],
       [0, 4, 0]])
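The chaining shown above can be pictured with plain NumPy. This is a minimal sketch, not audtorch's implementation; `compose`, `crop_last` and `pad_one` are hypothetical helpers standing in for Compose, Crop(-1) and Pad(1).

```python
import numpy as np


def compose(transforms):
    """Return a callable that applies each transform in order (hypothetical helper)."""
    def apply(signal):
        for transform in transforms:
            signal = transform(signal)
        return signal
    return apply


def crop_last(signal):
    # Stand-in for Crop(-1): keep only the last entry of the last axis.
    return signal[..., -1:]


def pad_one(signal):
    # Stand-in for Pad(1): one zero on each side of the last axis.
    pad_width = [(0, 0)] * (signal.ndim - 1) + [(1, 1)]
    return np.pad(signal, pad_width, mode="constant", constant_values=0)


a = np.array([[1, 2], [3, 4]])
result = compose([crop_last, pad_one])(a)
```

The transforms run left to right, so the crop happens before the padding.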

Crop

class audtorch.transforms.Crop(idx, *, axis=-1)

Crop along an axis.

  • idx controls the index for cropping
  • axis controls axis of cropping
Parameters:
  • idx (int or tuple) – first (and last) index to return
  • axis (int, optional) – axis along which to crop. Default: -1

Note

Indexing from the end with -1, -2, … is allowed. However, you cannot use -1 as the second entry of the tuple to refer to the last entry. Instead, write (-2, signal.shape[axis]) to get the last two entries of axis, or simply pass -1 if you only want the last entry.
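The indexing rule mirrors ordinary half-open slicing. The sketch below uses only NumPy and assumes that a tuple idx behaves like a (start, stop) range; it does not call Crop itself.

```python
import numpy as np

signal = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
axis = -1

# Read the tuple as a half-open range (start, stop): the stop index for
# "up to the end" is the axis length, not -1.
start, stop = -2, signal.shape[axis]
start = start % signal.shape[axis]  # map the negative start to a positive index
last_two = np.take(signal, np.arange(start, stop), axis=axis)

# With -1 as the stop, the final entry is excluded.
not_last_two = signal[..., -2:-1]
```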

Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to crop and \(N_\text{out}\) is the output length, which is \(1\) for an integer as idx and \(\text{idx[1]} - \text{idx[0]}\) for a tuple with positive entries as idx. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Crop(1, axis=1)
>>> print(t)
Crop(idx=1, axis=1)
>>> t(a)
array([[2],
       [4]])

RandomCrop

class audtorch.transforms.RandomCrop(size, *, method='pad', axis=-1, fix_randomization=False)

Random crop of specified width along an axis.

If the signal is shorter than the desired size, it is first expanded by one of these methods:

  • 'pad' expand the signal by adding trailing zeros
  • 'replicate' first replicate the signal so that it matches or exceeds the specified size

  • size controls the size of output signal
  • method holds expansion method
  • axis controls axis of cropping
  • fix_randomization controls the randomness
Parameters:
  • size (int) – desired width of spectrogram in samples
  • method (str, optional) – expansion method. Default: pad
  • axis (int, optional) – axis along which to crop. Default: -1
  • fix_randomization (bool, optional) – fix random selection between different calls of transform. Default: False
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to crop and \(N_\text{out}\) is the output length as given by size. \(*\) can be any additional number of dimensions.

Example

>>> random.seed(0)
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> t = RandomCrop(2)
>>> print(t)
RandomCrop(size=2, method=pad, axis=-1)
>>> t(a)
array([[2, 3],
       [6, 7]])
static random_index(input_size, output_size)

Random index for crop.

Parameters:
  • input_size (int) – input signal size
  • output_size (int) – expected output size
Returns:

random index for cropping

Return type:

tuple
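A random crop index can be drawn as follows. This is a sketch under the assumption that the start position is uniform over all valid offsets and that input_size is at least output_size; the helper below is hypothetical and only shares its name with the static method.

```python
import random


def random_index(input_size, output_size, rng=random):
    """Return a (start, stop) pair covering output_size entries (hypothetical helper)."""
    # Assumes input_size >= output_size; randint is inclusive on both ends.
    start = rng.randint(0, input_size - output_size)
    return start, start + output_size


rng = random.Random(0)
idx = random_index(4, 2, rng)
```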

Pad

class audtorch.transforms.Pad(padding, *, value=0, axis=-1)

Pad along an axis.

If padding is an integer it pads equally on the left and right of the signal. If padding is a tuple with two entries it uses the first for the left side and the second for the right side.

  • padding controls the padding to be applied
  • value controls the value used for padding
  • axis controls the axis of padding
Parameters:
  • padding (int or tuple) – padding to apply on the left and right
  • value (float, optional) – value to pad with. Default: 0
  • axis (int, optional) – axis along which to pad. Default: -1
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to pad and \(N_\text{out} = N_\text{in} + \sum \text{padding}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Pad((0, 1))
>>> print(t)
Pad(padding=(0, 1), value=0, axis=-1)
>>> t(a)
array([[1, 2, 0],
       [3, 4, 0]])
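The difference between an integer and a tuple for padding can be sketched with numpy.pad. `pad_along_axis` is a hypothetical helper for illustration, not the Pad implementation.

```python
import numpy as np


def pad_along_axis(signal, padding, value=0, axis=-1):
    """Pad with a constant along one axis (hypothetical helper)."""
    if isinstance(padding, int):
        padding = (padding, padding)  # same amount on the left and right
    pad_width = [(0, 0)] * signal.ndim  # no padding on the other axes
    pad_width[axis] = tuple(padding)
    return np.pad(signal, pad_width, mode="constant", constant_values=value)


a = np.array([[1, 2], [3, 4]])
both_sides = pad_along_axis(a, 1)
right_only = pad_along_axis(a, (0, 1))
```

In both cases the output length along the axis is the input length plus the sum of the padding.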

RandomPad

class audtorch.transforms.RandomPad(padding, *, value=0, axis=-1, fix_randomization=False)

Random pad along an axis.

It splits the padding value randomly between the left and right of the signal along the specified axis.

  • padding controls the size of padding to be applied
  • value controls the value used for padding
  • axis controls the axis of padding
  • fix_randomization controls the randomness
Parameters:
  • padding (int) – padding to apply randomly split on the left and right
  • value (float, optional) – value to pad with. Default: 0
  • axis (int, optional) – axis along which to pad. Default: -1
  • fix_randomization (bool, optional) – fix random selection between different calls of transform. Default: False
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to pad and \(N_\text{out} = N_\text{in} + \sum \text{padding}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> random.seed(0)
>>> a = np.array([[1, 2], [3, 4]])
>>> t = RandomPad(1)
>>> print(t)
RandomPad(padding=1, value=0, axis=-1)
>>> t(a)
array([[0, 1, 2],
       [0, 3, 4]])
static random_split(number)

Randomly split number into two values that sum to number.

Parameters: number (int) – input number to be split
Returns: randomly split number
Return type: tuple
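Splitting a padding amount into a left and a right part can be sketched as below. The uniform choice of the left part is an assumption, and the function is a hypothetical stand-in rather than the library code.

```python
import random


def random_split(number, rng=random):
    """Split number into two non-negative integers that sum to number."""
    left = rng.randint(0, number)  # inclusive on both ends
    return left, number - left


rng = random.Random(0)
left, right = random_split(5, rng)
```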

Replicate

class audtorch.transforms.Replicate(repetitions, *, axis=-1)

Replicate along an axis.

  • repetitions controls number of signal replications
  • axis controls the axis of replication
Parameters:
  • repetitions (int or tuple) – number of times to replicate signal
  • axis (int, optional) – axis along which to replicate. Default: -1
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to replicate and \(N_\text{out} = N_\text{in} \cdot \text{repetitions}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2, 3]])
>>> t = Replicate(3)
>>> print(t)
Replicate(repetitions=3, axis=-1)
>>> t(a)
array([[1, 2, 3, 1, 2, 3, 1, 2, 3]])
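The same result can be reproduced with numpy.tile. This is an illustrative sketch for an integer repetitions value; `replicate` is a hypothetical helper.

```python
import numpy as np


def replicate(signal, repetitions, axis=-1):
    """Repeat the whole signal along one axis (hypothetical helper)."""
    reps = [1] * signal.ndim  # leave all other axes untouched
    reps[axis] = repetitions
    return np.tile(signal, reps)


a = np.array([[1, 2, 3]])
out = replicate(a, 3)
```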

RandomReplicate

class audtorch.transforms.RandomReplicate(*, max_repetitions=100, axis=-1, fix_randomization=False)

Replicate by a random number of times along an axis.

  • repetitions holds number of times to replicate signal
  • axis controls the axis of replication
  • fix_randomization controls the randomness
Parameters:
  • max_repetitions (int, optional) – controls the maximum number of times a signal is allowed to be replicated. Default: 100
  • axis (int, optional) – axis along which to replicate. Default: -1
  • fix_randomization (bool, optional) – fix random selection between different calls of transform. Default: False
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to replicate and \(N_\text{out} = N_\text{in} \cdot \text{repetitions}\) is the output length. \(*\) can be any additional number of dimensions.

Example

>>> random.seed(0)
>>> a = np.array([1, 2, 3])
>>> t = RandomReplicate(max_repetitions=3)
>>> print(t)
RandomReplicate(max_repetitions=3, repetitions=None, axis=-1)
>>> t(a)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])

Expand

class audtorch.transforms.Expand(size, *, method='pad', axis=-1)

Expand signal.

Ensures that the signal matches the desired output size by padding or replicating it.

  • size controls the size of output signal
  • method controls whether to replicate signal or pad it
  • axis controls axis of expansion

The expansion is done by one of these methods:

  • 'pad' expand the signal by adding trailing zeros
  • 'replicate' replicate the signal to match the specified size. If result exceeds specified size after replication, the signal will then be cropped
Parameters:
  • size (int) – desired length of output signal in samples
  • method (str, optional) – expansion method. Default: pad
  • axis (int, optional) – axis along which to expand. Default: -1
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_\text{out}, *)\), where \(N_\text{in}\) is the input length of the axis to expand and \(N_\text{out}\) is the output length as given by size. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2, 3]])
>>> t = Expand(6)
>>> print(t)
Expand(size=6, method=pad, axis=-1)
>>> t(a)
array([[1, 2, 3, 0, 0, 0]])
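The two expansion methods can be compared in a small NumPy sketch. `expand` below is a hypothetical helper written from the description above; it is not the Expand class, and returning longer signals unchanged is an assumption.

```python
import numpy as np


def expand(signal, size, method="pad", axis=-1):
    """Expand a signal to size along one axis (hypothetical helper)."""
    n_in = signal.shape[axis]
    if n_in >= size:
        return signal  # nothing to expand in this sketch
    if method == "pad":
        pad_width = [(0, 0)] * signal.ndim
        pad_width[axis] = (0, size - n_in)  # trailing zeros only
        return np.pad(signal, pad_width, mode="constant", constant_values=0)
    if method == "replicate":
        reps = [1] * signal.ndim
        reps[axis] = -(-size // n_in)  # ceiling division
        tiled = np.tile(signal, reps)
        # Crop the overshoot so the result has exactly size entries.
        return np.take(tiled, np.arange(size), axis=axis)
    raise ValueError(f"unknown method: {method}")


a = np.array([[1, 2, 3]])
padded = expand(a, 6)
replicated = expand(a, 5, method="replicate")
```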

RandomMask

class audtorch.transforms.RandomMask(coverage, max_width, value, axis)

Randomly masks signal along axis.

The signal is masked by multiple blocks (i.e. runs of consecutive units) whose sizes are sampled uniformly up to a maximum block size. The algorithm for a single block is as follows:

  1. \(\text{width} \sim U[0, {\text{maximum\_width}}]\)
  2. \(\text{start} \sim U[0, {\text{signal\_size}} - \text{width})\)

The number of blocks is approximated by the specified coverage of the masking and the average size of a block.
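The sampling steps can be sketched in NumPy as follows. This is an illustration, not the RandomMask implementation: `random_mask` is a hypothetical helper, it assumes max_width is positive, and the block count (coverage times signal size divided by the mean block width) is one plausible reading of the approximation described above.

```python
import numpy as np


def random_mask(signal, coverage, max_width, value=0.0, axis=-1, seed=None):
    """Mask random blocks along one axis (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    size = signal.shape[axis]
    average_width = max_width / 2  # mean of U[0, max_width]
    n_blocks = int(round(coverage * size / average_width))  # assumed approximation
    # Work on a copy, viewed with the masked axis last.
    masked = np.moveaxis(signal.copy(), axis, -1)
    for _ in range(n_blocks):
        width = int(rng.integers(0, max_width + 1))  # width ~ U[0, max_width]
        if width == 0 or width >= size:
            continue  # nothing to mask, or block would not fit
        start = int(rng.integers(0, size - width))  # start ~ U[0, size - width)
        masked[..., start:start + width] = value
    return np.moveaxis(masked, -1, axis)


a = np.ones((1, 4, 100))
out = random_mask(a, coverage=0.1, max_width=10, value=0.0, axis=2, seed=0)
```

Blocks may overlap, so the masked proportion can fall below the requested coverage.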

  • coverage controls how large the proportion of masking is relative to the signal size
  • max_width controls the maximum size of a masked block
  • value controls the value to mask the signal with
  • axis controls the axis to mask the signal along
Parameters:
  • coverage (float) – proportion of signal to mask
  • max_width (int) – maximum block size. The unit depends on the signal and axis. See MaskSpectrogramTime and MaskSpectrogramFrequency
  • value (float) – mask value
  • axis (int) – axis to mask signal along

Example

>>> a = torch.empty((1, 4, 10)).uniform_(1, 2)
>>> t = RandomMask(0.1, max_width=1, value=0, axis=2)
>>> print(t)
RandomMask(coverage=0.1, max_width=1, value=0, axis=2)
>>> len((t(a) == 0).nonzero())  # number of 0 elements
4

MaskSpectrogramTime

class audtorch.transforms.MaskSpectrogramTime(coverage, *, max_width=11, value=0)

Randomly masks spectrogram along time axis.

See RandomMask for more details.

Note

The time axis is derived from Spectrogram’s output shape.

Parameters:
  • coverage (float) – proportion of signal to mask
  • max_width (int, optional) – maximum block size in number of spectrogram time frames. The default value corresponds to a time span of 0.1 seconds of a signal with sr=16000 and stft-specifications of window_size=320 and hop_size=160. Default: 11
  • value (float, optional) – mask value. Default: 0

Example

>>> from librosa.display import specshow  # doctest: +SKIP
>>> import matplotlib.pyplot as plt  # doctest: +SKIP
>>> a = torch.empty(65000).uniform_(-1, 1)
>>> t = Compose([Spectrogram(320, 160), MaskSpectrogramTime(0.1)])
>>> magnitude = t(a).squeeze().numpy()
>>> specshow(np.log10(np.abs(magnitude) + 1e-4)) # doctest: +SKIP
>>> plt.show()  # doctest: +SKIP

MaskSpectrogramFrequency

class audtorch.transforms.MaskSpectrogramFrequency(coverage, *, max_width=8, value=0)

Randomly masks spectrogram along frequency axis.

See RandomMask for more details.

Note

The frequency axis is derived from Spectrogram’s output shape.

Parameters:
  • coverage (float) – proportion of signal to mask
  • max_width (int, optional) – maximum block size in number of frequency bins. The default value corresponds to approximately 5% of all frequency bins with stft-specifications of window_size=320 and hop_size=160. Default: 8
  • value (float, optional) – mask value. Default: 0

Example

>>> from librosa.display import specshow  # doctest: +SKIP
>>> import matplotlib.pyplot as plt  # doctest: +SKIP
>>> a = torch.empty(65000).uniform_(-1, 1)
>>> t = Compose([Spectrogram(320, 160), MaskSpectrogramFrequency(0.1)])
>>> magnitude = t(a).squeeze().numpy()
>>> specshow(np.log10(np.abs(magnitude) + 1e-4)) # doctest: +SKIP
>>> plt.show()  # doctest: +SKIP

Downmix

class audtorch.transforms.Downmix(channels, *, method='mean', axis=-2)

Downmix to the provided number of channels.

The downmix is done by one of these methods:

  • 'mean' replace last desired channel by mean across itself and all remaining channels
  • 'crop' drop all remaining channels

  • channels controls the number of desired channels
  • method controls downmixing method
  • axis controls axis of downmix
Parameters:
  • channels (int) – number of desired channels
  • method (str, optional) – downmix method. Default: mean
  • axis (int, optional) – axis to downmix. Default: -2
Shape:
  • Input: \((*, C_\text{in}, *)\)
  • Output: \((*, C_\text{out}, *)\), where \(C_\text{in}\) is the number of input channels and \(C_\text{out}\) is the number of output channels as given by channels. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Downmix(1, axis=0)
>>> print(t)
Downmix(channels=1, method=mean, axis=0)
>>> t(a)
array([[2, 3]])
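The two downmix methods can be sketched in NumPy. `downmix` is a hypothetical helper written from the description above, not the library code; in particular, its output dtype may differ from what Downmix returns.

```python
import numpy as np


def downmix(signal, channels, method="mean", axis=-2):
    """Reduce the number of channels along one axis (hypothetical helper)."""
    signal = np.moveaxis(signal, axis, 0)  # put the channel axis first
    if method == "crop":
        out = signal[:channels]  # drop all remaining channels
    elif method == "mean":
        kept = signal[:channels - 1]
        # The last desired channel becomes the mean of itself and the rest.
        mixed = signal[channels - 1:].mean(axis=0, keepdims=True)
        out = np.concatenate([kept, mixed], axis=0)
    else:
        raise ValueError(f"unknown method: {method}")
    return np.moveaxis(out, 0, axis)


a = np.array([[1, 2], [3, 4], [5, 6]])
mean_mix = downmix(a, 2, method="mean", axis=0)
cropped = downmix(a, 2, method="crop", axis=0)
mono = downmix(a[:2], 1, axis=0)
```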

Upmix

class audtorch.transforms.Upmix(channels, *, method='mean', axis=-2)

Upmix to the provided number of channels.

The upmix is achieved by adding the same signal in the additional channels. This signal is calculated by one of the following methods:

  • 'mean' mean across all input channels
  • 'zero' zeros
  • 'repeat' last input channel

  • channels controls the number of desired channels
  • method controls upmixing method
  • axis controls axis of upmix
Parameters:
  • channels (int) – number of desired channels
  • method (str, optional) – upmix method. Default: mean
  • axis (int, optional) – axis to upmix. Default: -2
Shape:
  • Input: \((*, C_\text{in}, *)\)
  • Output: \((*, C_\text{out}, *)\), where \(C_\text{in}\) is the number of input channels and \(C_\text{out}\) is the number of output channels as given by channels. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Upmix(3, axis=0)
>>> print(t)
Upmix(channels=3, method=mean, axis=0)
>>> t(a)
array([[1., 2.],
       [3., 4.],
       [2., 3.]])
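The three upmix methods differ only in the signal used to fill the added channels. The sketch below illustrates this in NumPy; `upmix` is a hypothetical helper, not the Upmix class.

```python
import numpy as np


def upmix(signal, channels, method="mean", axis=-2):
    """Add channels along one axis (hypothetical helper)."""
    signal = np.moveaxis(signal, axis, 0).astype(float)  # channel axis first
    n_extra = channels - signal.shape[0]
    if n_extra <= 0:
        return np.moveaxis(signal, 0, axis)  # nothing to add
    if method == "mean":
        fill = signal.mean(axis=0, keepdims=True)
    elif method == "zero":
        fill = np.zeros_like(signal[:1])
    elif method == "repeat":
        fill = signal[-1:]  # last input channel
    else:
        raise ValueError(f"unknown method: {method}")
    extra = np.repeat(fill, n_extra, axis=0)  # same signal in every new channel
    out = np.concatenate([signal, extra], axis=0)
    return np.moveaxis(out, 0, axis)


a = np.array([[1, 2], [3, 4]])
by_mean = upmix(a, 3, method="mean", axis=0)
by_zero = upmix(a, 3, method="zero", axis=0)
by_repeat = upmix(a, 3, method="repeat", axis=0)
```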

Remix

class audtorch.transforms.Remix(channels, *, method='mean', axis=-2)

Remix to the provided number of channels.

The remix is achieved by repeating the mean of all other channels or by replacing the last desired channel by the mean across all channels.

It is internally achieved by running Upmix or Downmix with method mean.

  • channels controls the number of desired channels
  • axis controls axis of remix
Parameters:
  • channels (int) – number of desired channels
  • axis (int, optional) – axis to remix. Default: -2
Shape:
  • Input: \((*, C_\text{in}, *)\)
  • Output: \((*, C_\text{out}, *)\), where \(C_\text{in}\) is the number of input channels and \(C_\text{out}\) is the number of output channels as given by channels. \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([[1, 2], [3, 4]])
>>> t = Remix(3, axis=0)
>>> print(t)
Remix(channels=3, axis=0)
>>> t(a)
array([[1., 2.],
       [3., 4.],
       [2., 3.]])

Normalize

class audtorch.transforms.Normalize(*, axis=-1)

Normalize signal.

Ensure the maximum of the absolute value of the signal is 1.

  • axis controls axis for normalization
Parameters: axis (int, optional) – axis for normalization. Default: -1
Shape:
  • Input: \((*)\)
  • Output: \((*)\), where \(*\) can be any number of dimensions.

Example

>>> a = np.array([1, 2, 3, 4])
>>> t = Normalize()
>>> print(t)
Normalize(axis=-1)
>>> t(a)
array([0.25, 0.5 , 0.75, 1.  ])
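Peak normalization amounts to dividing by the largest absolute value. The sketch below assumes the peak is taken along the given axis; `normalize` is a hypothetical helper and does not guard against all-zero signals.

```python
import numpy as np


def normalize(signal, axis=-1):
    """Scale so the maximum absolute value along axis is 1 (hypothetical helper)."""
    peak = np.max(np.abs(signal), axis=axis, keepdims=True)
    return signal / peak  # an all-zero signal would divide by zero here


a = np.array([1, 2, 3, 4])
out = normalize(a)
```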

Standardize

class audtorch.transforms.Standardize(*, mean=True, std=True, axis=-1)

Standardize signal.

Ensure the signal has a mean value of 0 and a variance of 1.

  • mean controls whether mean centering will be applied
  • std controls whether standard deviation normalization will be applied
  • axis controls axis for standardization
Parameters:
  • mean (bool, optional) – apply mean centering. Default: True
  • std (bool, optional) – normalize by standard deviation. Default: True
  • axis (int, optional) – standardize only along the given axis. Default: -1
Shape:
  • Input: \((*)\)
  • Output: \((*)\), where \(*\) can be any number of dimensions.

Example

>>> a = np.array([1, 2, 3, 4])
>>> t = Standardize()
>>> print(t)
Standardize(axis=-1, mean=True, std=True)
>>> t(a)
array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])
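The values above are consistent with subtracting the mean and dividing by the population standard deviation (ddof=0). The sketch below reproduces them in NumPy; `standardize` is a hypothetical helper, not the library implementation.

```python
import numpy as np


def standardize(signal, mean=True, std=True, axis=-1):
    """Center and scale along one axis (hypothetical helper)."""
    out = signal.astype(float)
    if mean:
        out = out - out.mean(axis=axis, keepdims=True)
    if std:
        # Population standard deviation (NumPy's default, ddof=0).
        out = out / out.std(axis=axis, keepdims=True)
    return out


a = np.array([1, 2, 3, 4])
out = standardize(a)
```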

Resample

class audtorch.transforms.Resample(input_sampling_rate, output_sampling_rate, *, method='kaiser_best', axis=-1)

Resample to new sampling rate.

The signal is resampled by one of the following methods:

  • 'kaiser_best' as implemented by resampy
  • 'kaiser_fast' as implemented by resampy
  • 'scipy' uses scipy for resampling

  • input_sampling_rate controls input sample rate in Hz
  • output_sampling_rate controls output sample rate in Hz
  • method controls the resample method
  • axis controls axis for resampling
Parameters:
  • input_sampling_rate (int) – input sample rate in Hz
  • output_sampling_rate (int) – output sample rate in Hz
  • method (str, optional) – resample method. Default: kaiser_best
  • axis (int, optional) – axis for resampling. Default: -1

Note

If the default method kaiser_best is too slow for your purposes, you should try scipy instead. scipy is the fastest method, but might crash for very long signals.

Shape:
  • Input: \((*)\)
  • Output: \((*)\), where \(*\) can be any number of dimensions.

Example

>>> a = np.array([1, 2, 3, 4])
>>> t = Resample(4, 2)
>>> print(t)
Resample(input_sampling_rate=4, output_sampling_rate=2, method=kaiser_best, axis=-1)
>>> t(a)
array([0, 2])

Spectrogram

class audtorch.transforms.Spectrogram(window_size, hop_size, *, fft_size=None, window='hann', axis=-1)

Spectrogram of an audio signal.

The spectrogram is calculated with librosa, and its magnitude is returned as a real-valued matrix.

  • window_size controls FFT window size in samples
  • hop_size controls STFT window hop size in samples
  • fft_size controls number of frequency bins in STFT
  • window controls window function of spectrogram computation
  • axis controls axis of spectrogram computation
  • phase holds the phase of the spectrogram
Parameters:
  • window_size (int) – size of STFT window in samples
  • hop_size (int) – size of STFT window hop in samples
  • fft_size (int, optional) – number of frequency bins in STFT. If None, then it defaults to window_size. Default: None
  • window (str, tuple, number, function, or numpy.ndarray, optional) – type of STFT window. Default: hann
  • axis (int, optional) – axis of STFT calculation. Default: -1
Shape:
  • Input: \((*, N_\text{in}, *)\)
  • Output: \((*, N_f, N_t, *)\), where \(N_\text{in}\) is the number of input samples and \(N_f = {\text{window\_size} \over 2} + 1\) is the number of output samples along the frequency axis of the spectrogram, and \(N_t = \lceil {1 \over \text{hop\_size}} (N_\text{in} + {\text{window\_size} \over 2}) \rceil\) is the number of output samples along the time axis of the spectrogram. \(*\) can be any additional number of dimensions.
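The shape expressions can be evaluated directly. The sketch below only computes the two formulas as written, assuming an even window_size; `spectrogram_shape` is a hypothetical helper and does not call librosa.

```python
import math


def spectrogram_shape(n_in, window_size, hop_size):
    """Evaluate N_f and N_t from the shape formulas (hypothetical helper)."""
    n_f = window_size // 2 + 1  # assumes an even window_size
    n_t = math.ceil((n_in + window_size / 2) / hop_size)
    return n_f, n_t


small = spectrogram_shape(4, window_size=2, hop_size=2)
one_second = spectrogram_shape(16000, window_size=320, hop_size=160)
```

For four input samples with window_size=2 and hop_size=2, this gives 2 frequency bins and 3 time frames.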

Example

>>> a = np.array([1., 2., 3., 4.])
>>> t = Spectrogram(2, 2)
>>> print(t)
Spectrogram(window_size=2, hop_size=2, axis=-1)
>>> t(a)
array([[1., 3., 3.],
       [1., 3., 3.]], dtype=float32)

Log

class audtorch.transforms.Log(magnitude_boost=1e-07)

Logarithmic transform of an input signal.

  • magnitude_boost controls the non-negative value added to the magnitude of the signal before applying the logarithm
Parameters: magnitude_boost (float, optional) – positive value added to the magnitude of the signal before applying the logarithm. Default: 1e-7
Shape:
  • Input: \((*)\)
  • Output: \((*)\), where \(*\) can be any additional number of dimensions.

Example

>>> a = np.array([1., 2., 3., 4.])
>>> spect = Spectrogram(window_size=2, hop_size=2)
>>> t = Log()
>>> print(t)
Log(magnitude_boost=1e-07)
>>> t(spect(a))
array([[1.1920928e-07, 1.0986123e+00, 1.0986123e+00],
       [1.1920928e-07, 1.0986123e+00, 1.0986123e+00]], dtype=float32)
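The output above is consistent with a natural logarithm applied to the boosted magnitude. The sketch below reproduces it in NumPy on the spectrogram values from the Spectrogram example; `log_transform` is a hypothetical helper.

```python
import numpy as np


def log_transform(signal, magnitude_boost=1e-7):
    """Natural log of the boosted magnitude (hypothetical helper)."""
    return np.log(np.abs(signal) + magnitude_boost)


# Spectrogram magnitudes taken from the Spectrogram example.
spect = np.array([[1.0, 3.0, 3.0], [1.0, 3.0, 3.0]])
out = log_transform(spect)
```

The boost keeps the logarithm finite where the magnitude is exactly zero.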

RandomAdditiveMix

class audtorch.transforms.RandomAdditiveMix(dataset, *, ratios=[0, 15, 30], normalize=False, expand_method='pad', crop_method='random', percentage_silence=0, time_axis=-1, channel_axis=-2, fix_randomization=False)

Mix two signals additively by a randomly picked ratio.

Randomly pick a signal from an augmentation data set and mix it with the actual signal by a signal-to-noise ratio in dB randomly selected from a list of possible ratios.

The signal from the augmentation data set is expanded, cropped, or has its number of channels adjusted by a downmix or upmix using Remix if necessary.

The signal can be expanded by:

  • 'multiple' loading multiple files from the augmentation data set and concatenating them along the time axis
  • 'pad' expand the signal by adding trailing zeros
  • 'replicate' replicate the signal to match the specified size. If result exceeds specified size after replication, the signal will then be cropped

The signal can be cropped by:

  • 'start' crop signal from the beginning of the file all the way to the necessary length
  • 'random' crop starting at a random offset from the beginning of the file

  • dataset controls the data set used for augmentation
  • ratio controls the ratio in dB between mixed signals
  • ratios controls the ratios to be randomly picked from
  • normalize controls if the mixed signal is normalized
  • expand_method controls if the signal from the augmentation data set is automatically expanded according to an expansion rule. Default: pad
  • crop_method controls how the signal is cropped. It is only relevant if the augmentation signal is longer than the input one, or if expand_method is set to multiple. Default: random
  • percentage_silence controls the percentage of the input data that will be mixed with silence. Should be between 0 and 1. Default: 0
  • time_axis controls time axis for automatic signal adjustment
  • channel_axis controls channel axis for automatic signal adjustment
  • fix_randomization controls the randomness of the ratio selection

Note

fix_randomization covers only the selection of the ratio. The selection of a signal from the augmentation data set and its signal length adjustment will always be random.

Parameters:
  • dataset (torch.utils.data.Dataset) – data set for augmentation
  • ratios (list of int, optional) – mix ratios in dB to randomly pick from (e.g. SNRs). Default: [0, 15, 30]
  • normalize (bool, optional) – normalize mixture. Default: False
  • expand_method (str, optional) – controls how the length of the signal from the augmentation data set is adjusted before it is added to the input signal. Default: pad
  • crop_method (str, optional) – controls the crop transform that will be called on the mix signal if it is longer than the input signal. Default: random
  • percentage_silence (float, optional) – controls the percentage of input data that should be augmented with silence. Default: 0
  • time_axis (int, optional) – length axis of both data sets. Default: -1
  • channel_axis (int, optional) – channels axis of both data sets. Default: -2
  • fix_randomization (bool, optional) – freeze random selection between different calls of transform. Default: False
Shape:
  • Input: \((*, C, N, *)\)
  • Output: \((*, C, N, *)\), where \(C\) is the number of channels and \(N\) is the number of samples. They don’t have to be placed in the order shown here, but the order is preserved during transformation. \(*\) can be any additional number of dimensions.

Example

>>> from audtorch import datasets
>>> np.random.seed(0)
>>> a = np.array([[1, 2], [3, 4]])
>>> noise = datasets.WhiteNoise(duration=1, sampling_rate=2)
>>> t = RandomAdditiveMix(noise, ratios=[3], expand_method='pad')
>>> print(t)
RandomAdditiveMix(dataset=WhiteNoise, ratios=[3], ratio=None, percentage_silence=0, expand_method=pad, crop_method=random, time_axis=-1, channel_axis=-2)
>>> t(a)
array([[3.67392992, 2.60655362],
       [5.67392992, 4.60655362]])
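Mixing at a given ratio in dB can be illustrated with a power-based signal-to-noise ratio. This is one common formulation and an assumption here, not necessarily how RandomAdditiveMix computes its gain; `mix_at_snr` is a hypothetical helper.

```python
import numpy as np


def mix_at_snr(signal, noise, snr_db):
    """Add noise scaled to a target SNR in dB (hypothetical helper)."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that 10 * log10(signal_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + gain * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(1000)
noise = rng.standard_normal(1000)
mixed = mix_at_snr(speech, noise, snr_db=15)

# Recover the achieved ratio from the added component.
residual = mixed - speech
achieved_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```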