Audio Processing#

Objective#

The objective of this project is to implement an automatic system that determines the severity level of Parkinson’s Disease (PD) in a patient from speech features. The system takes a speech utterance from an unknown speaker and estimates their PD level by analyzing the voice with machine learning techniques.

Patients with PD usually have difficulties in speaking because of reduced coordination of the muscles involved in the human speech production system. This causes distortions in the phoneme articulation, prosody, etc., diminishing the subject’s speech intelligibility.

We have tried two main approaches in this project: one based on fixed-length speech features and another based on time-varying speech features (sequences of features).

In both approaches we have considered different models, and in the first one also several feature extraction methods. We have then selected the best overall alternative in terms of predictive performance.

The models considered are the following:

In the approach based on speech features of fixed length:
- Random Forest (RF)
- Extreme Gradient Boosting (XGBoost)
- Multi-layer Perceptron (MLP)
- Two neural networks implemented by means of PyTorch

In the approach based on time-varying speech features (sequences of features):
- Gaussian Mixture Models (GMM)
- Recurrent Neural Networks (RNN) implemented in PyTorch

The feature extraction methods are the following:

Mel-Frequency Cepstrum Coefficients (MFCC)
Chromagram
Spectral Centroid
Spectral Bandwidth
Spectral Contrast
Spectral Rolloff
Zero Crossing Rate
Tempogram

Requirements#

import numpy  as np
import polars as pl
import sys
import matplotlib.pyplot as plt
import librosa  # package for speech and audio analysis
import IPython.display as ipd
import seaborn as sns
sns.set_style('whitegrid')

sys.path.insert(0, r"C:\Users\fscielzo\Documents\Packages\PyAudio_Package_Private")
from PyAudio.preprocessing import get_X_audio_features, get_X_tensor_audio_features

Data#

We have a database, recorded at a sampling frequency of 16000 Hz, composed of the following elements:

20 speakers
12 audios per speaker (240 audio files in total)

The speakers have different Parkinson’s disease levels:
- normal (0)
- slight (1)
- moderate (2)
- severe (3)

The dataset has been manually annotated following a subset of the Unified Parkinson’s Disease Rating Scale (UPDRS), a scoring scale used by neurologists for the clinical assessment of PD.

Reading Speech Files#

In this section we show how to read speech files by means of librosa.

Reading a class 0 (normal) speech#

Here we read a normal (0) speech.

# Reading a speech file from the database - Class 0

fs = 16000  # sampling frequency
audio_file = 'PDSpeechData/loc17/loc17_s01.wav'  # speech file

# 'audio_signal_1' is an array with the amplitude of the audio signal along the time:
audio_signal_1, sr = librosa.load(audio_file, sr=fs)

We have set a sampling frequency of 16000 Hz, which means that each second of audio is represented by 16000 amplitude values. The audio file is about 13 seconds long (13.5 seconds, to be more precise), so we have approximately 13*16000 = 208000 amplitude points (216500 exactly, as the shape of the array shows). These amplitude points are stored in the array audio_signal_1.

The underlying sound is continuous in time, so any 1-second interval contains infinitely many amplitude values; the digitized signal keeps fs (16000) equally spaced samples per second. librosa.load(audio_file, sr=fs) returns those samples as a floating-point array, resampling the file to fs if it was stored at a different rate.

audio_signal_1

array([ 0.04125977,  0.05276489,  0.03128052, ..., -0.00030518,
       -0.00030518, -0.00033569], dtype=float32)

time_audio_signal_1 = 13
time_audio_signal_1 * fs

audio_signal_1.shape

(216500,)
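
The relation between the length of the array and the duration of the recording can be checked directly. This is a minimal sketch that reuses the shape reported above; the variable names n_samples and duration_s are illustrative and not part of the original code.

```python
# Duration of a sampled signal: number of samples divided by the sampling rate.
fs = 16000            # sampling frequency in Hz (samples per second)
n_samples = 216500    # length of audio_signal_1, taken from the shape shown above

duration_s = n_samples / fs
print(f"{duration_s:.2f} s")
```

The same division applies to any other signal loaded with the same fs.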

Plotting the audio signal amplitudes along time

audio_signal = audio_signal_1
zoom = range(0, 500)  # first 500 samples, shown in the lower panel
fig, axes = plt.subplots(2, 1, figsize=(12, 7))
axes = axes.flatten()

# Full signal (top) and zoom on the first samples (bottom)
sns.lineplot(y=audio_signal, x=range(len(audio_signal)), color='blue', ax=axes[0])
sns.lineplot(y=audio_signal[zoom], x=zoom, color='blue', ax=axes[1])

axes[0].set_title(audio_file.split('/')[-1] + ' - Class 0 (normal)')
axes[1].set_title(audio_file.split('/')[-1] + f' - {str(zoom)}')

for ax in axes:
    ax.set_ylabel('Amplitude', size=11)
    ax.set_xlabel('Time (samples)', size=11)
plt.subplots_adjust(hspace=0.4, wspace=0.5)

_images/cf1ef0fdacc618c6d255480fc6c8f418daae1ea187aaeab0c2bfa81c85094d20.png

Displaying the audio file as sound:

ipd.Audio(audio_signal_1, rate=fs)

Reading a class 3 (severe) speech#

Now we read a severe (3) speech.

# Reading a speech file from the database - Class 3

fs = 16000  # sampling frequency
audio_file = 'PDSpeechData/loc18/loc18_s01.wav'  # speech file

audio_signal_2, sr = librosa.load(audio_file, sr=fs)
# 'audio_signal_2' is an array with the amplitude of the audio signal along the time

audio_signal_2

array([-0.25128174, -0.37490845, -0.24560547, ..., -0.04568481,
       -0.04675293, -0.04760742], dtype=float32)

time_audio_signal_2 = 7
time_audio_signal_2 * fs

audio_signal_2.shape

(122500,)

Plotting the audio signal amplitudes along time

audio_signal = audio_signal_2
zoom = range(0, 500)  # first 500 samples, shown in the lower panel
fig, axes = plt.subplots(2, 1, figsize=(12, 7))
axes = axes.flatten()

# Full signal (top) and zoom on the first samples (bottom)
sns.lineplot(y=audio_signal, x=range(len(audio_signal)), color='blue', ax=axes[0])
sns.lineplot(y=audio_signal[zoom], x=zoom, color='blue', ax=axes[1])

axes[0].set_title(audio_file.split('/')[-1] + ' - Class 3 (severe)')
axes[1].set_title(audio_file.split('/')[-1] + f' - {str(zoom)}')

for ax in axes:
    ax.set_ylabel('Amplitude', size=11)
    ax.set_xlabel('Time (samples)', size=11)
plt.subplots_adjust(hspace=0.4, wspace=0.5)

_images/07c5fb91c8f74515d35a8b11f29dcb69919c2488de64e307fc2f37730444c542.png

Displaying the audio file as sound:

# Play the audio data
ipd.Audio(audio_signal_2, rate=fs)

Feature extraction#

In the following section we show an example of feature extraction for one of the previous speech signals, audio_signal_1.

The point is, given an audio signal, to extract features that characterize it and that can be used with Machine Learning algorithms, in this case to classify a new signal into one of the four PD levels mentioned above.

We distinguish two types of audio features:

Time-varying features (sequences)
- This type is suitable for models that work well with sequential data, like Recurrent Neural Networks and Gaussian Mixture Models.
Fixed-length features
- This type is suitable for models that work with tabular data, like Random Forest, XGBoost and Multi-Layer Perceptron Neural Networks.
- These features are basically statistics computed on the time-varying features.

The feature extraction methods that we are going to use throughout this project are the following:

Mel-Frequency Cepstrum Coefficients (MFCC)
Chromagram
Spectral Centroid
Spectral Bandwidth
Spectral Contrast
Spectral Rolloff
Zero Crossing Rate
Tempogram

These methods are applied directly to the amplitude series of a given audio signal and return a matrix with the time-varying features for that signal. We can then compute a vector of statistics over the features of that matrix; these are the fixed-length features of that audio signal.

This process can be repeated for all the \(n\) available audio signals, so that we obtain \(n\) matrices with time-varying features.

These \(n\) matrices can then be arranged in a 3D array, also known as a tensor, that can be used as input by different Machine Learning algorithms, typically deep learning ones such as Recurrent Neural Networks (RNN).

But these \(n\) matrices (2D arrays) can also be transformed into vectors (1D arrays) by computing statistics, like the mean or the standard deviation, for the features contained in those matrices. We then have \(n\) 1D arrays that can be stacked to build a 2D array, which can be interpreted as a predictors matrix (tabular data) and used as input by classic Machine Learning algorithms that work well with tabular data.

Mel-Frequency Cepstrum Coefficients (MFCC)#

MFCCs are a compact representation of the short-term power spectrum of an audio signal, obtained by applying a cosine transform to the log power spectrum on a nonlinear Mel scale of frequency. The Mel scale aims to mimic the human ear’s response more closely than the linearly spaced frequency bands of the usual Fourier transform. This makes MFCCs particularly useful for applications like speech recognition or music analysis, where a perception-like representation of audio can be more beneficial than a purely physical one.

The MFCC components refer to the number of coefficients extracted from each frame of the audio signal. The choice of the number of components is somewhat arbitrary but is based on empirical evidence suggesting that the first few coefficients (usually the first 12 to 20) capture most of the useful information about the spectral envelope of the audio signal. The higher coefficients, which represent finer details of the spectrum, are often discarded.
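
The last step described above, a cosine transform of log mel energies followed by truncation to the first coefficients, can be sketched with NumPy alone. This is only an illustration under stated assumptions: the log-mel matrix is random synthetic data rather than real speech, and the helper dct_ii_ortho_basis is a hypothetical name introduced here, not a librosa function.

```python
import numpy as np

def dct_ii_ortho_basis(n_bands: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n_bands, n_bands); row k is the k-th cosine."""
    n = np.arange(n_bands)
    k = n.reshape(-1, 1)
    basis = np.sqrt(2.0 / n_bands) * np.cos(np.pi * (n + 0.5) * k / n_bands)
    basis[0, :] /= np.sqrt(2.0)  # rescale the constant row so that all rows have unit norm
    return basis

rng = np.random.default_rng(0)
n_bands, n_frames, n_keep = 40, 100, 20

# Synthetic stand-in for a log-mel spectrogram (bands x frames); not real speech.
log_mel = rng.normal(size=(n_bands, n_frames))

basis = dct_ii_ortho_basis(n_bands)
cepstrum = basis @ log_mel       # 40 cepstral coefficients per frame
mfcc_like = cepstrum[:n_keep]    # keep the first 20, discard the finer spectral detail
print(mfcc_like.shape)
```

Keeping only the low-order rows retains the smooth spectral envelope, which is the reason the higher coefficients are usually dropped.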

In this case, we want to compute a sequence of Mel-Frequency Cepstrum Coefficients (MFCC) with the following configuration:

Size of the analysis window = 32 ms = 0.032 secs
Frame period or hop length = 8 ms = 0.008 secs
Number of filters in the mel filterbank = 40
Number of MFCC components = 20

To do that, we are going to use the function mfcc from the module feature of the librosa package. This function has, among others, the following input arguments:

y: speech signal
sr: sampling frequency
n_fft: window size (in samples)
hop_length: frame period or hop length (in samples)
n_mels: number of filters in the mel filterbank
n_mfcc: number of MFCC components

Note that in this function the window size and the hop length must be expressed in samples. Taking into account that the sampling frequency (fs) indicates that 1 second corresponds to fs samples (in our case, as fs=16000 Hz, 1 second corresponds to 16000 samples), the conversion from seconds to samples is performed by:

samples = seconds*fs = seconds*16000

# Specifying variables for feature extraction
fs = 16000 # Sampling frequency
wst = 0.032 # Window size (seconds)
fpt = 0.008 # Frame period (seconds)
nfft = int(np.ceil(wst*fs)) # Window size (samples)
fp = int(np.ceil(fpt*fs)) # Frame period (samples)
nbands = 40 # Number of filters in the filterbank
ncomp = 20 # Number of MFCC components

# Feature extraction with MFCC 
x_MFCC = librosa.feature.mfcc(y=audio_signal_1, sr=fs, n_fft=nfft, hop_length=fp, n_mels=nbands, n_mfcc=ncomp).T

x_MFCC

array([[-1.5723482e+02,  8.1401794e+01, -1.0822573e+01, ...,
        -1.4883176e+00, -2.3103115e-01, -5.7656441e+00],
       [-1.4987587e+02,  8.2267227e+01, -1.5025454e+01, ...,
        -4.9614630e+00, -5.0628245e-01, -1.0207132e+01],
       [-1.5724088e+02,  7.8481644e+01, -2.0651829e+01, ...,
        -8.5300264e+00, -1.4817656e+00, -1.2810444e+01],
       ...,
       [-4.8450891e+02,  1.6448849e+01,  1.2042265e+01, ...,
         1.0532631e+00,  6.7879003e-01,  3.1149516e-01],
       [-4.8156741e+02,  2.0192478e+01,  1.4441982e+01, ...,
         8.4327966e-01, -4.1927201e-01,  1.3737071e-01],
       [-4.7447165e+02,  2.9626501e+01,  2.1557457e+01, ...,
         5.4327607e-01, -3.4168136e-01, -1.5610456e-01]], dtype=float32)

x_MFCC.shape

(1692, 20)

x_MFCC is a time-samples x ncomp = 1692 x 20 matrix, whose columns represent features and whose rows represent observations (frames).

20 is the number of Mel-Frequency Cepstrum Coefficients (MFCC) components, and is defined a priori as a hyper-parameter.
1692 is the number of time-samples (frames) for that audio signal, and is approximately the number of samples of the audio signal divided by the frame period (in samples).

ncomp

n_samples_audio_signal_1 = len(audio_signal_1)
time_samples_audio_signal_1 = np.ceil(n_samples_audio_signal_1 / fp)
time_samples_audio_signal_1

1692.0
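
The frame count can also be reproduced with integer arithmetic. The sketch below assumes librosa's default centered framing (center=True), under which the number of frames is 1 + n_samples // hop_length; the helper name n_frames_centered is hypothetical and introduced only for illustration.

```python
def n_frames_centered(n_samples: int, hop_length: int) -> int:
    """Frame count when the signal is padded so that frames are centered on multiples of hop_length."""
    return 1 + n_samples // hop_length

n_samples = 216500   # length of audio_signal_1
hop_length = 128     # 0.008 s * 16000 Hz

frames = n_frames_centered(n_samples, hop_length)
print(frames)
```

This agrees with the ceiling of n_samples / hop_length except when n_samples is an exact multiple of hop_length, in which case the centered framing yields one more frame.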

The columns of x_MFCC, which are the MFCC components, can be interpreted as features of the audio signal, and its rows as the values of these features over time; this is why we call them ‘time-varying’ features.

It’s important to realize that x_MFCC is the MFCC matrix for one single audio signal, specifically audio_signal_1.

But we want to obtain this matrix for all the available audio signals, in order to build a tensor or a predictors matrix to be used with Machine Learning models to carry out the classification task of this project, which is our main goal.

From MFCC matrices to tensor#

The task is to transform the \(n\) MFCC matrices (one per audio) into a tensor (3D array) of shape n x ncomp x max-n_time_samples.

To do that we are going to use our custom function get_X_tensor_audio_features, which takes a list of audio file paths, a feature extraction method and the parameters for that method. It then loads each audio file as a signal and applies the specified feature extraction method to it, obtaining its time-varying features.

This process is repeated for each audio file, obtaining \(n\) MFCC matrices of size ncomp x max-n_time_samples, and the results are stored in a 3D array, that is, in a tensor of size n x ncomp x max-n_time_samples. The final output of our function is the desired tensor.

Why the last dimension of the tensor is max-n_time_samples is explained later.

We read all the audio file paths along with their corresponding class, and store them in a data-frame.

files_list_name = 'Files_List.txt'
files_df = pl.read_csv(files_list_name, separator='\t', has_header=False, new_columns=['path', 'level'])

files_df.head(3)

shape: (3, 2)

path	level
str	i64
"PDSpeechData/l…	0
"PDSpeechData/l…	0
"PDSpeechData/l…	0

We have 240 audio files.

files_df.shape

(240, 2)

Now we can process all those audio files, extract time-varying features from them and build a tensor to be used in ML models like RNN.

In this case we are using MFCC as features extraction method, with the parameters defined previously.

X_MFCC_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='MFCC', sr=fs, n_fft=nfft, hop_length=fp, n_mels=nbands, n_mfcc=ncomp)

As you can see this is indeed a tensor, since it is a 3D array with shape (240, 20, 4403).

This means that for each of our 240 audios we have an MFCC matrix with time-varying features that characterize it.

X_MFCC_tensor

array([[[-3.25005737e+02, -3.13607208e+02, -3.10144470e+02, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 6.67731781e+01,  7.59160614e+01,  7.93242340e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.46149426e+01,  1.50402470e+01,  1.77627831e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-4.79171467e+00, -1.09497318e+01, -9.35082436e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-3.61563778e+00, -1.05655603e+01, -6.97648907e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 8.39404225e-01, -1.94986522e+00,  3.36962867e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[-3.77771057e+02, -3.72950775e+02, -3.70798340e+02, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.37817211e+01,  1.93036613e+01,  2.15146694e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 8.53915405e+00,  1.09322910e+01,  1.14195042e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [ 2.99091190e-01,  1.31409812e+00, -1.37108326e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.42719173e+00,  2.57501340e+00,  6.80215955e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.89939249e+00,  3.12848997e+00,  3.17334318e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[-3.78744751e+02, -3.71267303e+02, -3.72638184e+02, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 3.59314842e+01,  4.38024063e+01,  4.41164932e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 8.90640831e+00,  1.22590275e+01,  1.53751431e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-2.48288512e+00, -4.04505491e+00, -2.29102421e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 3.44525909e+00,  5.58475924e+00,  6.77276421e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 3.18025351e+00,  7.84416962e+00,  8.91281891e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       ...,

       [[-1.39822342e+02, -1.64667130e+02, -2.25252365e+02, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.03545357e+02,  1.08468513e+02,  9.81202545e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.58028412e+00,  1.53690796e+01,  3.00571747e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [ 5.99450636e+00,  1.21303453e+01,  3.11318607e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 3.47830296e+00,  6.51342058e+00,  1.70105896e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-4.55060124e-01,  6.44598126e-01,  3.88159275e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[-1.84438980e+02, -1.90262070e+02, -2.01296692e+02, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 7.01741333e+01,  7.43718948e+01,  7.84279633e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.31448584e+01,  1.41633148e+01,  1.33200607e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-1.57305622e+00,  9.78493154e-01, -5.54744959e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 4.42293644e+00,  2.46918893e+00,  2.14608788e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 3.28937721e+00,  3.15117645e+00,  1.88868785e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[-1.69065125e+02, -1.80024323e+02, -1.97941238e+02, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 7.73344879e+01,  7.82881241e+01,  7.72418594e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 2.05024719e+01,  2.66848755e+01,  3.05285492e+01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-5.02676678e+00, -7.10136032e+00, -8.99715805e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-4.17449474e-01, -1.66777277e+00, -2.24128127e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-2.11845565e+00, -3.21187496e+00, -4.22027206e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]]])

X_MFCC_tensor.shape

(240, 20, 4403)

For example, this is the MFCC matrix for our first audio file:

X_MFCC_tensor[0]

array([[-325.0057373 , -313.60720825, -310.14447021, ...,    0.        ,
           0.        ,    0.        ],
       [  66.7731781 ,   75.9160614 ,   79.32423401, ...,    0.        ,
           0.        ,    0.        ],
       [  14.61494255,   15.04024696,   17.76278305, ...,    0.        ,
           0.        ,    0.        ],
       ...,
       [  -4.79171467,  -10.94973183,   -9.35082436, ...,    0.        ,
           0.        ,    0.        ],
       [  -3.61563778,  -10.56556034,   -6.97648907, ...,    0.        ,
           0.        ,    0.        ],
       [   0.83940423,   -1.94986522,    3.36962867, ...,    0.        ,
           0.        ,    0.        ]])

X_MFCC_tensor[0].shape

(20, 4403)

As you can see, there are zeros at the end of each row; these are the result of padding the MFCC matrix.

Each audio has an MFCC matrix of a different size (in terms of time-samples, since all of them have the same number of components). To store all of them in a 3D array we need to enforce the same size for all of them, and this is done by forcing every MFCC matrix to have the same number of time-samples, specifically the maximum one (max-n_time_samples), that is, the one of the largest MFCC matrix. All the MFCC matrices except the largest one therefore have extra positions along the time-samples dimension, and those extra positions are filled with zeros. This process is called padding, and is done automatically by our function get_X_tensor_audio_features.
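
The padding step can be sketched with NumPy on toy data. This is not the implementation of get_X_tensor_audio_features, which is not shown here; pad_and_stack is a hypothetical helper, and the input matrices are random stand-ins for MFCC matrices of different lengths.

```python
import numpy as np

def pad_and_stack(matrices):
    """Zero-pad (n_features, n_frames_i) matrices along time and stack them into a 3D array."""
    max_frames = max(m.shape[1] for m in matrices)
    n_features = matrices[0].shape[0]
    tensor = np.zeros((len(matrices), n_features, max_frames), dtype=matrices[0].dtype)
    for i, m in enumerate(matrices):
        tensor[i, :, : m.shape[1]] = m   # positions beyond the original length stay at zero
    return tensor

rng = np.random.default_rng(0)
# Three toy "MFCC matrices" with 20 components and 5, 8 and 3 frames respectively.
toy = [rng.normal(size=(20, t)) for t in (5, 8, 3)]

toy_tensor = pad_and_stack(toy)
print(toy_tensor.shape)  # (3, 20, 8)
```

Only the longest matrix fills its slice completely; the others end in columns of zeros, as in the output of X_MFCC_tensor[0] above.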

From MFCC matrices to predictors matrix (tabular data)#

The task is to transform the \(n\) MFCC matrices (one per audio) into a predictors matrix (2D array) of shape n x ncomp.

To do this we are going to use our custom function get_X_audio_features, which takes a list of audio file paths, a feature extraction method and the parameters for that method. It then loads each audio file as a signal and applies the specified feature extraction method to it, obtaining its time-varying features. Statistics are then computed for each feature along the time dimension, obtaining a vector (1D array) of size ncomp.

This process is repeated for each audio file, obtaining \(n\) vectors, and the results are stored in a 2D array of size n x ncomp, that is, a matrix. The final output of our function is the desired predictors matrix.

Here, as example, we compute two possible predictors (features) matrices using the MFCC method for feature extraction and two different statistics configurations, one with the mean and another with both the mean and the standard deviation.

X_MFCC_stats_1 = get_X_audio_features(paths=files_df['path'], method='MFCC', stats='mean', sr=fs, n_fft=nfft, hop_length=fp, n_mels=nbands, n_mfcc=ncomp)
X_MFCC_stats_2 = get_X_audio_features(paths=files_df['path'], method='MFCC', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp, n_mels=nbands, n_mfcc=ncomp)

These matrices have predictors as columns and observations/samples as rows.

Since we are working with 240 audios these matrices have 240 rows, and the number of predictors depends on the statistics used.

If only one statistic is used, as in the first case, the number of predictors is equal to the number of components fixed for the feature extraction method, in this case 20.

If we use a combination of several statistics, say \(k\), the number of predictors is k*ncomp.

In the first case, since we have used only the mean, the number of predictors is equal to the number of MFCC components, that is, 20. These predictors represent the mean of the time-varying components for each audio. For example, the first predictor contains the mean of the first time-varying MFCC component (the mean of that component over time) for each of the 240 available audios.

In the second case we have used two statistics, so the number of predictors is 40 (2*20): the first 20 predictors represent the mean of the 20 time-varying components for each audio file, and the next 20 represent the standard deviation.

This idea can be generalized: we can use any combination of statistics and of feature extraction methods to build predictors matrices to be used with ML algorithms in predictive scenarios like this one.

In the predictive part of this project we will explore these alternatives, both combining different statistics as well as features extraction methods.

Note: when we talk about combining feature extraction methods what we mean is to obtain matrices using different methods and then concatenate them to form a single predictors matrix that combines the information of them, which could improve the predictive performance of certain models. This option has been considered in the predictive part.
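
The reduction from time-varying matrices to a predictors matrix can be sketched as follows. Again, this is not the code of get_X_audio_features; summarize is a hypothetical helper and the inputs are random toy matrices. The layout, with one block of ncomp columns per statistic, follows the description above.

```python
import numpy as np

def summarize(matrix, stats=("mean", "std")):
    """Collapse a (n_features, n_frames) matrix into one vector, one block per statistic."""
    funcs = {"mean": np.mean, "std": np.std, "median": np.median}
    return np.concatenate([funcs[s](matrix, axis=1) for s in stats])

rng = np.random.default_rng(0)
# Three toy feature matrices with 20 components and different numbers of frames.
toy = [rng.normal(size=(20, t)) for t in (5, 8, 3)]

# One row per audio; 2 statistics * 20 components = 40 predictors.
X_toy = np.stack([summarize(m) for m in toy])
print(X_toy.shape)  # (3, 40)
```

Because every audio is reduced to a vector of the same length, no padding is needed in this approach.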

X_MFCC_stats_1

array([[-2.13158752e+02,  9.29840240e+01, -6.40546494e+01, ...,
        -2.17615294e+00,  2.29761638e-02, -5.47736108e-01],
       [-1.98984543e+02,  6.80483322e+01, -4.74277916e+01, ...,
         4.45366669e+00,  9.83190179e-01, -4.84144878e+00],
       [-2.40027390e+02,  5.97654343e+01, -2.60434890e+00, ...,
         7.12371826e-01, -3.60437250e+00, -7.20492887e+00],
       ...,
       [-2.91605225e+02,  8.18440247e+01,  2.95778332e+01, ...,
         1.58350534e+01,  1.41482115e+01,  1.16930342e+01],
       [-2.12697678e+02,  7.87770691e+01,  1.69434319e+01, ...,
        -4.82905912e+00,  2.94541955e+00,  1.33842838e+00],
       [-2.09183228e+02,  7.25918579e+01,  3.24228172e+01, ...,
        -1.01590958e+01,  1.20261431e+00, -3.76960754e+00]], dtype=float32)

X_MFCC_stats_1.shape

(240, 20)

X_MFCC_stats_2

array([[-213.15875  ,   92.984024 ,  -64.05465  , ...,    7.095715 ,
           6.177743 ,    5.351471 ],
       [-198.98454  ,   68.04833  ,  -47.42779  , ...,    6.012935 ,
           5.59839  ,    4.8842993],
       [-240.02739  ,   59.765434 ,   -2.604349 , ...,    5.646291 ,
           5.4601407,    6.5454264],
       ...,
       [-291.60522  ,   81.844025 ,   29.577833 , ...,   13.644189 ,
          11.3029785,    9.850929 ],
       [-212.69768  ,   78.77707  ,   16.943432 , ...,    2.2818298,
           2.13314  ,    2.4295764],
       [-209.18323  ,   72.59186  ,   32.422817 , ...,    2.3896735,
           3.0300555,    2.1277957]], dtype=float32)

X_MFCC_stats_2.shape

(240, 40)

Example of predictors matrix that combines different features extraction methods and statistics#

X_MFCC_stats = get_X_audio_features(paths=files_df['path'], method='MFCC', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp, n_mels=nbands, n_mfcc=ncomp)
X_chroma_stats = get_X_audio_features(paths=files_df['path'], method='chroma', stats='median-std', sr=fs, n_fft=nfft, hop_length=fp, n_mels=nbands, n_mfcc=ncomp)

X_MFCC_stats.shape

(240, 40)

X_chroma_stats.shape

(240, 24)

X_combined = np.concatenate((X_MFCC_stats, X_chroma_stats), axis=1)
X_combined

array([[-2.1315875e+02,  9.2984024e+01, -6.4054649e+01, ...,
         2.7830657e-01,  3.4890896e-01,  1.7714919e-01],
       [-1.9898454e+02,  6.8048332e+01, -4.7427792e+01, ...,
         3.7769288e-01,  1.5670222e-01,  2.0125444e-01],
       [-2.4002739e+02,  5.9765434e+01, -2.6043489e+00, ...,
         1.8112896e-01,  2.1978563e-01,  3.3394688e-01],
       ...,
       [-2.9160522e+02,  8.1844025e+01,  2.9577833e+01, ...,
         3.7720391e-01,  3.9648506e-01,  4.0834939e-01],
       [-2.1269768e+02,  7.8777069e+01,  1.6943432e+01, ...,
         4.0676277e-02,  7.6654352e-02,  9.9300966e-02],
       [-2.0918323e+02,  7.2591858e+01,  3.2422817e+01, ...,
         2.8915258e-02,  3.2900050e-02,  3.4263603e-02]], dtype=float32)

X_combined.shape

(240, 64)

Chromagram#

Chroma features are a powerful tool for analyzing music. They capture the harmony, melody and tonality of musical signals. By projecting the entire spectrum onto 12 bins representing the 12 distinct semitones (the chromatic scale) of Western music, chroma features provide a high-level representation of audio in terms of pitch classes, regardless of the octave in which each pitch occurs. This may be useful for capturing the musical aspects of speech, which could correlate with disease states.
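
The projection onto 12 semitone bins can be illustrated for single frequencies. This sketch assumes the usual equal-tempered tuning with A4 = 440 Hz and the MIDI note numbering; pitch_class is a hypothetical helper, not the computation performed by librosa.feature.chroma_stft.

```python
import math

def pitch_class(freq_hz: float) -> int:
    """Map a frequency to one of 12 pitch classes (0 = C, 9 = A, 11 = B)."""
    midi = 69 + 12 * math.log2(freq_hz / 440.0)   # MIDI note number, with A4 = 440 Hz = 69
    return round(midi) % 12

# Three A notes in different octaves fall into the same bin.
classes = [pitch_class(f) for f in (220.0, 440.0, 880.0)]
print(classes)  # [9, 9, 9]
```

Frequencies that differ by whole octaves share a bin, which is why a chroma vector has 12 entries whatever the frequency range of the signal.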

# Feature extraction with Chromagram 
x_chroma = librosa.feature.chroma_stft(y=audio_signal_1, sr=fs, n_fft=nfft, hop_length=fp).T

x_chroma

array([[0.76920897, 0.776894  , 0.34586996, ..., 1.        , 0.63658094,
        0.48888198],
       [0.5365836 , 0.57130456, 0.07953023, ..., 1.        , 0.46217796,
        0.23196535],
       [0.59507215, 0.52440584, 0.04829044, ..., 1.        , 0.4172734 ,
        0.19030185],
       ...,
       [0.9172087 , 0.7431331 , 0.6895156 , ..., 0.8187049 , 0.9443104 ,
        1.        ],
       [0.9935548 , 0.934088  , 0.8525207 , ..., 0.9157632 , 0.99230087,
        1.        ],
       [0.93939584, 1.        , 0.96149266, ..., 0.8938448 , 0.92480946,
        0.94353443]], dtype=float32)

x_chroma.shape

(1692, 12)

From Chroma matrices to tensor#

X_chroma_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='chroma', sr=fs, n_fft=nfft, hop_length=fp)

X_chroma_tensor

array([[[3.30834001e-01, 2.73084641e-01, 1.72916576e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [4.88729805e-01, 3.64229351e-01, 3.07732373e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [8.24708462e-01, 8.15815687e-01, 7.72476017e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [7.54282296e-01, 9.33874607e-01, 6.54246688e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [4.48958814e-01, 5.71549773e-01, 4.18673873e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [3.12655747e-01, 3.80960107e-01, 2.34689996e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[3.12040150e-01, 1.55773565e-01, 2.85095990e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [3.72878194e-01, 2.47185424e-01, 3.49979341e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [7.39971995e-01, 4.82317954e-01, 3.00103635e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [4.79578823e-01, 3.46601009e-01, 5.40694356e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [4.12601918e-01, 1.45907149e-01, 3.16213101e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [3.80797684e-01, 1.39935687e-01, 2.63296276e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[1.95253670e-01, 5.68398200e-02, 8.79043713e-02, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [2.79328316e-01, 1.09151907e-01, 1.37994885e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [4.16318625e-01, 2.32939079e-01, 2.45861769e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [1.15602165e-01, 4.24965061e-02, 6.30273521e-02, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [8.62279609e-02, 2.84093022e-02, 3.58258635e-02, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [1.22719601e-01, 3.72236483e-02, 4.10057195e-02, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       ...,

       [[5.85519336e-02, 5.56419091e-03, 3.07984083e-05, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [5.80269657e-02, 5.19187702e-03, 7.23518242e-05, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [7.61531293e-02, 5.30038262e-03, 1.78323127e-04, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [3.50973278e-01, 1.18097760e-01, 7.36766532e-02, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [1.37970760e-01, 1.42571330e-02, 1.69064559e-03, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [7.20229223e-02, 7.56322034e-03, 5.10804712e-05, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[5.96340179e-01, 4.91394937e-01, 4.66173530e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [7.40738869e-01, 6.75898135e-01, 6.55857086e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [8.93028915e-01, 8.82704973e-01, 8.74643266e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [4.46513116e-01, 2.38409176e-01, 1.97379783e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [4.71661240e-01, 3.15472454e-01, 2.87048072e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [5.34890950e-01, 3.94979298e-01, 3.70775521e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[5.93712926e-01, 3.97959381e-01, 3.34024280e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [6.57446086e-01, 5.45625567e-01, 5.14605403e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [8.18063438e-01, 7.82413363e-01, 7.81226754e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        ...,
        [4.74603772e-01, 2.39262685e-01, 2.04746559e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [5.70781469e-01, 3.43587726e-01, 2.86463976e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [6.04692757e-01, 3.66815865e-01, 2.94027925e-01, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]])

X_chroma_tensor.shape

(240, 12, 4403)
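
The trailing zeros in the tensor above suggest that the per-file chroma matrices, which have different numbers of frames, are zero-padded to a common length (4403 frames here) before being stacked. Since `get_X_tensor_audio_features` belongs to a private package, the following is only a minimal sketch of how such padding might work; the helper name `pad_and_stack` and the toy matrices are assumptions, not the package's actual code.

```python
import numpy as np

def pad_and_stack(matrices):
    """Stack (n_features, n_frames_i) matrices into one (n_files, n_features, max_frames) tensor.

    Hypothetical helper: shorter matrices are right-padded with zeros along the time axis.
    """
    n_features = matrices[0].shape[0]
    max_frames = max(m.shape[1] for m in matrices)
    tensor = np.zeros((len(matrices), n_features, max_frames))
    for i, m in enumerate(matrices):
        tensor[i, :, :m.shape[1]] = m  # copy the real frames, leave the rest at zero
    return tensor

# Toy example: two "recordings" with 12 chroma bins and different lengths
toy_matrices = [np.ones((12, 3)), np.ones((12, 5))]
toy_tensor = pad_and_stack(toy_matrices)
```

With this layout, a sequence model has to be told (or has to learn) that the zero frames at the end carry no information.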

From Chroma matrices to predictors matrix (tabular data)#

X_chroma_stats = get_X_audio_features(paths=files_df['path'], method='chroma', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp)

X_chroma_stats

array([[0.32919034, 0.48674947, 0.13121736, ..., 0.27830657, 0.34890896,
        0.17714919],
       [0.11371233, 0.19766188, 0.09701521, ..., 0.37769288, 0.15670222,
        0.20125444],
       [0.5519331 , 0.1438093 , 0.08654997, ..., 0.18112896, 0.21978563,
        0.33394688],
       ...,
       [0.25067928, 0.23875663, 0.26547787, ..., 0.3772039 , 0.39648506,
        0.4083494 ],
       [0.3393735 , 0.5189489 , 0.7807806 , ..., 0.04067628, 0.07665435,
        0.09930097],
       [0.30499187, 0.47429892, 0.7501884 , ..., 0.02891526, 0.03290005,
        0.0342636 ]], dtype=float32)

X_chroma_stats.shape

(240, 24)
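
The shape (240, 24) is consistent with summarizing each of the 12 chroma bins by two statistics over time. As an illustration only, the sketch below shows one way a `stats='mean-std'` summary could be computed; the helper name `summarize_mean_std` and the column order (all means first, then all standard deviations) are assumptions, since `get_X_audio_features` is not shown here.

```python
import numpy as np

def summarize_mean_std(feature_matrix):
    """Collapse a (n_features, n_frames) matrix into a vector of length 2 * n_features.

    Hypothetical helper: the ordering (means, then standard deviations) is an assumption.
    """
    feature_matrix = np.asarray(feature_matrix, dtype=float)
    means = feature_matrix.mean(axis=1)  # one mean per feature, taken over time
    stds = feature_matrix.std(axis=1)    # one standard deviation per feature
    return np.concatenate([means, stds])

# Toy example: random values standing in for a real 12-bin chromagram with 50 frames
rng = np.random.default_rng(0)
toy_chroma = rng.random((12, 50))
toy_stats = summarize_mean_std(toy_chroma)
```

Applying the same summary to every file and stacking the resulting vectors row by row gives a fixed-length predictors matrix, regardless of how long each recording is.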

Spectral Centroid#

The Spectral Centroid represents the center of mass of the spectrum, providing a measure of the brightness of a sound. It is calculated as the weighted mean of the frequencies present in the sound, with their magnitudes as the weights. This feature gives an idea of how high or low the majority of the energy is in a sound spectrum.
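
The weighted mean described above can be sketched for a single frame as follows. This is a minimal NumPy illustration, not librosa's implementation; the function name, the toy frequencies and magnitudes, and the choice to return 0 for a silent frame are assumptions of the sketch.

```python
import numpy as np

def spectral_centroid_frame(magnitudes, freqs):
    """Magnitude-weighted mean frequency of one spectrum frame (in the units of freqs)."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    total = magnitudes.sum()
    if total == 0:
        return 0.0  # silent frame: convention chosen for this sketch
    return float((freqs * magnitudes).sum() / total)

# Toy spectrum: all the energy sits equally at 1000 Hz and 2000 Hz
toy_freqs = np.array([0.0, 1000.0, 2000.0, 3000.0])
toy_mags = np.array([0.0, 1.0, 1.0, 0.0])
toy_centroid = spectral_centroid_frame(toy_mags, toy_freqs)
```

Here the centroid falls halfway between the two active bins; moving energy toward higher bins would raise it, which is what makes the feature a proxy for brightness.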

# Feature extraction with spectral centroid 
x_spectral_centroid = librosa.feature.spectral_centroid(y=audio_signal_1, sr=fs, n_fft=nfft, hop_length=fp).T

x_spectral_centroid

array([[1173.49156085],
       [1173.44240948],
       [1225.6018241 ],
       ...,
       [2030.97788411],
       [2028.13073426],
       [1640.92396239]])

x_spectral_centroid.shape

(1692, 1)

From Spectral Centroid matrices to tensor#

X_spectral_centroid_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='spectral_centroid', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_centroid_tensor

array([[[ 860.68242317,  737.32612638,  702.74733212, ...,
            0.        ,    0.        ,    0.        ]],

       [[1722.98555129, 1677.03902985, 1615.59102764, ...,
            0.        ,    0.        ,    0.        ]],

       [[1479.32504338, 1345.68771268, 1360.28338621, ...,
            0.        ,    0.        ,    0.        ]],

       ...,

       [[ 663.09360913,  586.61483978,  520.02672314, ...,
            0.        ,    0.        ,    0.        ]],

       [[1086.38030366,  971.71842423,  898.1939858 , ...,
            0.        ,    0.        ,    0.        ]],

       [[ 796.84943399,  704.26379079,  679.87846212, ...,
            0.        ,    0.        ,    0.        ]]])

X_spectral_centroid_tensor.shape

(240, 1, 4403)

From Spectral Centroid matrices to predictors matrix (tabular data)#

X_spectral_centroid_stats = get_X_audio_features(paths=files_df['path'], method='spectral_centroid', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_centroid_stats

array([[1222.51022012,  106.02869666],
       [1352.95675704,  292.90811358],
       [1573.29651356,  452.12319515],
       [1148.70689305,  102.08901252],
       [1030.46170316,  187.55238243],
       [1295.44084773,  208.27129492],
       [1263.05181668,  248.44794161],
       [1003.50312509,  133.57028631],
       [1070.73699491,  129.43194965],
       [1069.19199417,  149.36018421],
       [ 770.5005775 ,   59.60935519],
       [ 453.24031129,   29.71792047],
       [1344.7035196 ,  725.68509036],
       [1336.46868934,  608.80623466],
       [1227.09023004,  106.67904364],
       [1434.53104501,  117.39653373],
       [1932.02713538,  170.23029854],
       [ 932.59136027,  124.93733827],
       [ 696.62066277,  276.71068651],
       [1963.60256868,  720.26894443],
       [1915.64251365,  665.98189964],
       [ 919.65891344,  431.09062568],
       [1187.01412327,  447.66393129],
       [1199.32941239,  480.20315422],
       [ 852.61529121,  514.50788938],
       [ 546.81791865,  518.49770286],
       [1510.02991708,  871.70219926],
       [1461.9026807 ,  800.3389051 ],
       [1223.82662023,  393.717687  ],
       [1475.41466613,  469.09860929],
       [1659.40294229,  447.99165796],
       [ 925.68728539,  265.80684328],
       [ 548.3999444 ,  194.34276981],
       [1107.86299172,  579.81563808],
       [1111.05775534,  619.85001803],
       [ 998.67234754,  466.06555326],
       [1286.78972652,  485.63595525],
       [1117.61042445,  546.66797142],
       [ 827.13857334,  437.22395128],
       [ 625.32340127,  591.58583135],
       [1674.71503742,  880.832784  ],
       [1720.21758856,  893.16075877],
       [1103.06324294,  433.56105328],
       [ 844.4973583 ,  351.04323765],
       [ 718.10124674,  252.39332035],
       [ 673.64820996,  287.14242312],
       [ 410.11190074,  201.49233214],
       [1376.72184891,  792.21863795],
       [1599.82076939,  909.11019598],
       [1154.67824106,  125.49427835],
       [1282.40223179,  160.06453406],
       [1130.8653972 ,  131.49315532],
       [ 912.28588184,  266.4720747 ],
       [ 558.01275111,  173.14414971],
       [2232.41678324,  817.17077469],
       [2066.70869079,  803.96679082],
       [ 837.62234866,   38.27332706],
       [1088.90998097,  137.43539107],
       [1505.98407692,  183.41097446],
       [ 556.38792632,  146.56856867],
       [ 495.70141716,   75.9556951 ],
       [1325.96928205,  774.79824914],
       [1292.60576071,  750.76985192],
       [1043.15599798,  247.88217992],
       [1293.25415854,  223.72766107],
       [1199.97054074,  287.51526491],
       [ 827.96868198,  155.45311023],
       [ 518.68214836,  278.34153612],
       [1947.82163181,  661.609216  ],
       [2018.97652252,  634.50370732],
       [1455.71173113,  182.93397549],
       [2016.76583394,  176.43360352],
       [2351.53392258,  128.64121373],
       [1240.41281588,  123.93831135],
       [ 657.52351135,  125.86747875],
       [1457.03697135,  896.46501713],
       [1575.61867723,  651.81612992],
       [1016.49224637,  267.06308765],
       [1111.07640479,  178.33280696],
       [1366.00258619,  178.94406355],
       [ 696.8660634 ,   93.10377053],
       [ 467.16446452,  201.30926384],
       [1717.63722669,  633.03784541],
       [1974.89563535,  629.50697992],
       [ 936.66694381,  107.73159875],
       [1050.76892774,   87.84057546],
       [ 920.38419263,  206.05959952],
       [ 671.67588225,  228.64260454],
       [ 425.20081084,  169.67614849],
       [1594.78178032,  541.29980977],
       [1433.15712446,  588.8918704 ],
       [1305.59875918,  408.09844578],
       [1301.20365956,  243.24999637],
       [1262.78224327,  342.20760944],
       [ 869.19306719,  440.34424592],
       [ 670.95620752,  167.79673256],
       [1295.06480512,  765.94187184],
       [1207.9721599 ,  589.34359313],
       [1107.48436685,   87.46255333],
       [ 904.51082148,  138.23301117],
       [ 659.61637497,   60.49428587],
       [ 598.46330699,   58.91508254],
       [ 358.6928751 ,  118.92731077],
       [1214.56923866,  621.80405993],
       [1299.88392768,  765.12484925],
       [1120.55709123,  150.41430585],
       [1131.68454831,  183.76642489],
       [ 878.98036993,  217.63520394],
       [ 745.92270901,  123.39398383],
       [ 482.80653207,  294.521634  ],
       [1612.57562457,  796.35755135],
       [1619.05712347,  817.94451602],
       [1240.23108306,  105.99526918],
       [1412.34261618,  140.48262655],
       [1816.10740793,  173.0388888 ],
       [ 958.43213814,  125.73785492],
       [ 611.5171495 ,  127.67713156],
       [1513.01507093,  515.21928873],
       [1499.68837371,  448.9238126 ],
       [1086.24913696,  238.71201068],
       [1265.17937979,  397.08646852],
       [1211.67043901,  357.72861428],
       [1185.79322403,  497.15659448],
       [ 618.70811719,  248.36974082],
       [1195.64309399,  633.63411393],
       [1343.92304274,  435.36640078],
       [1081.99046491,  180.17088106],
       [1153.27757916,  158.65034769],
       [ 838.15272545,  342.98686133],
       [ 850.70770902,  287.40833282],
       [ 431.01773158,  191.55889467],
       [1147.73215026,  417.77017472],
       [1238.99110037,  386.54781237],
       [1004.98338201,  264.83963487],
       [ 919.22122622,  382.17909503],
       [ 895.12495843,  229.97896092],
       [ 789.85249873,  264.13786958],
       [ 503.55409286,  304.52139403],
       [1484.23600109,  642.2053158 ],
       [1271.51657377,  549.83873064],
       [ 954.67145971,  234.82784253],
       [ 857.57593056,   63.403128  ],
       [ 463.48546104,   59.01714627],
       [1019.09077456,  518.44780856],
       [1006.55020954,   77.57806083],
       [ 724.72350649,   65.99740948],
       [ 802.21762527,   67.01015058],
       [ 527.28815924,   83.46680564],
       [ 995.2176487 ,   47.12128062],
       [ 927.11164285,   48.01420616],
       [1051.79370871,   99.30393497],
       [ 539.37208744,   34.57917517],
       [ 411.45882376,   57.40506949],
       [ 986.18066408,  409.05198979],
       [1265.35478019,  485.8809656 ],
       [1061.94645059,  119.36894347],
       [ 802.38742318,   44.76000009],
       [ 672.86472897,   39.03110342],
       [ 845.10526287,   71.06031366],
       [ 652.83534576,  186.35787338],
       [1152.88158988,  609.00476796],
       [1146.19868864,  392.66038521],
       [ 879.9539933 ,  376.53670465],
       [1035.28409297,  454.65176759],
       [ 853.44004307,  356.04717158],
       [ 587.15479794,  272.61904313],
       [ 411.3928329 ,  268.69779277],
       [ 978.23016554,  430.44802835],
       [ 794.02097327,  358.97610528],
       [1305.449728  ,  130.67317489],
       [1085.74939654,  267.43239657],
       [1587.86977289,  465.04677322],
       [1106.49953298,  216.79341511],
       [ 972.45273034,  206.25977577],
       [1037.14410916,  133.76358633],
       [1118.91778742,   69.02621605],
       [ 949.21525276,   83.45813494],
       [ 715.89600178,   30.42736107],
       [ 485.03935647,  115.2463449 ],
       [1337.72019182,  169.79429395],
       [1388.97182208,  111.96171378],
       [1919.63043694,  144.22263796],
       [1201.63152769,  563.07556183],
       [ 820.95957046,  455.07968708],
       [ 522.48685935,  544.76495817],
       [1174.78984287,  370.00687267],
       [1520.36950547,  421.93178626],
       [1546.38885385,  481.50915732],
       [ 858.88042607,  223.6199381 ],
       [ 554.7293444 ,  176.71613639],
       [ 637.84029378,  270.84900721],
       [ 606.00574906,  259.01397586],
       [ 402.00421966,  180.32844886],
       [1047.91094996,   83.76691236],
       [1451.93038954,  211.62007407],
       [1169.71254647,  107.64251841],
       [ 821.60202288,   62.4673394 ],
       [ 568.13483024,   25.36348971],
       [ 846.09733504,  145.28262361],
       [1088.22458778,   51.95388124],
       [1199.58404603,   96.86233842],
       [ 584.10389212,  198.87244551],
       [ 410.27813316,   34.64098124],
       [ 968.22362454,   46.11834746],
       [1200.02490971,   50.67794376],
       [1291.47840853,  116.71885134],
       [ 776.78830327,   54.43044729],
       [ 520.73293129,   69.82289885],
       [1749.12939565,  113.14448623],
       [2141.48645895,  179.18077818],
       [2451.50580028,  254.43382404],
       [1518.5467943 ,   65.97152251],
       [ 744.13249207,   43.9597291 ],
       [1233.95619608,  128.76257334],
       [1703.33919116,  231.93582054],
       [1645.17497616,  142.02142305],
       [ 615.46242043,   41.83108505],
       [ 415.43269335,   36.21164266],
       [ 987.57728743,   62.81949447],
       [ 877.38299317,   52.83187359],
       [ 829.57967244,   73.9722045 ],
       [ 499.07556106,   41.32029723],
       [ 364.52448202,   60.77233679],
       [1217.35533547,  126.38074794],
       [1047.59734963,  110.17178051],
       [ 720.80693174,   65.85325014],
       [ 699.72368554,   17.16116139],
       [ 524.44371865,  433.14241086],
       [1274.51490416,   40.97357169],
       [1591.47096766,  100.78604912],
       [1468.84288746,  130.4952661 ],
       [ 930.91711822,   43.56742745],
       [ 557.03528738,   38.03243915],
       [1086.24913696,  238.71201068],
       [1265.17937979,  397.08646852],
       [1211.67043901,  357.72861428],
       [1185.79322403,  497.15659448],
       [ 618.70811719,  248.36974082],
       [1282.07792881,   64.07181308],
       [1123.23739168,   66.49508231]])

X_spectral_centroid_stats.shape

(240, 2)

Spectral Bandwidth#

Spectral Bandwidth measures the spread of the spectrum around its center of mass. As computed by librosa.feature.spectral_bandwidth, it is the magnitude-weighted deviation of the frequencies from the spectral centroid (of order p=2 by default), rather than a width measured at a fixed fraction of the peak magnitude. It can indicate the complexity of a sound: a wider bandwidth signifies a noise-like or complex sound, while a narrow bandwidth indicates a tonal or simple sound.
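
For one frame, the bandwidth that librosa.feature.spectral_bandwidth reports by default can be sketched as a second-order spread of the normalized magnitudes around the spectral centroid. The code below is a minimal NumPy illustration under that assumption (order p = 2, magnitudes normalized to sum to one); the function name and the toy spectra are hypothetical.

```python
import numpy as np

def spectral_bandwidth_frame(magnitudes, freqs, p=2):
    """p-th order spread of one spectrum frame around its centroid (in the units of freqs)."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    total = magnitudes.sum()
    if total == 0:
        return 0.0  # silent frame: convention chosen for this sketch
    weights = magnitudes / total           # normalize so the weights sum to one
    centroid = (freqs * weights).sum()     # magnitude-weighted mean frequency
    return float((weights * np.abs(freqs - centroid) ** p).sum() ** (1.0 / p))

toy_freqs = np.array([0.0, 1000.0, 2000.0, 3000.0])
narrow_bw = spectral_bandwidth_frame([0, 1, 1, 0], toy_freqs)  # energy close to the centroid
wide_bw = spectral_bandwidth_frame([1, 0, 0, 1], toy_freqs)    # energy far from the centroid
```

Both toy spectra share the same centroid (1500 Hz), yet the second one has three times the bandwidth, which is why the two features are complementary.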

# Feature extraction with spectral bandwidth
x_spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio_signal_1, sr=fs, n_fft=nfft, hop_length=fp).T

x_spectral_bandwidth

array([[1309.30089921],
       [1255.1158018 ],
       [1275.03885178],
       ...,
       [2427.90186889],
       [2420.073342  ],
       [2256.56008742]])

x_spectral_bandwidth.shape

(1692, 1)

From Spectral Bandwidth matrices to tensor#

X_spectral_bandwidth_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='spectral_bandwidth', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_bandwidth_tensor

array([[[1143.32497568,  942.13602371,  870.97126742, ...,
            0.        ,    0.        ,    0.        ]],

       [[1785.06408858, 1760.67818379, 1734.84937353, ...,
            0.        ,    0.        ,    0.        ]],

       [[1574.27623182, 1486.74798403, 1545.91856   , ...,
            0.        ,    0.        ,    0.        ]],

       ...,

       [[ 901.28159703,  722.00737352,  537.66098854, ...,
            0.        ,    0.        ,    0.        ]],

       [[1617.36931068, 1553.21232249, 1469.45092787, ...,
            0.        ,    0.        ,    0.        ]],

       [[1417.14548475, 1323.33541644, 1265.27486656, ...,
            0.        ,    0.        ,    0.        ]]])

X_spectral_bandwidth_tensor.shape

(240, 1, 4403)

From Spectral Bandwidth matrices to predictors matrix (tabular data)#

X_spectral_bandwidth_stats = get_X_audio_features(paths=files_df['path'], method='spectral_bandwidth', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_bandwidth_stats

array([[ 547.71039463,  190.68833226],
       [ 949.60789304,  210.65348963],
       [1257.37962494,  221.11233883],
       [ 655.10901947,  266.61427025],
       [1036.72287665,  183.3495748 ],
       [1087.49948991,  499.1671853 ],
       [1112.26886734,  453.83392324],
       [1008.52278684,  169.65089348],
       [1136.31978971,  114.58698542],
       [1369.45105255,  137.89868859],
       [ 733.60358082,   90.93676414],
       [ 600.8755911 ,   55.60126607],
       [1797.74599018,  734.56558902],
       [1776.10775885,  612.33648523],
       [ 944.46518321,  149.55301631],
       [1434.2496765 ,  105.46849527],
       [1954.60189389,  100.56700799],
       [ 910.6004008 ,  154.15121229],
       [ 765.39650212,  290.37160189],
       [1983.59218098,  620.01437049],
       [1981.69727227,  581.72095076],
       [ 998.51875333,  408.45410957],
       [1450.18425688,  454.85092884],
       [1518.25136553,  482.11063008],
       [1094.91762071,  458.69315579],
       [ 781.96866483,  474.70930895],
       [1680.16490268,  888.4293138 ],
       [1668.06768676,  872.49278575],
       [ 928.87365496,  333.25661316],
       [1344.69901462,  431.16109861],
       [1744.15188681,  462.07526467],
       [ 752.74092272,  240.32506969],
       [ 576.19869778,  231.88184117],
       [1192.73007265,  740.45643954],
       [1200.84935927,  759.63084532],
       [1179.96885612,  430.67727142],
       [1414.42977622,  473.60777341],
       [1379.5033354 ,  547.01904526],
       [ 997.58955261,  458.86070221],
       [ 780.89645038,  545.48013567],
       [1834.09953118,  878.99621659],
       [1802.03908387,  843.64936634],
       [1391.77918637,  472.65750026],
       [1122.56229877,  425.60571427],
       [1130.8964514 ,  380.10071557],
       [ 924.95368606,  360.77284215],
       [ 658.94058718,  281.22860998],
       [1747.89250412,  837.00998138],
       [1876.11907605,  878.56767846],
       [1150.51058542,  163.12005018],
       [1391.4578276 ,  155.95806739],
       [1421.86726724,  116.50717496],
       [1016.2546602 ,  277.75449182],
       [ 760.3493956 ,  190.70772893],
       [2153.65553644,  625.44801062],
       [2118.80433895,  579.84525812],
       [1069.7433535 ,   56.92207264],
       [1291.27802182,  163.95136861],
       [1696.43408677,  156.86229481],
       [ 606.35181409,  174.97106457],
       [ 607.38289993,  113.47758636],
       [1611.89720517,  627.53012741],
       [1547.5327357 ,  714.24148487],
       [1041.59028291,  272.68499906],
       [1363.07790584,  178.53626408],
       [1423.32899711,  202.9157379 ],
       [ 869.5176206 ,  229.47322246],
       [ 624.08077935,  347.96106504],
       [2087.82096024,  525.85130172],
       [2121.54004258,  493.39515192],
       [1223.00502404,  185.80370432],
       [1574.62273867,  146.79569926],
       [1648.41297682,   88.48103811],
       [1135.67503014,  150.54060209],
       [ 776.96386065,  148.87374308],
       [1682.99049988,  680.78042209],
       [1743.76985717,  540.17759968],
       [1163.50460259,  224.07209218],
       [1210.70608323,  175.91447923],
       [1592.28224404,  168.65178026],
       [ 922.67344763,  157.202565  ],
       [ 666.76944657,  262.93292986],
       [1956.48308592,  488.23418425],
       [2095.05972543,  456.60586165],
       [1026.10393124,  131.87057915],
       [1287.98723388,  131.12888481],
       [1302.89833322,  201.50687481],
       [ 792.34975876,  270.71595399],
       [ 610.823858  ,  256.12631197],
       [1827.38726949,  548.96591199],
       [1618.47020489,  591.53908473],
       [1594.61260724,  488.85688559],
       [1432.35396378,  248.86526491],
       [1531.56994973,  315.53675823],
       [ 948.51498137,  435.15046369],
       [ 792.11089674,  251.55952899],
       [1639.67946997,  643.69785873],
       [1600.16586035,  572.76353883],
       [1131.27915623,  112.95820103],
       [ 982.40600485,  155.45666168],
       [ 887.76595649,   62.21072306],
       [ 722.65582589,  110.08114967],
       [ 431.76410788,  184.1218538 ],
       [1512.36677254,  628.8341862 ],
       [1509.77070167,  622.34071884],
       [1030.26075925,  160.63266403],
       [1262.29908095,  185.85317336],
       [1297.77434675,  241.76533414],
       [ 764.37461079,  188.07892445],
       [ 683.35134368,  301.70878573],
       [1829.97074471,  612.43151148],
       [1876.31467085,  614.45525045],
       [1108.41622112,  135.34501357],
       [1271.97741246,  121.13148386],
       [1679.86350265,   95.0385343 ],
       [ 941.60400606,  173.17222557],
       [ 726.77203105,  160.9577211 ],
       [1812.53980695,  453.54587947],
       [1837.47311421,  400.29246761],
       [1181.20945294,  264.87839746],
       [1339.13617025,  281.22975215],
       [1431.87077073,  300.64451574],
       [1473.11111715,  469.10600388],
       [ 765.28143359,  485.76426098],
       [1409.63193315,  573.17811736],
       [1640.4477585 ,  477.29232806],
       [1388.54848504,  221.80750609],
       [1492.73102686,  200.05235862],
       [1256.56298565,  330.96153532],
       [1192.32237196,  248.60279257],
       [ 703.39447663,  240.42795929],
       [1693.1146841 ,  387.49392021],
       [1801.11448512,  388.751357  ],
       [1216.93121413,  231.77629827],
       [1299.51681861,  335.36366565],
       [1293.62730691,  204.6915478 ],
       [1098.41744346,  239.14238942],
       [ 691.07720828,  328.55285461],
       [1887.37511249,  526.61287901],
       [1626.01162602,  551.22178193],
       [1404.84954745,  237.46471246],
       [1205.37324391,   91.12948719],
       [ 747.92342639,   89.61994738],
       [1266.49210553,  442.63119496],
       [1381.69916656,   99.82853962],
       [1113.57876763,   91.18169824],
       [1137.64027651,  120.18712096],
       [ 742.21175805,  115.58692178],
       [1231.78409791,   76.50641214],
       [1215.58342974,   67.6930879 ],
       [1362.75532412,   85.21833493],
       [ 635.46891407,   58.44085211],
       [ 536.71555155,   73.82526798],
       [1191.76321633,  332.56765386],
       [1470.35956029,  373.72426834],
       [1439.61783807,  140.32685033],
       [ 849.68501895,  162.7774635 ],
       [ 740.16355888,   94.14465483],
       [ 787.83229561,  109.67583212],
       [ 696.84896944,  199.117318  ],
       [1227.96684984,  577.06680691],
       [1433.67157464,  434.44200338],
       [1066.80459777,  468.9600619 ],
       [1195.8029285 ,  524.38543667],
       [1132.34848677,  489.94946752],
       [ 719.38047665,  338.51121742],
       [ 544.73690499,  339.57961169],
       [1252.72449374,  548.3030673 ],
       [1045.3532014 ,  474.423362  ],
       [ 546.60899718,  217.87787246],
       [ 856.17924212,  252.99940333],
       [1308.76355506,  206.04033045],
       [ 708.20961252,  313.05600769],
       [ 991.19036361,  152.4540067 ],
       [ 980.93440482,   96.21251148],
       [1157.47657461,   89.9392409 ],
       [1242.7218421 ,   82.83212673],
       [ 654.05963361,   54.67993009],
       [ 636.10528493,  148.59982779],
       [1100.26653408,  159.92706773],
       [1405.34438759,  103.9836515 ],
       [1933.27891787,  105.62788057],
       [1615.43907393,  566.80756349],
       [1028.6059865 ,  434.36789787],
       [ 690.21078958,  449.58579489],
       [ 902.52586509,  322.6426786 ],
       [1437.8354143 ,  405.07221715],
       [1646.75598976,  500.43886899],
       [ 684.41904321,  228.31053949],
       [ 571.03412365,  223.37679209],
       [1023.75064056,  440.86647935],
       [ 841.17986868,  371.71613754],
       [ 647.2927289 ,  315.58535673],
       [1067.50132366,   98.40618407],
       [1551.7717964 ,  167.8117326 ],
       [1454.25681417,   79.24770319],
       [ 898.37609186,   77.95295461],
       [ 807.11121289,   53.51446775],
       [1037.32271118,  155.88011782],
       [1241.43686083,   68.20723684],
       [1517.42713872,  103.66003445],
       [ 620.35148557,  270.95841517],
       [ 490.10342053,   54.01293507],
       [1102.42860971,   67.18981059],
       [1241.68262464,   85.18896581],
       [1387.5167652 ,   78.82992207],
       [ 774.11574743,   78.36547621],
       [ 614.72394938,   89.71526665],
       [1335.08933938,   78.1218587 ],
       [1529.72164262,   70.41657334],
       [1712.55220471,   68.95098386],
       [1370.8300994 ,   44.90133942],
       [ 831.70460643,   52.90542831],
       [1386.43632492,  154.02155861],
       [1684.60163611,  135.32612933],
       [1944.06337635,  109.00798274],
       [ 715.42204263,   80.54445512],
       [ 624.43774291,   70.63284885],
       [ 957.89068915,  101.27828858],
       [ 954.28540925,   64.33790172],
       [1057.14680113,   64.67690657],
       [ 541.41027048,   57.86087652],
       [ 431.83741107,   82.06721887],
       [1013.56250988,  150.78494287],
       [1172.32433873,  127.75601552],
       [1075.96804169,  114.88108511],
       [ 682.82001004,   58.45651973],
       [ 718.01197724,  432.07213832],
       [1129.95979784,   46.54439183],
       [1461.68454365,   50.35623328],
       [1554.21693409,   55.26445159],
       [ 903.88887738,   51.53510667],
       [ 642.34101444,   51.66781636],
       [1181.20945294,  264.87839746],
       [1339.13617025,  281.22975215],
       [1431.87077073,  300.64451574],
       [1473.11111715,  469.10600388],
       [ 765.28143359,  485.76426098],
       [1582.67326195,   83.16492397],
       [1462.25168166,   84.83617958]])

X_spectral_bandwidth_stats.shape

(240, 2)

Spectral Contrast#

Spectral Contrast considers the difference in amplitude between peaks and valleys in the spectrum. This feature can be used to distinguish between different types of sound textures and timbres, as it effectively captures the dynamics of the spectral peaks and troughs over time.
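
To make the peak-versus-valley idea concrete, the sketch below computes a contrast value for a single frequency band of a single frame. It is a simplified illustration, not librosa's implementation: the function name, the 2% quantile, the small constant `eps`, and the toy bands are all assumptions. In librosa the spectrum is split into several frequency bands and one value is returned per band, which is why each frame has 7 values in the outputs of this section.

```python
import numpy as np

def band_contrast_db(band_magnitudes, quantile=0.02):
    """Difference in dB between the strongest and weakest bins of one frequency band."""
    sorted_mags = np.sort(np.asarray(band_magnitudes, dtype=float))
    n = max(1, int(round(quantile * sorted_mags.size)))  # bins used for peak and valley
    valley = sorted_mags[:n].mean()   # average of the weakest bins
    peak = sorted_mags[-n:].mean()    # average of the strongest bins
    eps = 1e-10                       # avoids log of zero for empty bins
    return float(10.0 * np.log10((peak + eps) / (valley + eps)))

# A flat (noise-like) band versus a band with a few strong peaks (tone-like)
flat_band = np.full(100, 1.0)
peaky_band = np.concatenate([np.full(95, 0.01), np.full(5, 1.0)])
flat_contrast = band_contrast_db(flat_band)
peaky_contrast = band_contrast_db(peaky_band)
```

The flat band yields a contrast of about 0 dB, while the band with isolated peaks yields about 20 dB.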

# Feature extraction with Spectral Contrast 
x_spectral_contrast = librosa.feature.spectral_contrast(y=audio_signal_1, sr=fs, n_fft=nfft, hop_length=fp).T

x_spectral_contrast

array([[ 5.63810593,  5.66707538,  6.30149297, ..., 13.96765089,
        12.25343919, 10.17588324],
       [ 6.21691956, 11.38575072, 11.8819769 , ..., 21.75429392,
        18.58376962, 15.36240143],
       [25.49436508, 18.267316  , 25.67031318, ..., 19.25143146,
        17.4834322 , 15.83951655],
       ...,
       [ 9.64391107,  3.7931424 , 10.92962245, ..., 17.23052401,
        15.0653737 , 11.47613243],
       [11.83635255, 10.54209236,  8.97554807, ..., 17.5021864 ,
        10.951186  , 12.79228362],
       [ 8.21206286,  2.71809055,  4.90784151, ..., 17.91953245,
        11.35885709, 12.94873939]])

x_spectral_contrast.shape

(1692, 7)

From Spectral Contrast matrices to tensor#

X_spectral_contrast_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='spectral_contrast', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_contrast_tensor

array([[[ 5.11662149, 10.292627  , 18.93497539, ...,  0.        ,
          0.        ,  0.        ],
        [ 6.59230729, 15.67208379,  8.04153712, ...,  0.        ,
          0.        ,  0.        ],
        [ 8.45520947, 17.5710727 , 18.92689921, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [12.37949689, 17.93070151, 20.61663416, ...,  0.        ,
          0.        ,  0.        ],
        [19.41221672, 15.34366631, 17.42559859, ...,  0.        ,
          0.        ,  0.        ],
        [17.22184445, 15.558392  , 17.60068867, ...,  0.        ,
          0.        ,  0.        ]],

       [[ 8.85085044,  9.03121079, 11.46311676, ...,  0.        ,
          0.        ,  0.        ],
        [ 6.52008624, 12.60085564,  4.4187557 , ...,  0.        ,
          0.        ,  0.        ],
        [ 6.68093446,  8.71320422, 13.95862644, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [16.7197144 , 13.70447852, 10.81675158, ...,  0.        ,
          0.        ,  0.        ],
        [11.91122822, 11.42524776, 18.37071563, ...,  0.        ,
          0.        ,  0.        ],
        [13.56250443, 21.8735152 , 20.07728048, ...,  0.        ,
          0.        ,  0.        ]],

       [[ 5.21670479, 13.65392556, 14.58397327, ...,  0.        ,
          0.        ,  0.        ],
        [ 4.65187137,  9.55822925, 10.14831792, ...,  0.        ,
          0.        ,  0.        ],
        [10.39356395, 11.80113395, 11.72104283, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [13.04850063, 20.12447109, 14.33627266, ...,  0.        ,
          0.        ,  0.        ],
        [13.14200444, 14.82449875, 12.20986094, ...,  0.        ,
          0.        ,  0.        ],
        [11.37148345, 13.42516093, 17.88007679, ...,  0.        ,
          0.        ,  0.        ]],

       ...,

       [[ 0.64248729,  2.0961406 ,  6.65306162, ...,  0.        ,
          0.        ,  0.        ],
        [ 7.20312337, 11.00841146, 23.68205907, ...,  0.        ,
          0.        ,  0.        ],
        [13.08904135, 17.47975944, 33.00536649, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [11.80279665,  9.73920911, 20.00545211, ...,  0.        ,
          0.        ,  0.        ],
        [ 6.41256743,  9.66989794, 26.14086535, ...,  0.        ,
          0.        ,  0.        ],
        [ 9.23506418, 12.4874168 , 25.51032203, ...,  0.        ,
          0.        ,  0.        ]],

       [[ 9.32894184, 13.2649533 , 29.12562629, ...,  0.        ,
          0.        ,  0.        ],
        [ 4.84771882, 10.07119587,  9.77597993, ...,  0.        ,
          0.        ,  0.        ],
        [ 4.30891259,  9.2015415 , 16.8332916 , ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [10.13391008, 14.07278776, 16.34504072, ...,  0.        ,
          0.        ,  0.        ],
        [12.76159498, 19.44753392, 18.42114983, ...,  0.        ,
          0.        ,  0.        ],
        [12.27059947, 35.52131359, 18.83002358, ...,  0.        ,
          0.        ,  0.        ]],

       [[ 5.41242976, 11.35503993, 29.50033658, ...,  0.        ,
          0.        ,  0.        ],
        [ 8.64307322,  9.49400294, 13.23320113, ...,  0.        ,
          0.        ,  0.        ],
        [ 7.9839711 , 15.35658127, 22.49312214, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [11.60944505, 16.85032802, 12.1691571 , ...,  0.        ,
          0.        ,  0.        ],
        [13.09084623, 17.7903069 , 20.16377032, ...,  0.        ,
          0.        ,  0.        ],
        [ 7.19329074, 11.87063984, 18.35145274, ...,  0.        ,
          0.        ,  0.        ]]])

X_spectral_contrast_tensor.shape

(240, 7, 4403)

From Spectral Contrast matrices to predictors matrix (tabular data)#

X_spectral_contrast_stats = get_X_audio_features(paths=files_df['path'], method='spectral_contrast', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_contrast_stats

array([[15.06088705, 13.66975386, 21.81056366, ...,  4.84396642,
         4.33531148,  3.91902237],
       [10.96765446, 10.62080312, 21.04077107, ...,  4.59791976,
         4.76635799,  3.87400562],
       [14.57345365, 13.03160549, 20.7638555 , ...,  5.36559019,
         4.00373309,  5.56484739],
       ...,
       [ 8.84012428, 20.74519205, 22.22548112, ...,  3.60031112,
         4.46534307,  5.43222996],
       [22.70844871, 13.80892128, 13.44845706, ...,  3.08630459,
         2.73138099,  4.70273429],
       [24.06449453, 14.47909753, 17.42508059, ...,  3.08591616,
         2.60190525,  3.64455257]])

X_spectral_contrast_stats.shape

(240, 14)

Spectral Rolloff#

Spectral Rolloff is a measure of the shape of the spectrum. It represents the frequency below which a certain percentage of the total spectral energy is contained, typically between 85% and 95% (librosa uses 85% by default). This can indicate whether the sound is noise-like or tone-like.
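
The rolloff frequency of one frame can be sketched with a cumulative sum over the spectrum. This is a minimal NumPy illustration that uses spectral magnitudes and an 85% threshold; the function name and the toy spectra are assumptions, and librosa's own implementation may differ in details.

```python
import numpy as np

def spectral_rolloff_frame(magnitudes, freqs, roll_percent=0.85):
    """Lowest bin frequency at which the cumulative magnitude reaches roll_percent of the total."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    idx = int(np.searchsorted(cumulative, threshold))  # first bin with cumulative >= threshold
    return float(freqs[min(idx, freqs.size - 1)])

toy_freqs = np.array([0.0, 1000.0, 2000.0, 3000.0])
low_rolloff = spectral_rolloff_frame([10, 1, 0.5, 0.5], toy_freqs)  # energy mostly in low bins
high_rolloff = spectral_rolloff_frame([1, 1, 1, 10], toy_freqs)     # energy mostly in the top bin
```

Because the result is always one of the bin frequencies, rolloff values computed this way are quantized to the frequency resolution of the transform.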

# Feature extraction with spectral rolloff 
x_spectral_rolloff = librosa.feature.spectral_rolloff(y=audio_signal_1, sr=fs, n_fft=nfft, hop_length=fp).T

x_spectral_rolloff

array([[1750.  ],
       [1687.5 ],
       [1718.75],
       ...,
       [5468.75],
       [5250.  ],
       [4562.5 ]])

x_spectral_rolloff.shape

(1692, 1)

From Spectral Rolloff matrices to tensor#

X_spectral_rolloff_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='spectral_rolloff', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_rolloff_tensor

array([[[1375.  , 1312.5 , 1281.25, ...,    0.  ,    0.  ,    0.  ]],

       [[3781.25, 3750.  , 3656.25, ...,    0.  ,    0.  ,    0.  ]],

       [[3031.25, 2750.  , 2906.25, ...,    0.  ,    0.  ,    0.  ]],

       ...,

       [[ 843.75,  750.  ,  750.  , ...,    0.  ,    0.  ,    0.  ]],

       [[2343.75, 1843.75, 1593.75, ...,    0.  ,    0.  ,    0.  ]],

       [[1406.25,  937.5 ,  781.25, ...,    0.  ,    0.  ,    0.  ]]])

X_spectral_rolloff_tensor.shape

(240, 1, 4403)

From Spectral Rolloff matrices to predictors matrix (tabular data)#

X_spectral_rolloff_stats = get_X_audio_features(paths=files_df['path'], method='spectral_rolloff', stats='mean-std', sr=fs, n_fft=nfft, hop_length=fp)

X_spectral_rolloff_stats

array([[1578.6367205 ,  294.68382051],
       [2307.13619403,  645.09551762],
       [2839.43103941,  709.64216797],
       [1382.05467372,  513.57783209],
       [1389.07657658,  490.49725542],
       [2164.73642173,  946.00923455],
       [2131.14017572,  893.32214866],
       [1198.98875753,  441.14167795],
       [1719.21762126,  266.29955367],
       [1973.86210005,  596.13730246],
       [ 935.67607004,   58.1070769 ],
       [ 559.38546901,   43.88905301],
       [3432.1086262 , 2201.60451981],
       [3117.26238019, 2041.97472162],
       [1454.17764396,  311.29167221],
       [2514.0600159 ,  344.8225258 ],
       [3772.02406923,  242.67752925],
       [1033.64672482,  314.1257139 ],
       [ 857.72165698,  737.38720186],
       [4216.0543131 , 1976.77369894],
       [4163.7879393 , 1871.63426645],
       [1150.74325951,  957.16552888],
       [1879.84330484, 1069.92905233],
       [1613.3592832 , 1206.71699409],
       [1071.52650823, 1173.53613121],
       [ 742.73830935, 1209.61572622],
       [3458.38881491, 2228.09735205],
       [3318.49201065, 2146.37621317],
       [1563.86181193,  587.91731402],
       [2377.50536865,  794.41661805],
       [3324.47444612,  935.64808442],
       [1070.78607743,  450.20903218],
       [ 578.27019299,  320.58401002],
       [2141.57133244, 1623.96398923],
       [2255.82067371, 1731.8901964 ],
       [1338.34609684, 1144.35899745],
       [2138.61094762, 1135.79701637],
       [1576.4553429 , 1465.20702938],
       [1064.16693445, 1133.08929299],
       [ 890.07429164, 1388.3017005 ],
       [3858.66389914, 2224.35858576],
       [3799.33621718, 2181.85116565],
       [1435.80493741,  981.24560147],
       [ 819.73659717,  865.00084828],
       [ 387.73311049,  542.12799179],
       [ 707.5970201 ,  693.06733494],
       [ 367.19939117,  453.92438608],
       [3398.85831382, 2151.01256025],
       [3854.65730676, 2264.88491817],
       [1484.47888963,  424.14674694],
       [2131.84875328,  495.68605309],
       [1553.65861292,  872.72498138],
       [1139.91434203,  817.36289266],
       [ 511.07848372,  485.98713341],
       [4873.64682003, 1927.01127398],
       [4627.18563988, 1885.06710895],
       [ 885.60267857,  102.62042625],
       [1622.40740741,  405.69647845],
       [2545.17067124,  524.42803916],
       [ 728.3779985 ,  435.20021324],
       [ 614.2288197 ,  241.25372691],
       [2866.61161335, 2073.9756643 ],
       [2820.19297636, 2160.34752419],
       [1350.7054849 ,  731.81093646],
       [2099.0182803 ,  659.98831582],
       [2017.0386309 ,  922.31060334],
       [ 940.95595127,  541.41208085],
       [ 707.85501701,  802.86068199],
       [4471.5379494 , 1783.17038779],
       [4600.5859375 , 1677.91194826],
       [2507.421875  ,  568.81464629],
       [3827.36703682,  340.24512245],
       [3772.27489867,  219.35218214],
       [1837.5743205 ,  605.67155563],
       [ 803.093292  ,  381.85386404],
       [3169.83103198, 2294.27586906],
       [3678.22265625, 1740.43691076],
       [1223.09149184,  714.45317936],
       [1768.66442398,  561.82312587],
       [2358.01282051,  702.80736945],
       [ 863.04925157,  245.87019308],
       [ 639.68005498,  610.64071127],
       [3918.12586685, 1802.59150778],
       [4367.35048679, 1738.39212478],
       [1250.77033689,  334.76705459],
       [1385.63724193,  359.19361123],
       [ 746.30905512,  940.9836373 ],
       [ 993.31825658,  676.06153863],
       [ 494.01735624,  567.95022001],
       [3541.8667467 , 1619.45666857],
       [3040.68877551, 1681.95144235],
       [2542.87732042, 1530.66887727],
       [1905.97716588,  951.12761808],
       [1653.24039421, 1049.17231359],
       [1206.449877  , 1138.26826948],
       [ 839.93267629,  615.15670034],
       [2860.92862216, 1905.56375762],
       [2702.61310452, 1755.80060827],
       [1594.22451193,  327.23041889],
       [1675.16727494,  439.57349985],
       [ 518.33890031,  264.28908366],
       [ 658.03993694,  220.93226459],
       [ 458.62950763,  369.56606731],
       [2662.11752434, 1811.07551733],
       [2807.83913352, 1832.78082539],
       [1584.4495552 ,  404.61854163],
       [1358.60736926,  659.89217544],
       [ 495.10287486,  509.62082461],
       [ 972.95966229,  371.32572981],
       [ 574.41271963,  803.2310604 ],
       [3557.82312925, 2144.49266002],
       [3726.19047619, 2119.92357497],
       [1515.05245272,  301.74537836],
       [2398.69698992,  299.92106988],
       [3535.37936091,  318.58815764],
       [1169.01221455,  388.05427157],
       [ 713.34050721,  374.83249456],
       [3332.69583843, 1454.03962884],
       [3392.69950565, 1332.2369463 ],
       [1602.72051148,  738.35120943],
       [2125.28564899,  936.30804244],
       [2020.64620758, 1175.09501504],
       [2207.29041013, 1483.82146809],
       [ 999.25947867,  831.29376621],
       [2314.53894807, 1610.61560691],
       [2829.20396419, 1426.87374346],
       [1655.62157221,  728.62985827],
       [1711.31921824,  823.40781998],
       [ 838.90608181, 1046.28110475],
       [ 901.42413607,  781.104817  ],
       [ 451.66782087,  532.05512018],
       [2486.060253  , 1527.98495191],
       [2798.51190476, 1567.08923551],
       [1284.55305533,  760.17585358],
       [1079.35138081, 1303.49345057],
       [ 713.57725892,  832.30255882],
       [ 865.06689233,  770.61642087],
       [ 611.8787092 ,  853.21561376],
       [3565.29017857, 1940.59921084],
       [2920.45454545, 1728.230626  ],
       [ 708.52359209,  981.93363476],
       [ 792.44056464,  137.08473086],
       [ 372.12171053,  111.73951411],
       [1690.58035714, 1539.2421707 ],
       [ 922.27450284,  290.62243435],
       [ 360.97301136,   97.36575466],
       [ 736.77721088,   62.23804375],
       [ 529.17268786,  126.82099239],
       [1234.23549107,   72.30946634],
       [ 825.94722598,  259.25639394],
       [1294.78561047,  677.6963999 ],
       [ 715.99786932,   80.38751358],
       [ 423.50685379,   96.94177263],
       [1363.87343533,  964.31350655],
       [2075.49307036, 1381.86850838],
       [ 617.22452607,  329.05594076],
       [ 879.95173103,  183.29925875],
       [ 903.49786932,   80.90353576],
       [ 951.75289312,  112.24318425],
       [ 827.81668428,  508.84341295],
       [1702.40902965, 1509.62467979],
       [1802.83434232,  877.59688545],
       [1009.65698393,  601.88516404],
       [1498.53936039,  813.20652401],
       [ 785.60419236,  680.99685271],
       [ 696.26383764,  505.39312914],
       [ 511.87730627,  712.10340385],
       [1087.21187943,  774.4801005 ],
       [ 716.38492556,  694.43894157],
       [1642.89459885,  406.59378677],
       [1658.82402995,  885.7156183 ],
       [2958.08649289,  784.06241789],
       [1472.41257089,  718.68630045],
       [1217.11198094,  416.04509203],
       [1211.59306908,  222.08965433],
       [1761.02669783,  131.25278913],
       [1794.42084542,  511.0219202 ],
       [ 917.73177593,   37.24245717],
       [ 640.26889244,  348.26059351],
       [1551.57215558,  465.66772474],
       [2396.99777238,  326.26780501],
       [3725.06671588,  327.63435257],
       [1485.78973843, 1488.31734151],
       [1031.66986564, 1149.73662961],
       [ 694.88324176, 1182.76746538],
       [1541.82284876,  611.18272829],
       [2386.76205654,  710.5439903 ],
       [3066.91817301, 1004.65072277],
       [1062.02084332,  375.19180662],
       [ 591.82375823,  380.57416718],
       [ 316.61285363,  235.69780354],
       [ 623.47931873,  445.01984414],
       [ 378.21691176,  437.90651105],
       [1320.53450609,   61.98142231],
       [2320.18757688,  427.8118124 ],
       [2035.59348093,  688.9114003 ],
       [ 956.03241297,  110.41399843],
       [ 440.09182464,   28.70635485],
       [ 971.26582994,  455.02806315],
       [1676.99829932,  135.11377637],
       [1734.31258322,  557.0219678 ],
       [ 822.23701731,  644.89906846],
       [ 494.94363395,   53.96194693],
       [1147.77835408,   95.13631694],
       [1950.9422545 ,  190.85299681],
       [2314.54100145,  296.0713833 ],
       [ 894.4868608 ,   88.69579563],
       [ 580.39196568,   91.79478257],
       [3156.78206583,  332.77839626],
       [3688.32579972,  313.18270765],
       [3933.54485396,  155.21424184],
       [2919.3452381 ,  228.22044916],
       [ 846.37150466,   44.9196164 ],
       [1517.17687075,  671.18269131],
       [3185.18350291,  665.77052429],
       [3591.88771802,  834.20846904],
       [ 803.58403955,   88.53154462],
       [ 489.20068027,   75.8860565 ],
       [1400.94866071,   68.51900701],
       [1552.53120666,  319.54741893],
       [1524.29552023,  688.20548111],
       [ 643.68472585,   64.15240438],
       [ 475.21989175,  109.41470265],
       [1646.94888734,  136.62034684],
       [1394.30930398,  453.16206269],
       [ 423.51740057,  143.80506776],
       [ 910.06747159,   45.05668258],
       [ 761.76286073, 1183.23349718],
       [1596.90525588,   71.92158386],
       [2794.00510204,  118.31368454],
       [3195.28061224,  153.84387488],
       [1079.81178977,   23.01793203],
       [ 524.84151047,   87.25957695],
       [1602.72051148,  738.35120943],
       [2125.28564899,  936.30804244],
       [2020.64620758, 1175.09501504],
       [2207.29041013, 1483.82146809],
       [ 999.25947867,  831.29376621],
       [1644.75835756,  199.74250599],
       [1567.00680272,  338.52262425]])

X_spectral_rolloff_stats.shape

(240, 2)
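
The two columns are consistent with one mean and one standard deviation per audio, computed over the time frames. As a hedged illustration of what `stats='mean-std'` might do for a single audio (the actual helper is private, and both `mean_std_features` and the ordering "means first, then standard deviations" are assumptions):

```python
import numpy as np

def mean_std_features(feature_matrix):
    """Collapse a (n_frames, n_features) matrix into a fixed-length vector
    of length 2 * n_features: per-feature means followed by per-feature stds."""
    means = feature_matrix.mean(axis=0)
    stds = feature_matrix.std(axis=0)
    return np.concatenate([means, stds])

# Toy feature matrix (hypothetical): 2 frames, 2 features
toy_features = np.array([[1.0, 10.0],
                         [3.0, 30.0]])
toy_vector = mean_std_features(toy_features)
print(toy_vector)
```

Stacking one such vector per audio gives a tabular predictors matrix whose size does not depend on the duration of each recording.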

Zero Crossing Rate#

The Zero Crossing Rate is the rate at which the signal changes from positive to negative or back. This feature is often used to measure the noisiness or the frequency content of a sound. A higher zero-crossing rate indicates a noisier signal or a higher frequency content.

# Feature extraction with zero crossing rate
x_zero_crossing_rate = librosa.feature.zero_crossing_rate(y=audio_signal_1, hop_length=fp).T

x_zero_crossing_rate

array([[0.0390625 ],
       [0.04443359],
       [0.04882812],
       ...,
       [0.04443359],
       [0.04443359],
       [0.04443359]])

x_zero_crossing_rate.shape

(1692, 1)
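
As an illustration of the definition, the zero crossing rate of one frame can be sketched in NumPy as the fraction of samples at which the sign changes. This is a simplified version, not librosa's code; `zero_crossing_rate_frame` and the toy frames are hypothetical.

```python
import numpy as np

def zero_crossing_rate_frame(frame):
    """Fraction of samples in the frame at which the signal changes sign."""
    signs = np.signbit(frame)
    # A crossing occurs wherever consecutive samples have different signs
    crossings = np.sum(signs[1:] != signs[:-1])
    return crossings / len(frame)

# Toy frames: one oscillating at the start, one constant
oscillating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
constant = np.ones(8)

zcr_oscillating = zero_crossing_rate_frame(oscillating)
zcr_constant = zero_crossing_rate_frame(constant)
print(zcr_oscillating, zcr_constant)
```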

From Zero Crossing Rate matrices to tensor#

X_zero_crossing_rate_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='zero_crossing_rate', hop_length=fp)

X_zero_crossing_rate_tensor

array([[[0.03857422, 0.04638672, 0.05566406, ..., 0.        ,
         0.        , 0.        ]],

       [[0.04882812, 0.05175781, 0.05566406, ..., 0.        ,
         0.        , 0.        ]],

       [[0.03125   , 0.03466797, 0.03759766, ..., 0.        ,
         0.        , 0.        ]],

       ...,

       [[0.02294922, 0.02587891, 0.02880859, ..., 0.        ,
         0.        , 0.        ]],

       [[0.00927734, 0.01074219, 0.01171875, ..., 0.        ,
         0.        , 0.        ]],

       [[0.01123047, 0.01220703, 0.015625  , ..., 0.        ,
         0.        , 0.        ]]])

X_zero_crossing_rate_tensor.shape

(240, 1, 4403)

From Zero Crossing Rate matrices to predictors matrix#

X_zero_crossing_rate_stats = get_X_audio_features(paths=files_df['path'], method='zero_crossing_rate', stats='mean-std', hop_length=fp)

X_zero_crossing_rate_stats

array([[0.13968108, 0.01811889],
       [0.12713428, 0.05410704],
       [0.16139104, 0.09023381],
       [0.13278552, 0.0150368 ],
       [0.06845982, 0.01230139],
       [0.12291739, 0.02418128],
       [0.11398559, 0.03383005],
       [0.04721614, 0.0100022 ],
       [0.06379654, 0.00702339],
       [0.03594709, 0.00396019],
       [0.06673346, 0.00343725],
       [0.03630212, 0.00146984],
       [0.06123172, 0.03862254],
       [0.05755557, 0.03619177],
       [0.11102839, 0.01048388],
       [0.06200357, 0.00509845],
       [0.04921273, 0.01197553],
       [0.06549767, 0.00737761],
       [0.05590522, 0.01658055],
       [0.12777993, 0.04488347],
       [0.11122282, 0.03122729],
       [0.05421168, 0.02222161],
       [0.02710364, 0.03646258],
       [0.0251287 , 0.0306713 ],
       [0.02515645, 0.02913848],
       [0.02123953, 0.02857775],
       [0.07877582, 0.04501191],
       [0.06794651, 0.0396902 ],
       [0.10538364, 0.03681736],
       [0.08328533, 0.03037691],
       [0.05997303, 0.02158588],
       [0.07708809, 0.01790234],
       [0.04999805, 0.0154705 ],
       [0.0608518 , 0.03226082],
       [0.05940261, 0.03613893],
       [0.03698103, 0.0348455 ],
       [0.04904827, 0.03648653],
       [0.03641628, 0.03814897],
       [0.0438722 , 0.02820492],
       [0.04142663, 0.03438384],
       [0.07770218, 0.04534396],
       [0.0969215 , 0.06244353],
       [0.02222664, 0.01578338],
       [0.02415738, 0.01778952],
       [0.02024907, 0.00672536],
       [0.02500724, 0.01242779],
       [0.01912175, 0.00606162],
       [0.05932331, 0.03690528],
       [0.06844134, 0.0444455 ],
       [0.05044686, 0.00393384],
       [0.05008215, 0.00904949],
       [0.02599162, 0.00591953],
       [0.04956218, 0.01153753],
       [0.04733304, 0.00612864],
       [0.12416814, 0.05707454],
       [0.11248489, 0.07045394],
       [0.0385546 , 0.00645496],
       [0.04952745, 0.00487907],
       [0.04086106, 0.0131053 ],
       [0.05485935, 0.00818578],
       [0.04154545, 0.007506  ],
       [0.07722042, 0.05328447],
       [0.0638752 , 0.04123212],
       [0.07812189, 0.01389549],
       [0.05926286, 0.03108077],
       [0.03692485, 0.02160651],
       [0.06672064, 0.00761959],
       [0.03954165, 0.01084108],
       [0.11126636, 0.04463073],
       [0.11067893, 0.03837935],
       [0.08972039, 0.01590184],
       [0.14395165, 0.02377742],
       [0.09803964, 0.01855576],
       [0.059948  , 0.00446716],
       [0.04035064, 0.00278837],
       [0.09153002, 0.06177192],
       [0.09320068, 0.04530613],
       [0.03140502, 0.01943803],
       [0.04172315, 0.01466549],
       [0.03991436, 0.01126458],
       [0.02519894, 0.00839441],
       [0.01699947, 0.00671597],
       [0.09296577, 0.04268649],
       [0.10410509, 0.04983582],
       [0.03064898, 0.01241794],
       [0.02730601, 0.01510759],
       [0.02487658, 0.01435425],
       [0.03393201, 0.01164079],
       [0.0239655 , 0.0050883 ],
       [0.06358676, 0.03265331],
       [0.06741337, 0.03683844],
       [0.05646525, 0.02461556],
       [0.06902146, 0.01508865],
       [0.04853818, 0.00593081],
       [0.05983517, 0.01235953],
       [0.05137608, 0.00635608],
       [0.0541583 , 0.0339362 ],
       [0.05042856, 0.0248835 ],
       [0.06452007, 0.00909583],
       [0.05463855, 0.00777592],
       [0.03005288, 0.00136787],
       [0.0474926 , 0.00606619],
       [0.02826783, 0.00242811],
       [0.05432893, 0.02864796],
       [0.06451069, 0.02921676],
       [0.06711172, 0.01126371],
       [0.05211474, 0.00967578],
       [0.02653931, 0.01364669],
       [0.05070375, 0.00636863],
       [0.02818509, 0.01122906],
       [0.0873379 , 0.05396432],
       [0.08751794, 0.04881271],
       [0.07529112, 0.00654264],
       [0.07020491, 0.01723818],
       [0.04758654, 0.01144851],
       [0.08261003, 0.00918995],
       [0.03713323, 0.00767515],
       [0.06179716, 0.02684996],
       [0.05616752, 0.02458286],
       [0.08429478, 0.02419465],
       [0.09040344, 0.04778768],
       [0.05908203, 0.04133373],
       [0.06453676, 0.04603129],
       [0.03852678, 0.01008504],
       [0.06419566, 0.04955133],
       [0.06270543, 0.03789536],
       [0.02649751, 0.00803221],
       [0.02576678, 0.01562968],
       [0.02386902, 0.00755392],
       [0.026354  , 0.00931457],
       [0.02286716, 0.00867855],
       [0.04174707, 0.02722968],
       [0.03848985, 0.02107703],
       [0.03408534, 0.02487225],
       [0.02655455, 0.0291366 ],
       [0.02094493, 0.00716199],
       [0.02362545, 0.01310268],
       [0.02340092, 0.0119515 ],
       [0.07906015, 0.04495811],
       [0.06888025, 0.0442522 ],
       [0.02407665, 0.01066833],
       [0.03110707, 0.00732316],
       [0.02085261, 0.00096723],
       [0.04219727, 0.04750712],
       [0.01970742, 0.00096511],
       [0.01981423, 0.00196725],
       [0.02229153, 0.00542349],
       [0.02281374, 0.00604543],
       [0.03097825, 0.0064902 ],
       [0.02132051, 0.00103194],
       [0.02956018, 0.00963568],
       [0.02231737, 0.00410791],
       [0.02439048, 0.00455354],
       [0.05000706, 0.02383377],
       [0.064298  , 0.03223356],
       [0.04895482, 0.00267739],
       [0.07360824, 0.01488585],
       [0.05343212, 0.00351828],
       [0.06551245, 0.00669362],
       [0.04941723, 0.00875284],
       [0.04785902, 0.03845829],
       [0.0199244 , 0.01894239],
       [0.03905646, 0.01865368],
       [0.04302701, 0.02121043],
       [0.0276773 , 0.01179262],
       [0.03782888, 0.01656477],
       [0.03066322, 0.01703873],
       [0.02207308, 0.01316066],
       [0.02220953, 0.01314401],
       [0.15428638, 0.01690589],
       [0.09630176, 0.02930831],
       [0.15509438, 0.09276395],
       [0.11556943, 0.02297215],
       [0.06960384, 0.00895991],
       [0.06418306, 0.01284583],
       [0.06860784, 0.00750579],
       [0.03513338, 0.00232827],
       [0.06745328, 0.00315454],
       [0.03571403, 0.00211246],
       [0.11620473, 0.00912301],
       [0.06388811, 0.00827902],
       [0.05453125, 0.02038243],
       [0.02789524, 0.0417326 ],
       [0.02776713, 0.02875004],
       [0.02412834, 0.03925213],
       [0.09714187, 0.03477876],
       [0.07630937, 0.02205347],
       [0.05836017, 0.01956413],
       [0.07699613, 0.01966548],
       [0.05165145, 0.01457113],
       [0.01814448, 0.00765121],
       [0.01896218, 0.00942553],
       [0.01808077, 0.00771513],
       [0.04567775, 0.00221061],
       [0.02458043, 0.00284532],
       [0.0265816 , 0.00321473],
       [0.04837032, 0.00380908],
       [0.02571981, 0.00106344],
       [0.04332938, 0.01047251],
       [0.04883743, 0.00245721],
       [0.03439034, 0.00389592],
       [0.05813668, 0.00572955],
       [0.03028121, 0.00294335],
       [0.05586397, 0.00980335],
       [0.06183207, 0.00775425],
       [0.04027009, 0.00794348],
       [0.06504198, 0.00434342],
       [0.03683134, 0.00313853],
       [0.10267928, 0.00872451],
       [0.14867247, 0.01975351],
       [0.1257674 , 0.03768157],
       [0.06207882, 0.00630047],
       [0.04324247, 0.0021459 ],
       [0.03616404, 0.00538445],
       [0.06617276, 0.01587198],
       [0.03377799, 0.00847464],
       [0.02790307, 0.00540516],
       [0.01475672, 0.00068555],
       [0.06856428, 0.00596784],
       [0.0527039 , 0.00594065],
       [0.03204028, 0.00491591],
       [0.05535936, 0.00240478],
       [0.02998272, 0.00337078],
       [0.09314423, 0.01300252],
       [0.04803051, 0.00247526],
       [0.02579498, 0.00435521],
       [0.04774683, 0.00245418],
       [0.03134067, 0.01968811],
       [0.07954392, 0.00361803],
       [0.08026746, 0.02211988],
       [0.02960512, 0.00139591],
       [0.07879223, 0.00932616],
       [0.03649997, 0.00809681],
       [0.08429478, 0.02419465],
       [0.09040344, 0.04778768],
       [0.05908203, 0.04133373],
       [0.06453676, 0.04603129],
       [0.03852678, 0.01008504],
       [0.0267845 , 0.00542232],
       [0.02100207, 0.00238617]])

X_zero_crossing_rate_stats.shape

(240, 2)

Tempogram#

A tempogram is a time-tempo representation that shows how the tempo of a piece of music, or of any audio signal, varies over time. It is a two-dimensional feature that maps tempo changes over time, offering a detailed view of the rhythmic dynamics within the audio. This analysis is useful for understanding structure and expression in music, as well as articulation in speech or other sounds.

# Feature extraction with tempogram
x_tempogram = librosa.feature.tempogram(y=audio_signal_1, hop_length=fp).T

x_tempogram

array([[ 1.00000000e+00,  9.41933158e-01,  8.76072103e-01, ...,
         1.59134570e-17, -3.85780776e-17,  6.93578724e-17],
       [ 1.00000000e+00,  9.42431885e-01,  8.76818837e-01, ...,
        -7.22920084e-17, -5.24033103e-17, -2.94649892e-17],
       [ 1.00000000e+00,  9.42927130e-01,  8.77559350e-01, ...,
         2.42083309e-17,  4.45072418e-17,  6.77565953e-17],
       ...,
       [ 1.00000000e+00,  9.82748236e-01,  9.40765518e-01, ...,
         1.59796469e-13,  1.67667967e-14, -6.19583679e-17],
       [ 1.00000000e+00,  9.82782184e-01,  9.40875757e-01, ...,
         1.23167723e-13,  1.18727972e-14, -2.76223874e-17],
       [ 1.00000000e+00,  9.82816140e-01,  9.40986261e-01, ...,
         8.18619518e-14,  6.58631321e-15, -5.36289546e-17]])

x_tempogram.shape

(1692, 384)
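
The 384 columns are consistent with the default `win_length=384` of `librosa.feature.tempogram`, with one column per lag. The idea behind an autocorrelation tempogram can be sketched as a local autocorrelation of an onset-strength envelope. The code below is a simplified illustration under that assumption, not librosa's implementation; `local_autocorrelation` and the synthetic envelope are hypothetical.

```python
import numpy as np

def local_autocorrelation(onset_env, win_length):
    """Return a (n_frames, win_length) array whose row t is the autocorrelation
    of a Hann-windowed segment centered near frame t, scaled so lag 0 equals 1."""
    padded = np.pad(onset_env, win_length // 2, mode="constant")
    window = np.hanning(win_length)
    rows = []
    for t in range(len(onset_env)):
        segment = padded[t:t + win_length] * window
        # Full autocorrelation; keep only non-negative lags 0 .. win_length - 1
        ac = np.correlate(segment, segment, mode="full")[win_length - 1:]
        rows.append(ac / ac[0] if ac[0] > 0 else ac)
    return np.array(rows)

# Synthetic onset envelope: one pulse every 4 frames, 64 frames in total
toy_onset_env = np.tile([1.0, 0.0, 0.0, 0.0], 16)
toy_tempogram = local_autocorrelation(toy_onset_env, win_length=16)
print(toy_tempogram.shape)
```

In this toy example the autocorrelation is large at lag 4, the pulse period, and zero at lag 2, which is how periodic rhythm shows up as ridges along the lag axis. The first column is always 1, matching the leading `1.00000000e+00` values in the output above.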

From Tempogram matrices to tensor#

X_tempogram_tensor = get_X_tensor_audio_features(paths=files_df['path'], method='tempogram', sr=fs, hop_length=fp)

X_tempogram_tensor

array([[[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.74025825e-01,  9.73998168e-01,  9.73970572e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.24148727e-01,  9.23955753e-01,  9.23763154e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-1.28282802e-17,  1.08569700e-16,  1.05896581e-16, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-5.90230469e-17,  2.06738282e-17,  7.99315540e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 1.64078887e-17,  3.83782012e-17,  8.48954562e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.53294547e-01,  9.53346317e-01,  9.53398304e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 8.48126253e-01,  8.48283160e-01,  8.48440825e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [ 4.15936158e-17,  2.20724551e-17,  3.39125953e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 4.23709512e-17,  1.30717984e-17,  1.46523866e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 6.67027846e-17,  3.10240619e-17,  4.69924712e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.64979117e-01,  9.65065593e-01,  9.65152299e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.16047076e-01,  9.16234321e-01,  9.16422126e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [ 4.93275170e-17, -2.60084241e-17,  9.42100486e-19, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 6.90084451e-17,  5.23157957e-18, -4.78983721e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 2.60784817e-17,  5.12694797e-17, -1.12246315e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       ...,

       [[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.82744870e-01,  9.82783156e-01,  9.82820940e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.46394249e-01,  9.46493980e-01,  9.46592393e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [ 4.48524823e-17, -6.33349599e-17,  2.18787322e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-5.60620420e-17, -1.08045290e-16, -3.81772827e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-6.32051096e-19, -1.12831175e-16, -3.46585914e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.42840089e-01,  9.43384184e-01,  9.43925014e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 8.96773308e-01,  8.97585307e-01,  8.98392989e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-1.00166280e-16, -1.63361461e-16, -3.34007391e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-8.14673642e-17, -1.15189864e-16, -6.16683675e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-7.50167651e-17, -7.13336301e-17, -5.65139450e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

       [[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.51249855e-01,  9.51799326e-01,  9.52344081e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [ 9.10201027e-01,  9.11061540e-01,  9.11915205e-01, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        ...,
        [-1.94385523e-16,  9.77787920e-17,  3.71273595e-18, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-1.31127758e-16,  1.35390064e-16, -9.61137305e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
        [-1.75107854e-16,  1.55994476e-16, -3.74720293e-17, ...,
          0.00000000e+00,  0.00000000e+00,  0.00000000e+00]]])

X_tempogram_tensor.shape

(240, 384, 4403)

From Tempogram matrices to predictors matrix#

X_tempogram_stats = get_X_audio_features(paths=files_df['path'], method='tempogram', stats='mean-std', sr=fs, hop_length=fp)

X_tempogram_stats

array([[1.00000000e+00, 9.82917312e-01, 9.40125605e-01, ...,
        1.97084272e-10, 2.59539915e-11, 9.25080531e-17],
       [1.00000000e+00, 9.82942999e-01, 9.42317549e-01, ...,
        1.96826479e-10, 2.51983908e-11, 9.82804325e-17],
       [1.00000000e+00, 9.88717836e-01, 9.62690178e-01, ...,
        2.05286095e-10, 2.61386952e-11, 1.08429054e-16],
       ...,
       [1.00000000e+00, 9.84447901e-01, 9.48210334e-01, ...,
        1.78410464e-11, 2.02061978e-12, 8.59678112e-17],
       [1.00000000e+00, 9.88605317e-01, 9.69314379e-01, ...,
        1.18753587e-10, 1.49967598e-11, 1.25303448e-16],
       [1.00000000e+00, 9.90035415e-01, 9.72834162e-01, ...,
        1.21425552e-10, 1.53642920e-11, 1.30194034e-16]])

X_tempogram_stats.shape

(240, 768)
