{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Inner Evaluation 1**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Evaluation of many feature extraction methods along with Random Forest**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part of the project we search for the best combination of feature extraction methods and statistics for classifying the Parkinson's disease level.\n", "\n", "To select the best alternative we apply a classical Machine Learning methodology called inner evaluation.\n", "\n", "The basic idea is to split the data into train and test partitions (the outer split) and to use only the training partition to select the best alternative from a predictive perspective.\n", "This is done with K-Fold Cross Validation, a robust and accurate method for estimating the predictive error of a model on new data (validation data), together with grid search methods to optimize the model's hyper-parameters.\n", "\n", "Throughout the project, inner evaluation is applied iteratively (in several rounds) in order to find a good enough model for our classification problem.\n", "\n", "This is the first round of the inner evaluation. It aims to find the feature extraction methods that seem to work best for our classification task, testing all of them with Random Forest.\n", "\n", "We have selected Random Forest for this evaluation because it is a model that usually works well with tabular data, in both regression and classification problems." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Requirements**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import polars as pl\n", "import sys\n", "import pickle\n", "import matplotlib.pyplot as plt\n", "from sklearn.ensemble import RandomForestClassifier\n", "import seaborn as sns\n", "sns.set_style('whitegrid')\n", "from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score\n", "from itertools import combinations\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sys.path.insert(0, r'C:\\Users\\fscielzo\\Documents\\Packages\\PyML_Package_Private')\n", "from PyML.evaluation import SimpleEvaluation" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sys.path.insert(0, r'C:\\Users\\fscielzo\\Documents\\Packages\\PyAudio_Package_Private')\n", "from PyAudio import get_X_audio_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Response and Predictors definition**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we define the data to be used. Specifically, we define the response variable and a set of predictor matrices to be used as different alternatives, each one associated with a combination of feature extraction methods and statistics." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "files_list_name = 'Files_List.txt'\n", "files_df = pl.read_csv(files_list_name, separator='\\t', has_header=False, new_columns=['path', 'level'])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
path | level |
---|---|
str | i64 |
"PDSpeechData/l… | 0 |
"PDSpeechData/l… | 0 |
"PDSpeechData/l… | 0 |