Applying inner evaluation#

Requirements#

Here we gather the required libraries, classes and functions for this notebook.

import os
import sys
import numpy as np
import polars as pl
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

PyImageML is a Python package developed as part of this project. It provides utilities for plotting images and for extracting features from them; those features can later be used with Machine Learning algorithms to solve typical ML tasks.

sys.path.insert(0, r"C:\Users\fscielzo\Documents\Packages\PyImageML_Package_Private")
from PyImageML.preprocessing import ImageFeaturesExtraction
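As a purely illustrative sketch of how this package is used (hypothetical file names; ImageFeaturesExtraction is assumed to follow the scikit-learn fit/transform protocol, since it is used as a Pipeline step later on):

# Illustrative only: extracting pixel-based features from two hypothetical images.
example_paths = ['some_folder/image_1.jpg', 'some_folder/image_2.jpg']  # hypothetical paths
sobel_like_mask = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

extractor = ImageFeaturesExtraction(method='pixels', image_height=240, image_width=184, convert_to_gray=True,
                                    filter='equalized', weights=sobel_like_mask, format='array', orientations=8,
                                    pixels_per_cell=(8, 8), cells_per_block=(2, 2), transform_sqrt=True,
                                    reshape=False, statistics=None, n_clusters=100)

extractor.fit(example_paths)
example_features = extractor.transform(example_paths)  # one feature vector per image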

PyMachineLearning is another custom Python package that contains efficient utilities for real Machine Learning workflows.

sys.path.insert(0, r'C:\Users\fscielzo\Documents\Packages\PyMachineLearning_Package_Private')
from PyMachineLearning.evaluation import SimpleEvaluation
from PyMachineLearning.preprocessing import scaler, pca

Reading the data#

In this section we read the data and apply some light processing.

  • files_list.txt is a txt file with two ‘columns’: the first contains the original paths of the images in the data set, and the second contains the class of each image (neutral (0) or fire (1)).

  • We read files_list.txt as a data-frame.

  • We extract the names of the image files.

  • We build a list with the new paths of the images.

# Extracting the names of the image files as well as their class/category.
files_list_name = r'C:\Users\fscielzo\Documents\DataScience-GitHub\Image Analysis\Image-Classification\Fire-Detection\files_list.txt'
files_df = pl.read_csv(files_list_name, separator='\t', has_header=False, new_columns=['path', 'class'])
img_files_names = [files_df['path'][i].split('/')[1] for i in range(len(files_df))]

# building a list with the current paths of the data-set images.
img_path_list = []
folder_path = r'C:\Users\fscielzo\Documents\DataScience-GitHub\Image Analysis\Image-Classification\Fire-Detection\Data'
for filename in img_files_names:
    img_path_list.append(os.path.join(folder_path, filename))

Defining Response and Predictors#

In this section we define the response and predictors.

  • Predictors: a list with the paths of the image files.

  • Response: a vector (1D array) that identifies the category of each image.

Y = files_df['class'].to_numpy()
X = img_path_list 

Defining the outer validation method#

In this section we define the validation method to be used in the outer evaluation.

The outer evaluation consists of estimating the future performance of an ML model, that is, measuring how well the model will predict new data. This evaluation is usually carried out only on the best model (or pipeline) according to the inner evaluation.

The validation method is the procedure used to estimate that performance; in this case a train-test split (also known as hold-out or simple validation) will be used.

Train-Test Split#

We randomly split the predictors X and the response Y into two sets: a training set (75%) and a test set (25%).

The training set will be used in the training phase, also known as inner evaluation, and the test set in the testing phase, also known as outer evaluation.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=123, stratify=Y)
X_train[0:5]
['C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_121.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_131.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_194.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_34.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_92.jpg']
Y_train[0:5]
array([1, 1, 1, 0, 0], dtype=int64)
X_test[0:5]
['C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_93.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_248.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_270.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_3.jpg',
 'C:\\Users\\fscielzo\\Documents\\DataScience-GitHub\\Image Analysis\\Image-Classification\\Fire-Detection\\Data\\image_18.jpg']
Y_test[0:5]
array([0, 0, 1, 0, 0], dtype=int64)

Defining the inner validation method#

In this section we define the validation method for the inner evaluation.

The inner evaluation consists of comparing different ML alternatives (also known as pipelines) and selecting the best one based on their predictive performance. This whole procedure is part of the training phase, so it is carried out using only the training set.

Once the best pipeline is selected, the outer evaluation is applied to it in order to estimate its future performance.

In this project we use k-fold cross-validation as the validation method for the inner evaluation.

KFold Cross Validation#

We define a stratified KFold CV with 4 folds and random shuffling, since it is much more precise than simple validation (hold-out).

inner = StratifiedKFold(n_splits=4, shuffle=True, random_state=123)
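As a quick, purely illustrative check (not part of the original workflow), the following snippet prints the class proportions within each validation fold; thanks to stratification they should closely match the overall proportions in Y_train.

# Illustrative check: class balance within each of the 4 validation folds.
for k, (train_idx, val_idx) in enumerate(inner.split(X_train, Y_train)):
    classes, counts = np.unique(Y_train[val_idx], return_counts=True)
    proportions = {int(c): round(n / len(val_idx), 2) for c, n in zip(classes, counts)}
    print(f'Fold {k}: validation size = {len(val_idx)}, class proportions = {proportions}')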

Defining the pipelines#

In this section we define the ML pipelines that will be tested throughout the project.

Each pipeline is a combination of preprocessing steps (transformers) plus a model (estimator), applied sequentially from the first step to the last.

# Here we define some parameters to be used with ImageFeaturesExtraction:

CELLS_PER_BLOCK_HOR = 2
CELLS_PER_BLOCK_VER = 2
PIXELS_PER_CELL_HOR = 8
PIXELS_PER_CELL_VER = 8
orientations = 8
pixels_per_cell=(PIXELS_PER_CELL_HOR, PIXELS_PER_CELL_VER)
cells_per_block=(CELLS_PER_BLOCK_HOR, CELLS_PER_BLOCK_VER)

mask_3 = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

img_height = 240
img_width = 184
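As a rough orientation, and assuming a skimage-style HOG implementation underneath ImageFeaturesExtraction (an assumption about its internals), the following back-of-the-envelope calculation shows how many HOG values a single image would yield with these settings, which illustrates why a PCA step is useful to reduce dimensionality.

# Back-of-the-envelope HOG descriptor length for the settings above (illustrative).
n_cells_ver = img_height // PIXELS_PER_CELL_VER        # 240 // 8 = 30
n_cells_hor = img_width // PIXELS_PER_CELL_HOR         # 184 // 8 = 23
n_blocks_ver = n_cells_ver - CELLS_PER_BLOCK_VER + 1   # 29
n_blocks_hor = n_cells_hor - CELLS_PER_BLOCK_HOR + 1   # 22
n_hog_values = n_blocks_ver * n_blocks_hor * CELLS_PER_BLOCK_VER * CELLS_PER_BLOCK_HOR * orientations
print(n_hog_values)  # 20416 values per image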

Let’s define the pipelines:

pipelines = {} 

models = {'knn': KNeighborsClassifier(n_jobs=-1), 
          'trees': DecisionTreeClassifier(random_state=123), 
          'extra_trees': ExtraTreesClassifier(random_state=123),
          'RF': RandomForestClassifier(random_state=123), 
          'HGB': HistGradientBoostingClassifier(random_state=123), 
          'MLP': MLPClassifier(random_state=123),
          'LinearSVM': LinearSVC(random_state=123),  
          'XGB': XGBClassifier(random_state=123),
          'Logistic': LogisticRegression(max_iter=250, solver='saga', random_state=123),
          'LGBM': LGBMClassifier(random_state=123, verbose=-1),
          'SVM': SVC(random_state=123)
}

for model_name, model in models.items():

    
    pipelines[model_name] = Pipeline([
                ('feature_extraction', ImageFeaturesExtraction(method='pixels', image_height=img_height, image_width=img_width, convert_to_gray=True, 
                                                               filter='equalized', weights=mask_3, format='array', orientations=orientations, pixels_per_cell=pixels_per_cell, 
                                                               cells_per_block=cells_per_block, transform_sqrt=True, reshape=False, statistics=None, n_clusters=100)),
                ('scaler', scaler(apply=False, method='standard')),
                ('pca', pca(apply=False, n_components=5, random_state=123)),
                (model_name, model) 
            ])
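Note that the hyper-parameters of a pipeline are addressed with scikit-learn's '<step>__<parameter>' naming convention, which is what all the param_grid_* functions below rely on (this assumes the custom scaler and pca transformers follow the standard get_params/set_params protocol). A small illustrative check:

# Illustrative: pipeline steps and the '<step>__<param>' addressing used by the HPO grids.
print([name for name, _ in pipelines['knn'].steps])  # ['feature_extraction', 'scaler', 'pca', 'knn']
pipelines['knn'].set_params(scaler__apply=True, scaler__method='standard', knn__n_neighbors=5)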

Applying inner evaluation with pixels features#

In this section, inner evaluation is applied considering only the pixels method for feature extraction.

We could run a more general inner evaluation in which the feature extraction method is itself a hyper-parameter to optimize. Instead, we prefer to run a more exhaustive hyper-parameter search for each of the three feature extraction methods addressed in this project, since we consider this a better way to reach a stronger model. It also helps us understand how the different parameters and alternatives work with each feature extraction method, so we can assess more precisely how those parameters affect each of them.

Grids for HPO#

A possible general grid for the preprocessing part.

If we had followed a more general HPO in which the feature extraction method was another alternative to explore, instead of being fixed, the following grid could have been used.

'''
def preprocessing_param_grid(trial):

    # Fixed Grid
    param_grid = {
        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['pixels', 'HOG', 'CNN']),
        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),
        'pca__apply': trial.suggest_categorical('pca__apply', [True, False])
    }

    # Conditioned Grid
     
    ################################################################################################################ 
    if param_grid['scaler__apply'] == True:

        param_grid.update({'scaler__method': trial.suggest_categorical('scaler__method', ['standard', 'min-max'])})

    ################################################################################################################
    if param_grid['pca__apply'] == True:

        param_grid.update({'pca__n_components': trial.suggest_int('pca__n_components', 2, 80)})

    ################################################################################################################
    if param_grid['feature_extraction__method'] != 'CNN': # Filters are not allowed with the CNN method in our implementation (at least yet).

        param_grid.update({'feature_extraction__filter': trial.suggest_categorical('feature_extraction__filter', [None, 'equalized', 'sobel', 'canny'])})

    ################################################################################################################
    if param_grid['feature_extraction__method'] == 'pixels':

        param_grid.update({'feature_extraction__convert_to_gray': trial.suggest_categorical('feature_extraction__convert_to_gray', [True, False])})
   
    ################################################################################################################
    if param_grid['feature_extraction__method'] == 'HOG':

        param_grid.update({'feature_extraction__reshape': trial.suggest_categorical('feature_extraction__reshape', [True, False])})

        if param_grid['feature_extraction__reshape'] == False:

            param_grid.update({'feature_extraction__statistics': trial.suggest_categorical('feature_extraction__statistics', ['BVW', 
                                                                                                                            'mean', 'mean-std',
                                                                                                                            'mean-median-std',
                                                                                                                            'mean-Q25-median-Q75-std'])})

            if param_grid['feature_extraction__statistics'] == 'BVW':

                param_grid.update({'feature_extraction__n_clusters': trial.suggest_int('feature_extraction__n_clusters', 2, 100)})

        else:

            param_grid.update({'pca__apply': trial.suggest_categorical('pca__apply', [True])})

    return param_grid
'''
"\ndef preprocessing_param_grid(trial):\n\n    # Fixed Grid\n    param_grid = {\n        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['pixels', 'HOG', 'CNN']),\n        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),\n        'pca__apply': trial.suggest_categorical('pca__apply', [True, False])\n    }\n\n    # Conditioned Grid\n     \n    ################################################################################################################ \n    if param_grid['scaler__apply'] == True:\n\n        param_grid.update({'scaler__method': trial.suggest_categorical('scaler__method', ['standard', 'min-max'])})\n\n    ################################################################################################################\n    if param_grid['pca__apply'] == True:\n\n        param_grid.update({'pca__n_components': trial.suggest_int('pca__n_components', 2, 80)})\n\n    ################################################################################################################\n    if param_grid['feature_extraction__method'] != 'CNN': # We CNN filters are not allowed in our implementations (at least yet).\n\n        param_grid.update({'feature_extraction__filter': trial.suggest_categorical('feature_extraction__filter', [None, 'equalized', 'sobel', 'canny'])})\n\n    ################################################################################################################\n    if param_grid['feature_extraction__method'] == 'pixels':\n\n        param_grid.update({'feature_extraction__convert_to_gray': trial.suggest_categorical('feature_extraction__convert_to_gray', [True, False])})\n   \n    ################################################################################################################\n    if param_grid['feature_extraction__method'] == 'HOG':\n\n        if param_grid['feature_extraction__reshape'] == False:\n\n            param_grid.update({'feature_extraction__statistics': trial.suggest_categorical('feature_extraction__statistics', ['BVW', \n                                                                                                                            'mean', 'mean-std',\n                                                                                                                            'mean-median-std',\n                                                                                                                            'mean-Q25-median-Q75-std'])})\n\n            if param_grid['feature_extraction__statistics'] == 'BVW':\n\n                param_grid.update({'feature_extraction__n_clusters': trial.suggest_int('feature_extraction__n_clusters', 2, 100)})\n\n        else:\n\n            param_grid.update({'pca__apply': trial.suggest_categorical('pca__apply', [True])})\n\n    return param_grid\n"

Defining the preprocessing grid for hyper-parameter optimization (HPO), with the feature extraction method fixed to pixels.

def preprocessing_pixels_param_grid(trial):

    # Fixed Grid
    param_grid = {
        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['pixels']),
        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),
        'pca__apply': trial.suggest_categorical('pca__apply', [True]) # To avoid high-dimensionality on p (num. features)
    }

    # Conditioned Grid
     
    ################################################################################################################ 
    if param_grid['scaler__apply'] == True:

        param_grid.update({'scaler__method': trial.suggest_categorical('scaler__method', ['standard', 'min-max'])})

    ################################################################################################################
    if param_grid['pca__apply'] == True:

        param_grid.update({'pca__n_components': trial.suggest_int('pca__n_components', 2, 150)})

    ################################################################################################################
    if param_grid['feature_extraction__method'] != 'CNN': # Filters are not allowed with the CNN method in our implementation (at least yet).

        param_grid.update({'feature_extraction__filter': trial.suggest_categorical('feature_extraction__filter', [None, 'equalized', 'sobel', 'canny',
                                                                                                                  'convolve', 'hessian', 'prewitt'])})

    ################################################################################################################
    if param_grid['feature_extraction__method'] == 'pixels':

        param_grid.update({'feature_extraction__convert_to_gray': trial.suggest_categorical('feature_extraction__convert_to_gray', [True, False])})

    return param_grid

Defining grids for Machine Learning models.

def param_grid_knn_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'knn__n_neighbors': trial.suggest_int('knn__n_neighbors', 1, 25),
        'knn__metric': trial.suggest_categorical('knn__metric', ['cosine', 'minkowski', 'cityblock'])
    })

    if param_grid['knn__metric'] == 'minkowski':
        param_grid['knn__p'] = trial.suggest_int('knn__p', 1, 4)

    return param_grid
def param_grid_trees_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'trees__max_depth': trial.suggest_categorical('trees__max_depth', [None, 2, 5, 7, 10, 20, 30]),
        'trees__min_samples_split': trial.suggest_int('trees__min_samples_split', 2, 25),
        'trees__min_samples_leaf': trial.suggest_int('trees__min_samples_leaf', 2, 25),
        'trees__splitter': trial.suggest_categorical('trees__splitter', ['best', 'random']),
        'trees__criterion': trial.suggest_categorical('trees__criterion', ['log_loss', 'gini', 'entropy'])
    })

    return param_grid
def param_grid_extra_trees_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'extra_trees__n_estimators': trial.suggest_categorical('extra_trees__n_estimators', [30, 50, 75, 100, 120]),
        'extra_trees__max_depth': trial.suggest_categorical('extra_trees__max_depth', [3, 5, 7, 10, 20, 30]),
        'extra_trees__min_samples_split': trial.suggest_int('extra_trees__min_samples_split', 2, 20),
        'extra_trees__min_samples_leaf': trial.suggest_int('extra_trees__min_samples_leaf', 2, 20),
        'extra_trees__criterion': trial.suggest_categorical('extra_trees__criterion', ['gini']),
        'extra_trees__max_features': trial.suggest_categorical('extra_trees__max_features', [0.7, 0.8, 0.9, 1.0])
    })
    
    return param_grid
def param_grid_HGB_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'HGB__max_depth': trial.suggest_categorical('HGB__max_depth', [5, 10, 20, 30, 40, 50]),
        'HGB__l2_regularization': trial.suggest_float('HGB__l2_regularization', 0.01, 0.7, log=True),
        'HGB__max_iter': trial.suggest_categorical('HGB__max_iter', [50, 70, 100, 130, 150])
    })

    return param_grid
def param_grid_XGB_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'XGB__max_depth': trial.suggest_categorical('XGB__max_depth', [10, 20, 30, 40, 50, 70, 100]),
        'XGB__reg_lambda': trial.suggest_float('XGB__reg_lambda', 0, 1, step=0.05, log=False),
        'XGB__n_estimators': trial.suggest_categorical('XGB__n_estimators', [50, 70, 100, 130, 150]),
        'XGB__eta': trial.suggest_float('XGB__eta', 0, 0.3, step=0.02, log=False),
        'XGB__alpha': trial.suggest_float('XGB__alpha', 0.2, 1, step=0.01, log=False)
    })

    return param_grid
def param_grid_RF_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'RF__n_estimators': trial.suggest_categorical('RF__n_estimators', [30, 50, 75, 100, 120, 150, 200, 250]),
        'RF__max_depth': trial.suggest_categorical('RF__max_depth', [3, 4, 5, 7, 10, 20, 30]),
        'RF__min_samples_split': trial.suggest_int('RF__min_samples_split', 2, 20),
        'RF__min_samples_leaf': trial.suggest_int('RF__min_samples_leaf', 2, 20),
        'RF__criterion': trial.suggest_categorical('RF__criterion', ['gini', 'entropy']),
    })
    
    return param_grid
def param_grid_linear_SVM_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'LinearSVM__C': trial.suggest_float('LinearSVM__C', 0.001, 2, log=True),
        'LinearSVM__class_weight': trial.suggest_categorical('LinearSVM__class_weight', ['balanced', None])
    })

    return param_grid
def param_grid_MLP_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'MLP__learning_rate_init': trial.suggest_float('MLP__learning_rate_init', 0.0001, 0.2, log=True),
        'MLP__alpha': trial.suggest_float('MLP__alpha', 0.01, 1, log=True)
    })

    return param_grid
def param_grid_logistic_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'Logistic__penalty':  trial.suggest_categorical('Logistic__penalty', ['l1', 'l2', 'elasticnet', None]),
        'Logistic__C': trial.suggest_float('Logistic__C', 0.001, 2, log=True),
        'Logistic__class_weight': trial.suggest_categorical('Logistic__class_weight', ['balanced', None])
    })

    if param_grid['Logistic__penalty'] == 'elasticnet':
        param_grid.update({'Logistic__l1_ratio': trial.suggest_float('Logistic__l1_ratio', 0.1, 1, log=True)})

    return param_grid
def param_grid_SVM_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'SVM__C': trial.suggest_float('SVM__C', 0.1, 5, log=True),
        'SVM__kernel': trial.suggest_categorical('SVM__kernel', ['poly', 'rbf', 'sigmoid']),
    })

    if param_grid['SVM__kernel'] == 'poly':

        param_grid.update({
            'SVM__degree': trial.suggest_int('SVM__degree', 1, 5)
        })

    return param_grid
def param_grid_LGBM_pixels(trial):

    param_grid = preprocessing_pixels_param_grid(trial)

    param_grid.update({
        'LGBM__max_depth': trial.suggest_int('LGBM__max_depth', 2, 200),
        'LGBM__num_leaves': trial.suggest_int('LGBM__num_leaves', 2, 200),
        'LGBM__n_estimators': trial.suggest_categorical('LGBM__n_estimators', [30, 50, 70, 100, 120, 150, 180, 200, 250, 300]),
        'LGBM__learning_rate': trial.suggest_float('LGBM__learning_rate', 0.0001, 0.1, log=True),
        'LGBM__lambda_l1': trial.suggest_float('LGBM__lambda_l1', 0.001, 1, log=True),
        'LGBM__lambda_l2': trial.suggest_float('LGBM__lambda_l2', 0.001, 1, log=True),
        'LGBM__min_split_gain': trial.suggest_float('LGBM__min_split_gain', 0.001, 0.01, log=True),
        'LGBM__min_child_weight': trial.suggest_int('LGBM__min_child_weight', 5, 60),
        'LGBM__feature_fraction': trial.suggest_float('LGBM__feature_fraction', 0.1, 0.9, step=0.05)
    })

    return param_grid
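Since these grids are plain functions of an Optuna trial, they can also be exercised on their own to see which hyper-parameters are suggested together (for instance, knn__p only appears when the Minkowski metric is drawn). A purely illustrative way to sample a few configurations with a bare Optuna study, without training any model:

# Illustrative only: sample a few configurations from the conditional kNN grid.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

def inspect_grid(trial):
    param_grid_knn_pixels(trial)  # registers the conditional suggestions on the trial
    return 0.0                    # dummy objective value; no model is trained

study = optuna.create_study(direction='maximize')
study.optimize(inspect_grid, n_trials=3)
for t in study.trials:
    print(t.params)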

Hyper-parameter Optimization (HPO)#

inner_score, best_params, inner_results = {}, {}, {}
model_name = 'knn'
dict_name = 'knn-pixels'
param_grid = param_grid_knn_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
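After each search, the fitted SimpleEvaluation object exposes the best cross-validated score and the corresponding hyper-parameters, which we store in the dictionaries above. A quick way to inspect the kNN results just obtained (the exact values depend on the run):

# Illustrative: inspect the inner CV results stored for the kNN pipeline.
print('Best inner CV accuracy (knn-pixels):', inner_score['knn-pixels'])
print('Best hyper-parameters (knn-pixels):', best_params['knn-pixels'])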
model_name = 'trees'
dict_name = 'trees-pixels'
param_grid = param_grid_trees_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'extra_trees'
dict_name = 'extra-trees-pixels'
param_grid = param_grid_extra_trees_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'HGB'
dict_name = 'HGB-pixels'
param_grid = param_grid_HGB_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'RF'
dict_name = 'RF-pixels'
param_grid = param_grid_RF_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'XGB'
dict_name = 'XGB-pixels'
param_grid = param_grid_XGB_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'Logistic'
dict_name = 'Logistic-pixels'
param_grid = param_grid_logistic_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LinearSVM'
dict_name = 'LinearSVM-pixels'
param_grid = param_grid_linear_SVM_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'SVM'
dict_name = 'SVM-pixels'
param_grid = param_grid_SVM_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LGBM'
dict_name = 'LGBM-pixels'
param_grid = param_grid_LGBM_pixels

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=40, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
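Once all the pixel-based searches have run, the collected inner_score dictionary can be used to rank the pipelines by their inner CV accuracy (purely illustrative; as explained above, the best pipeline is the candidate for the outer evaluation):

# Illustrative: rank the pixel-based pipelines by inner CV accuracy.
pixels_scores = {name: score for name, score in inner_score.items() if 'pixels' in name}
for name, score in sorted(pixels_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')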

Applying inner evaluation with HOG features#

In this section, inner evaluation is applied considering only the HOG method for feature extraction.

As in the previous section, we prefer an exhaustive hyper-parameter search for each specific feature extraction method rather than treating the method itself as a hyper-parameter to optimize, for the reasons already discussed.

Grids for HPO#

Defining the preprocessing grids for hyper-parameter optimization (HPO), with the feature extraction method fixed to HOG. We split the exploration in two parts: one assuming a reshaped HOG feature vector and another without reshaping.

def preprocessing_HOG_reshaped_param_grid(trial):

    # Fixed Grid
    param_grid = {
        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['HOG']),
        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),
        'pca__apply': trial.suggest_categorical('pca__apply', [True, False])
    }

    # Conditioned Grid
     
    ################################################################################################################ 
    if param_grid['scaler__apply'] == True:

        param_grid.update({'scaler__method': trial.suggest_categorical('scaler__method', ['standard', 'min-max'])})

    ################################################################################################################
    if param_grid['pca__apply'] == True:

        param_grid.update({'pca__n_components': trial.suggest_int('pca__n_components', 2, 25)})

    ################################################################################################################
    if param_grid['feature_extraction__method'] != 'CNN': # Filters are not allowed with the CNN method in our implementation (at least yet).

        param_grid.update({'feature_extraction__filter': trial.suggest_categorical('feature_extraction__filter', [None, 'equalized', 'sobel', 'canny',
                                                                                                                  'hessian', 'prewitt'])})

    ################################################################################################################
    if param_grid['feature_extraction__method'] == 'HOG':

        param_grid.update({'feature_extraction__reshape': trial.suggest_categorical('feature_extraction__reshape', [True])}) # Forcing reshape to be true

        param_grid.update({'feature_extraction__statistics': trial.suggest_categorical('feature_extraction__statistics', ['mean', 'mean-std',
                                                                                                                          'mean-median-std',
                                                                                                                          'mean-Q25-median-Q75-std'])})

        if param_grid['feature_extraction__statistics'] == 'BVW':

            param_grid.update({'feature_extraction__n_clusters': trial.suggest_int('feature_extraction__n_clusters', 50, 100)})

    return param_grid
def preprocessing_HOG_not_reshaped_param_grid(trial):

    # Fixed Grid
    param_grid = {
        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['HOG']),
        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),
        'pca__apply': trial.suggest_categorical('pca__apply', [True]) # Forcing PCA to combat high dimensionality on p due to not reshaped HOG features
    }

    # Conditioned Grid
     
    ################################################################################################################ 
    if param_grid['scaler__apply'] == True:

        param_grid.update({'scaler__method': trial.suggest_categorical('scaler__method', ['standard', 'min-max'])})

    ################################################################################################################
    if param_grid['pca__apply'] == True:

        param_grid.update({'pca__n_components': trial.suggest_int('pca__n_components', 2, 150)})

    ################################################################################################################
    if param_grid['feature_extraction__method'] != 'CNN': # Filters are not allowed with the CNN method in our implementation (at least yet).

        param_grid.update({'feature_extraction__filter': trial.suggest_categorical('feature_extraction__filter', [None, 'equalized', 'sobel', 'canny',
                                                                                                                  'hessian', 'prewitt'])})

    ################################################################################################################
    if param_grid['feature_extraction__method'] == 'HOG':

        param_grid.update({'feature_extraction__reshape': trial.suggest_categorical('feature_extraction__reshape', [False])}) # Forcing reshape to be False

    return param_grid

Defining grids for Machine Learning models.

def param_grid_knn_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'knn__n_neighbors': trial.suggest_int('knn__n_neighbors', 1, 25),
        'knn__metric': trial.suggest_categorical('knn__metric', ['cosine', 'minkowski', 'cityblock'])
    })

    if param_grid['knn__metric'] == 'minkowski':
        param_grid['knn__p'] = trial.suggest_int('knn__p', 1, 4)

    return param_grid
def param_grid_knn_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'knn__n_neighbors': trial.suggest_int('knn__n_neighbors', 1, 25),
        'knn__metric': trial.suggest_categorical('knn__metric', ['cosine', 'minkowski', 'cityblock'])
    })

    if param_grid['knn__metric'] == 'minkowski':
        param_grid['knn__p'] = trial.suggest_int('knn__p', 1, 4)

    return param_grid
def param_grid_trees_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'trees__max_depth': trial.suggest_categorical('trees__max_depth', [None, 2, 5, 7, 10, 20, 30]),
        'trees__min_samples_split': trial.suggest_int('trees__min_samples_split', 2, 25),
        'trees__min_samples_leaf': trial.suggest_int('trees__min_samples_leaf', 2, 25),
        'trees__splitter': trial.suggest_categorical('trees__splitter', ['best', 'random']),
        'trees__criterion': trial.suggest_categorical('trees__criterion', ['log_loss', 'gini', 'entropy'])
    })

    return param_grid
def param_grid_trees_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'trees__max_depth': trial.suggest_categorical('trees__max_depth', [None, 2, 5, 7, 10, 20, 30]),
        'trees__min_samples_split': trial.suggest_int('trees__min_samples_split', 2, 25),
        'trees__min_samples_leaf': trial.suggest_int('trees__min_samples_leaf', 2, 25),
        'trees__splitter': trial.suggest_categorical('trees__splitter', ['best', 'random']),
        'trees__criterion': trial.suggest_categorical('trees__criterion', ['log_loss', 'gini', 'entropy'])
    })

    return param_grid
def param_grid_extra_trees_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'extra_trees__n_estimators': trial.suggest_categorical('extra_trees__n_estimators', [30, 50, 75, 100, 120]),
        'extra_trees__max_depth': trial.suggest_categorical('extra_trees__max_depth', [3, 5, 7, 10, 20, 30]),
        'extra_trees__min_samples_split': trial.suggest_int('extra_trees__min_samples_split', 2, 20),
        'extra_trees__min_samples_leaf': trial.suggest_int('extra_trees__min_samples_leaf', 2, 20),
        'extra_trees__criterion': trial.suggest_categorical('extra_trees__criterion', ['gini']),
        'extra_trees__max_features': trial.suggest_categorical('extra_trees__max_features', [0.7, 0.8, 0.9, 1.0])
    })
    
    return param_grid
def param_grid_extra_trees_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'extra_trees__n_estimators': trial.suggest_categorical('extra_trees__n_estimators', [30, 50, 75, 100, 120]),
        'extra_trees__max_depth': trial.suggest_categorical('extra_trees__max_depth', [3, 5, 7, 10, 20, 30]),
        'extra_trees__min_samples_split': trial.suggest_int('extra_trees__min_samples_split', 2, 20),
        'extra_trees__min_samples_leaf': trial.suggest_int('extra_trees__min_samples_leaf', 2, 20),
        'extra_trees__criterion': trial.suggest_categorical('extra_trees__criterion', ['gini']),
        'extra_trees__max_features': trial.suggest_categorical('extra_trees__max_features', [0.7, 0.8, 0.9, 1.0])
    })
    
    return param_grid
def param_grid_HGB_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'HGB__max_depth': trial.suggest_categorical('HGB__max_depth', [5, 10, 20, 30, 40, 50]),
        'HGB__l2_regularization': trial.suggest_float('HGB__l2_regularization', 0.01, 0.7, log=True),
        'HGB__max_iter': trial.suggest_categorical('HGB__max_iter', [50, 70, 100, 130, 150])
    })

    return param_grid
def param_grid_HGB_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'HGB__max_depth': trial.suggest_categorical('HGB__max_depth', [5, 10, 20, 30, 40, 50]),
        'HGB__l2_regularization': trial.suggest_float('HGB__l2_regularization', 0.01, 0.7, log=True),
        'HGB__max_iter': trial.suggest_categorical('HGB__max_iter', [50, 70, 100, 130, 150])
    })

    return param_grid
def param_grid_XGB_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'XGB__max_depth': trial.suggest_categorical('XGB__max_depth', [10, 20, 30, 40, 50, 70, 100]),
        'XGB__reg_lambda': trial.suggest_float('XGB__reg_lambda', 0, 1, step=0.05, log=False),
        'XGB__n_estimators': trial.suggest_categorical('XGB__n_estimators', [50, 70, 100, 130, 150]),
        'XGB__eta': trial.suggest_float('XGB__eta', 0, 0.3, step=0.02, log=False),
        'XGB__alpha': trial.suggest_float('XGB__alpha', 0.2, 1, step=0.01, log=False)
    })

    return param_grid
def param_grid_XGB_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'XGB__max_depth': trial.suggest_categorical('XGB__max_depth', [10, 20, 30, 40, 50, 70, 100]),
        'XGB__reg_lambda': trial.suggest_float('XGB__reg_lambda', 0, 1, step=0.05, log=False),
        'XGB__n_estimators': trial.suggest_categorical('XGB__n_estimators', [50, 70, 100, 130, 150]),
        'XGB__eta': trial.suggest_float('XGB__eta', 0, 0.3, step=0.02, log=False),
        'XGB__alpha': trial.suggest_float('XGB__alpha', 0.2, 1, step=0.01, log=False)
    })

    return param_grid
def param_grid_RF_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'RF__n_estimators': trial.suggest_categorical('RF__n_estimators', [30, 50, 75, 100, 120, 150, 200, 250]),
        'RF__max_depth': trial.suggest_categorical('RF__max_depth', [3, 4, 5, 7, 10, 20, 30]),
        'RF__min_samples_split': trial.suggest_int('RF__min_samples_split', 2, 20),
        'RF__min_samples_leaf': trial.suggest_int('RF__min_samples_leaf', 2, 20),
        'RF__criterion': trial.suggest_categorical('RF__criterion', ['gini', 'entropy']),
    })
    
    return param_grid
def param_grid_RF_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'RF__n_estimators': trial.suggest_categorical('RF__n_estimators', [30, 50, 75, 100, 120, 150, 200, 250]),
        'RF__max_depth': trial.suggest_categorical('RF__max_depth', [3, 4, 5, 7, 10, 20, 30]),
        'RF__min_samples_split': trial.suggest_int('RF__min_samples_split', 2, 20),
        'RF__min_samples_leaf': trial.suggest_int('RF__min_samples_leaf', 2, 20),
        'RF__criterion': trial.suggest_categorical('RF__criterion', ['gini', 'entropy']),
    })
    
    return param_grid
def param_grid_linear_SVM_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'LinearSVM__C': trial.suggest_float('LinearSVM__C', 0.001, 2, log=True),
        'LinearSVM__class_weight': trial.suggest_categorical('LinearSVM__class_weight', ['balanced', None])
    })

    return param_grid
def param_grid_linear_SVM_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'LinearSVM__C': trial.suggest_float('LinearSVM__C', 0.001, 2, log=True),
        'LinearSVM__class_weight': trial.suggest_categorical('LinearSVM__class_weight', ['balanced', None])
    })

    return param_grid
def param_grid_MLP_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'MLP__learning_rate_init': trial.suggest_float('MLP__learning_rate_init', 0.0001, 0.2, log=True),
        'MLP__alpha': trial.suggest_float('MLP__alpha', 0.01, 1, log=True)
    })

    return param_grid
def param_grid_MLP_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'MLP__learning_rate_init': trial.suggest_float('MLP__learning_rate_init', 0.0001, 0.2, log=True),
        'MLP__alpha': trial.suggest_float('MLP__alpha', 0.01, 1, log=True)
    })

    return param_grid
def param_grid_logistic_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'Logistic__penalty':  trial.suggest_categorical('Logistic__penalty', ['l1', 'l2', 'elasticnet', None]),
        'Logistic__C': trial.suggest_float('Logistic__C', 0.001, 2, log=True),
        'Logistic__class_weight': trial.suggest_categorical('Logistic__class_weight', ['balanced', None])
    })

    if param_grid['Logistic__penalty'] == 'elasticnet':
        param_grid.update({'Logistic__l1_ratio': trial.suggest_float('Logistic__l1_ratio', 0.1, 1, log=True)})

    return param_grid
def param_grid_logistic_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'Logistic__penalty':  trial.suggest_categorical('Logistic__penalty', ['l1', 'l2', 'elasticnet', None]),
        'Logistic__C': trial.suggest_float('Logistic__C', 0.001, 2, log=True),
        'Logistic__class_weight': trial.suggest_categorical('Logistic__class_weight', ['balanced', None])
    })

    if param_grid['Logistic__penalty'] == 'elasticnet':
        param_grid.update({'Logistic__l1_ratio': trial.suggest_float('Logistic__l1_ratio', 0.1, 1, log=True)})

    return param_grid
def param_grid_SVM_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'SVM__C': trial.suggest_float('SVM__C', 0.1, 5, log=True),
        'SVM__kernel': trial.suggest_categorical('SVM__kernel', ['poly', 'rbf', 'sigmoid']),
    })

    if param_grid['SVM__kernel'] == 'poly':

        param_grid.update({
            'SVM__degree': trial.suggest_int('SVM__degree', 1, 5)
        })

    return param_grid
def param_grid_SVM_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'SVM__C': trial.suggest_float('SVM__C', 0.1, 5, log=True),
        'SVM__kernel': trial.suggest_categorical('SVM__kernel', ['poly', 'rbf', 'sigmoid']),
    })

    if param_grid['SVM__kernel'] == 'poly':

        param_grid.update({
            'SVM__degree': trial.suggest_int('SVM__degree', 1, 5)
        })

    return param_grid
def param_grid_LGBM_HOG_reshaped(trial):

    param_grid = preprocessing_HOG_reshaped_param_grid(trial)

    param_grid.update({
        'LGBM__max_depth': trial.suggest_int('LGBM__max_depth', 2, 200),
        'LGBM__num_leaves': trial.suggest_int('LGBM__num_leaves', 2, 200),
        'LGBM__n_estimators': trial.suggest_categorical('LGBM__n_estimators', [30, 50, 70, 100, 120, 150, 180, 200, 250, 300]),
        'LGBM__learning_rate': trial.suggest_float('LGBM__learning_rate', 0.0001, 0.1, log=True),
        'LGBM__lambda_l1': trial.suggest_float('LGBM__lambda_l1', 0.001, 1, log=True),
        'LGBM__lambda_l2': trial.suggest_float('LGBM__lambda_l2', 0.001, 1, log=True),
        'LGBM__min_split_gain': trial.suggest_float('LGBM__min_split_gain', 0.001, 0.01, log=True),
        'LGBM__min_child_weight': trial.suggest_int('LGBM__min_child_weight', 5, 60),
        'LGBM__feature_fraction': trial.suggest_float('LGBM__feature_fraction', 0.1, 0.9, step=0.05)
    })

    return param_grid
def param_grid_LGBM_HOG_not_reshaped(trial):

    param_grid = preprocessing_HOG_not_reshaped_param_grid(trial)

    param_grid.update({
        'LGBM__max_depth': trial.suggest_int('LGBM__max_depth', 2, 200),
        'LGBM__num_leaves': trial.suggest_int('LGBM__num_leaves', 2, 200),
        'LGBM__n_estimators': trial.suggest_categorical('LGBM__n_estimators', [30, 50, 70, 100, 120, 150, 180, 200, 250, 300]),
        'LGBM__learning_rate': trial.suggest_float('LGBM__learning_rate', 0.0001, 0.1, log=True),
        'LGBM__lambda_l1': trial.suggest_float('LGBM__lambda_l1', 0.001, 1, log=True),
        'LGBM__lambda_l2': trial.suggest_float('LGBM__lambda_l2', 0.001, 1, log=True),
        'LGBM__min_split_gain': trial.suggest_float('LGBM__min_split_gain', 0.001, 0.01, log=True),
        'LGBM__min_child_weight': trial.suggest_int('LGBM__min_child_weight', 5, 60),
        'LGBM__feature_fraction': trial.suggest_float('LGBM__feature_fraction', 0.1, 0.9, step=0.05)
    })

    return param_grid

Hyper-parameter Optimization (HPO)#

model_name = 'knn'
dict_name = 'knn-HOG-reshaped'
param_grid = param_grid_knn_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'knn'
dict_name = 'knn-HOG-not-reshaped'
param_grid = param_grid_knn_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'trees'
dict_name = 'trees-HOG-reshaped'
param_grid = param_grid_trees_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'trees'
dict_name = 'trees-HOG-not-reshaped'
param_grid = param_grid_trees_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'extra_trees'
dict_name = 'extra-trees-HOG-reshaped'
param_grid = param_grid_extra_trees_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=15, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'extra_trees'
dict_name = 'extra-trees-HOG-not-reshaped'
param_grid = param_grid_extra_trees_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'HGB'
dict_name = 'HGB-HOG-reshaped'
param_grid = param_grid_HGB_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=15, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'HGB'
dict_name = 'HGB-HOG-not-reshaped'
param_grid = param_grid_HGB_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'RF'
dict_name = 'RF-HOG-reshaped'
param_grid = param_grid_RF_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'RF'
dict_name = 'RF-HOG-not-reshaped'
param_grid = param_grid_RF_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'XGB'
dict_name = 'XGB-HOG-reshaped'
param_grid = param_grid_XGB_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'XGB'
dict_name = 'XGB-HOG-not-reshaped'
param_grid = param_grid_XGB_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'Logistic'
dict_name = 'Logistic-HOG-reshaped'
param_grid = param_grid_logistic_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'Logistic'
dict_name = 'Logistic-HOG-not-reshaped'
param_grid = param_grid_logistic_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=30, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LinearSVM'
dict_name = 'LinearSVM-HOG-reshaped'
param_grid = param_grid_linear_SVM_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LinearSVM'
dict_name = 'LinearSVM-HOG-not-reshaped'
param_grid = param_grid_linear_SVM_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'SVM'
dict_name = 'SVM-HOG-reshaped'
param_grid = param_grid_SVM_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'SVM'
dict_name = 'SVM-HOG-not-reshaped'
param_grid = param_grid_SVM_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LGBM'
dict_name = 'LGBM-HOG-reshaped'
param_grid = param_grid_LGBM_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LGBM'
dict_name = 'LGBM-HOG-not-reshaped'
param_grid = param_grid_LGBM_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=25, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'MLP'
dict_name = 'MLP-HOG-reshaped'
param_grid = param_grid_MLP_HOG_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'MLP'
dict_name = 'MLP-HOG-not-reshaped'
param_grid = param_grid_MLP_HOG_not_reshaped

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=10, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results

Applying inner evaluation with CNN features#

In this section, inner evaluation is applied considering only the CNN method for feature extraction.

We could run a more general inner evaluation that treats the feature extraction method itself as a hyper-parameter to be optimized. Instead, we prefer to apply a more exhaustive hyper-parameter search for each specific feature extraction method among the three addressed in this project, since we consider this a more appropriate way to reach a better model, and also to understand how the different parameters and alternatives behave with each feature extraction method, so that we can assess more precisely how those parameters affect each method.
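For reference, that alternative would look roughly like the sketch below: a single preprocessing grid in which the feature extraction method itself becomes one more categorical hyper-parameter. This is only an illustration of the idea, not what is actually run in this notebook; the method labels shown ('HOG', 'CNN') are the ones used in this section, and any further method would simply be added to the list.

def preprocessing_any_method_param_grid(trial):

    # Hypothetical sketch: the extraction method is a categorical choice for Optuna,
    # instead of being fixed as in the per-method grids used in this notebook.
    param_grid = {
        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['HOG', 'CNN']),
        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),
        'pca__apply': trial.suggest_categorical('pca__apply', [True, False])
    }

    # Conditioned part: only suggest parameters for steps that are actually applied.
    if param_grid['scaler__apply']:
        param_grid['scaler__method'] = trial.suggest_categorical('scaler__method', ['standard', 'min-max'])

    if param_grid['pca__apply']:
        param_grid['pca__n_components'] = trial.suggest_int('pca__n_components', 2, 150)

    return param_grid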

Grids for HPO#

Defining the preprocessing grid for hyper-parameter optimization (HPO), fixing the feature extraction method to CNN.

def preprocessing_CNN_param_grid(trial):

    # Fixed Grid
    param_grid = {
        'feature_extraction__method': trial.suggest_categorical('feature_extraction__method', ['CNN']),
        'scaler__apply': trial.suggest_categorical('scaler__apply', [True, False]),
        'pca__apply': trial.suggest_categorical('pca__apply', [True]) # Forcing PCA to combat the high dimensionality of the CNN feature space (large p).
    }

    # Conditioned Grid
     
    ################################################################################################################ 
    if param_grid['scaler__apply']:

        param_grid.update({'scaler__method': trial.suggest_categorical('scaler__method', ['standard', 'min-max'])})

    ################################################################################################################
    if param_grid['pca__apply']:

        param_grid.update({'pca__n_components': trial.suggest_int('pca__n_components', 2, 150)})

    return param_grid
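As a quick sanity check, a grid function like this can be sampled with a standalone Optuna study before handing it to SimpleEvaluation. The objective below is purely illustrative (it returns a constant score); it only shows that the conditional grid produces consistent parameter dictionaries.

import optuna

def dummy_objective(trial):
    # Sample one configuration from the conditional grid and inspect it.
    params = preprocessing_CNN_param_grid(trial)
    print(params)  # e.g. {'feature_extraction__method': 'CNN', 'scaler__apply': True, ...}
    return 0.0

study = optuna.create_study(direction='maximize')
study.optimize(dummy_objective, n_trials=3)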

Defining grids for Machine Learning models.

def param_grid_knn_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'knn__n_neighbors': trial.suggest_int('knn__n_neighbors', 1, 25),
        'knn__metric': trial.suggest_categorical('knn__metric', ['cosine', 'minkowski', 'cityblock'])
    })

    if param_grid['knn__metric'] == 'minkowski':
        param_grid['knn__p'] = trial.suggest_int('knn__p', 1, 4)

    return param_grid
def param_grid_extra_trees_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'extra_trees__n_estimators': trial.suggest_categorical('extra_trees__n_estimators', [30, 50, 75, 100, 120]),
        'extra_trees__max_depth': trial.suggest_categorical('extra_trees__max_depth', [3, 5, 7, 10, 20, 30]),
        'extra_trees__min_samples_split': trial.suggest_int('extra_trees__min_samples_split', 2, 20),
        'extra_trees__min_samples_leaf': trial.suggest_int('extra_trees__min_samples_leaf', 2, 20),
        'extra_trees__criterion': trial.suggest_categorical('extra_trees__criterion', ['gini']),
        'extra_trees__max_features': trial.suggest_categorical('extra_trees__max_features', [0.7, 0.8, 0.9, 1.0])
    })
    
    return param_grid
def param_grid_XGB_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'XGB__max_depth': trial.suggest_categorical('XGB__max_depth', [10, 20, 30, 40, 50, 70, 100]),
        'XGB__reg_lambda': trial.suggest_float('XGB__reg_lambda', 0, 1, step=0.05, log=False),
        'XGB__n_estimators': trial.suggest_categorical('XGB__n_estimators', [50, 70, 100, 130, 150]),
        'XGB__eta': trial.suggest_float('XGB__eta', 0, 0.3, step=0.02, log=False),
        'XGB__alpha': trial.suggest_float('XGB__alpha', 0.2, 1, step=0.01, log=False)
    })

    return param_grid
def param_grid_RF_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'RF__n_estimators': trial.suggest_categorical('RF__n_estimators', [30, 50, 75, 100, 120, 150, 200, 250]),
        'RF__max_depth': trial.suggest_categorical('RF__max_depth', [3, 4, 5, 7, 10, 20, 30]),
        'RF__min_samples_split': trial.suggest_int('RF__min_samples_split', 2, 20),
        'RF__min_samples_leaf': trial.suggest_int('RF__min_samples_leaf', 2, 20),
        'RF__criterion': trial.suggest_categorical('RF__criterion', ['gini', 'entropy']),
    })
    
    return param_grid
def param_grid_linear_SVM_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'LinearSVM__C': trial.suggest_float('LinearSVM__C', 0.001, 2, log=True),
        'LinearSVM__class_weight': trial.suggest_categorical('LinearSVM__class_weight', ['balanced', None])
    })

    return param_grid
def param_grid_MLP_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'MLP__learning_rate_init': trial.suggest_float('MLP__learning_rate_init', 0.0001, 0.2, log=True),
        'MLP__alpha': trial.suggest_float('MLP__alpha', 0.01, 1, log=True)
    })

    return param_grid
def param_grid_HGB_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'HGB__max_depth': trial.suggest_categorical('HGB__max_depth', [5, 10, 20, 30, 40, 50]),
        'HGB__l2_regularization': trial.suggest_float('HGB__l2_regularization', 0.01, 0.7, log=True),
        'HGB__max_iter': trial.suggest_categorical('HGB__max_iter', [50, 70, 100, 130, 150])
    })

    return param_grid
def param_grid_logistic_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'Logistic__penalty':  trial.suggest_categorical('Logistic__penalty', ['l1', 'l2', 'elasticnet', None]),
        'Logistic__C': trial.suggest_float('Logistic__C', 0.001, 2, log=True),
        'Logistic__class_weight': trial.suggest_categorical('Logistic__class_weight', ['balanced', None])
    })

    if param_grid['Logistic__penalty'] == 'elasticnet':
        param_grid.update({'Logistic__l1_ratio': trial.suggest_float('Logistic__l1_ratio', 0.1, 1, log=True)})

    return param_grid
def param_grid_SVM_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'SVM__C': trial.suggest_float('SVM__C', 0.1, 5, log=True),
        'SVM__kernel': trial.suggest_categorical('SVM__kernel', ['poly', 'rbf', 'sigmoid']),
    })

    if param_grid['SVM__kernel'] == 'poly':

        param_grid.update({
            'SVM__degree': trial.suggest_int('SVM__degree', 1, 5)
        })

    return param_grid
def param_grid_LGBM_CNN(trial):

    param_grid = preprocessing_CNN_param_grid(trial)

    param_grid.update({
        'LGBM__max_depth': trial.suggest_int('LGBM__max_depth', 2, 200),
        'LGBM__num_leaves': trial.suggest_int('LGBM__num_leaves', 2, 200),
        'LGBM__n_estimators': trial.suggest_categorical('LGBM__n_estimators', [30, 50, 70, 100, 120, 150, 180, 200, 250, 300]),
        'LGBM__learning_rate': trial.suggest_float('LGBM__learning_rate', 0.0001, 0.1, log=True),
        'LGBM__lambda_l1': trial.suggest_float('LGBM__lambda_l1', 0.001, 1, log=True),
        'LGBM__lambda_l2': trial.suggest_float('LGBM__lambda_l2', 0.001, 1, log=True),
        'LGBM__min_split_gain': trial.suggest_float('LGBM__min_split_gain', 0.001, 0.01, log=True),
        'LGBM__min_child_weight': trial.suggest_int('LGBM__min_child_weight', 5, 60),
        'LGBM__feature_fraction': trial.suggest_float('LGBM__feature_fraction', 0.1, 0.9, step=0.05)
    })

    return param_grid

Hyper-parameter Optimization (HPO)#

model_name = 'knn'
dict_name = 'knn-CNN'
param_grid = param_grid_knn_CNN

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=5, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'RF'
dict_name = 'RF-CNN'
param_grid = param_grid_RF_CNN

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=2, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'Logistic'
dict_name = 'Logistic-CNN'
param_grid = param_grid_logistic_CNN

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=1, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'LinearSVM'
dict_name = 'LinearSVM-CNN'
param_grid = param_grid_linear_SVM_CNN

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=1, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
model_name = 'MLP'
dict_name = 'MLP-CNN'
param_grid = param_grid_MLP_CNN

simple_eval = SimpleEvaluation(estimator=pipelines[model_name],  
                               cv=inner, 
                               param_grid=param_grid,
                               search_method='optuna',
                               scoring='accuracy', 
                               direction='maximize', 
                               n_trials=1, 
                               random_state=123)

simple_eval.fit(X=X_train, y=Y_train)

inner_score[dict_name] = simple_eval.inner_score
best_params[dict_name] = simple_eval.inner_best_params
inner_results[dict_name] = simple_eval.inner_results
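The repetitive cells above could be condensed into a single loop over (model, name, grid, trials) configurations; the sketch below is equivalent under the assumption that pipelines, inner, the training data and the result dictionaries are already defined as in this notebook.

hpo_configs = [
    ('knn', 'knn-CNN', param_grid_knn_CNN, 5),
    ('RF', 'RF-CNN', param_grid_RF_CNN, 2),
    ('Logistic', 'Logistic-CNN', param_grid_logistic_CNN, 1),
    ('LinearSVM', 'LinearSVM-CNN', param_grid_linear_SVM_CNN, 1),
    ('MLP', 'MLP-CNN', param_grid_MLP_CNN, 1),
]

for model_name, dict_name, param_grid, n_trials in hpo_configs:

    simple_eval = SimpleEvaluation(estimator=pipelines[model_name],
                                   cv=inner,
                                   param_grid=param_grid,
                                   search_method='optuna',
                                   scoring='accuracy',
                                   direction='maximize',
                                   n_trials=n_trials,
                                   random_state=123)

    simple_eval.fit(X=X_train, y=Y_train)

    # Store the results under the same keys as the individual cells above.
    inner_score[dict_name] = simple_eval.inner_score
    best_params[dict_name] = simple_eval.inner_best_params
    inner_results[dict_name] = simple_eval.inner_results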
  • Saving the inner results

'''
import pickle

with open('results/inner_score.pkl', 'wb') as file:
    pickle.dump(inner_score, file)
with open('results/best_params.pkl', 'wb') as file:
    pickle.dump(best_params, file)
with open('results/inner_results.pkl', 'wb') as file:
    pickle.dump(inner_results, file)
'''
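If the results are saved, they can later be loaded back in the same way (a minimal sketch, assuming the same file paths as above):

import pickle

with open('results/inner_score.pkl', 'rb') as file:
    inner_score = pickle.load(file)
with open('results/best_params.pkl', 'rb') as file:
    best_params = pickle.load(file)
with open('results/inner_results.pkl', 'rb') as file:
    inner_results = pickle.load(file)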