Imbalanced classification problem

import numpy as np

# Count how many observations fall in each class of the response Y
unique_values, frequencies = np.unique(Y, return_counts=True)
print(f'Unique values of the response and their frequencies in the available data: {dict(zip(unique_values, frequencies))}')
Unique values of the response and their frequencies in the available data: {1: 300, 2: 1836, 3: 2613, 4: 2536, 5: 1551}

The classification problem is imbalanced, since the class frequencies of the response variable are far from uniform: class 1 has only 300 observations, whereas classes 3 and 4 have more than 2,500 each.

This is an important fact to consider, since it affects both the ML models and the error metric.

  • Error metric: in imbalanced problems accuracy is biased towards the dominant classes of the response, so it mostly reflects the predictive performance of the models on those classes. A model that performs very well on the dominant classes but poorly on the minority ones can therefore still obtain a high accuracy that overstates its real quality. To mitigate this issue, balanced accuracy (the average of the per-class recalls) will be used.

  • Models: in imbalanced classification problems ML models tend to learn mainly the patterns of the dominant classes, and as a result they classify instances of those classes well but instances of the minority classes poorly. There are several strategies to overcome this issue, such as over- or under-sampling and weighting observations according to the class frequencies. In this project the second approach (weighting) will be considered for certain models.
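To make the point about the error metric concrete, the following is a minimal sketch in plain Python. It rebuilds a label vector with the class frequencies reported above and scores a hypothetical classifier (an assumption made only for illustration, not one of the project's models) that is perfect on classes 2 to 5 but always mislabels class 1 as class 3. The helper functions `accuracy` and `balanced_accuracy` are written by hand here; scikit-learn's `balanced_accuracy_score` computes the same quantity.

```python
from collections import Counter

# Class counts reported above for the response variable.
class_counts = {1: 300, 2: 1836, 3: 2613, 4: 2536, 5: 1551}

# Rebuild a label vector with exactly those frequencies.
y_true = [label for label, count in class_counts.items() for _ in range(count)]

# Hypothetical classifier: perfect on classes 2-5, but it always
# predicts class 3 for observations whose true class is 1.
y_pred = [3 if label == 1 else label for label in y_true]


def accuracy(y_true, y_pred):
    """Fraction of observations classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy: {acc:.3f}, balanced accuracy: {bal_acc:.3f}")
# prints: accuracy: 0.966, balanced accuracy: 0.800
```

Accuracy barely notices that an entire class is never predicted correctly, because class 1 accounts for only 300 of the 8,836 observations. Balanced accuracy gives every class the same weight, so the failure costs a full fifth of the score.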
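The weighting strategy can be sketched in the same way. The text does not say which weighting formula the project uses, so the snippet below assumes one common heuristic, the inverse-frequency rule `n_samples / (n_classes * count)`, which is also what scikit-learn applies when an estimator is given `class_weight="balanced"`. The variable names are illustrative.

```python
# Class counts reported above for the response variable.
class_counts = {1: 300, 2: 1836, 3: 2613, 4: 2536, 5: 1551}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Assumed "balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
# Rare classes get weights above 1, frequent classes get weights below 1.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under this rule an observation of class 1 weighs roughly 5.9, against roughly 0.68 for an observation of class 3, so each class contributes the same total weight to the training loss. A dictionary like `class_weights` can typically be passed to scikit-learn estimators that accept a `class_weight` argument.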