models
#
FastKmedoidsGGower
#
Implements the Fast-K-medoids algorithm based on the Generalized Gower distance.
-------------------
Constructor method
-------------------
Parameters: (inputs)
-----------
n_clusters: the number of clusters.
method: the k-medoids clustering method. Must be in ['pam', 'alternate']. PAM is the classic one, more accurate but slower.
init: the k-medoids initialization method. Must be in ['heuristic', 'random']. Heuristic is the classic one, smarter burt slower.
max_iter: the maximum number of iterations run by k-medodis.
frac_sample_size: the sample size in proportional terms.
p1, p2, p3: number of quantitative, binary and multi-class variables in the considered data matrix, respectively. Must be a non negative integer.
d1: name of the distance to be computed for quantitative variables. Must be an string in ['euclidean', 'minkowski', 'canberra', 'mahalanobis', 'robust_mahalanobis'].
d2: name of the distance to be computed for binary variables. Must be an string in ['sokal', 'jaccard'].
d3: name of the distance to be computed for multi-class variables. Must be an string in ['matching'].
q: the parameter that defines the Minkowski distance. Must be a positive integer.
robust_method: the method to be used for computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
alpha : a real number in [0,1] that is used if `method` is 'trimmed' or 'winsorized'. Only needed when d1 = 'robust_mahalanobis'.
epsilon: parameter used by the Delvin algorithm that is used when computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
n_iters: maximum number of iterations used by the Delvin algorithm. Only needed when d1 = 'robust_mahalanobis'.
fast_VG: whether the geometric variability estimation will be full (False) or fast (True).
VG_sample_size: sample size to be used to make the estimation of the geometric variability.
VG_n_samples: number of samples to be used to make the estimation of the geometric variability.
random_state: the random seed used for the (random) sample elements.
y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
-----------
Fit method:
-----------
Fits the fast k-medoids algorithm to `X`, and `y`, if needed.
Parameters: (inputs)
-----------
X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
y: a pandas/polars series or a numpy array. Represents a response variable. Is not required.
weights: the sample weights. Only used if provided and d1 = 'robust_mahalanobis'.
---------------
Predict method:
---------------
Predicts clusters for `X` observation by assigning them to their nearest cluster (medoid) according to Generalized Gower distance.
Parameters: (inputs)
-----------
X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
Returns: (outputs)
--------
predicted_clusters: the predicted clusters of each observation of `X`.
Example#
import pandas as pd
from FastKmedoids.models import FastKmedoidsGGower
data_url = "https://raw.githubusercontent.com/FabioScielzoOrtiz/FastKmedoids-demo/refs/heads/main/data/madrid_houses_processed.csv"
quant_cols = ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors', 'buy_price']
binary_cols = ['is_renewal_needed', 'has_lift', 'is_exterior', 'has_parking']
multiclass_cols = ['energy_certificate', 'house_type']
p1 = len(quant_cols)
p2 = len(binary_cols)
p3 = len(multiclass_cols)
data = pd.read_csv(data_url)
fast_kmedoids = FastKmedoidsGGower(n_clusters=3, method='pam', init='heuristic', max_iter=100, random_state=123,
frac_sample_size=0.01, p1=5, p2=4, p3=2,
d1='robust_mahalanobis', d2='jaccard', d3='hamming',
robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20)
fast_kmedoids.fit(X=data_pd)
fast_kmedoids.labels_
array([2, 1, 1, ..., 0, 0, 0], dtype=int64)
fast_kmedoids = FastKmedoidsGGower(n_clusters=3, method='pam', init='heuristic', max_iter=100, random_state=123,
frac_sample_size=0.01, p1=5, p2=4, p3=2,
d1='euclidean', d2='sokal', d3='hamming',
robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20)
fast_kmedoids.fit(X=data_pd)
fast_kmedoids.labels_
array([0, 2, 2, ..., 0, 1, 2], dtype=int64)
fast_kmedoids = FastKmedoidsGGower(n_clusters=3, method='pam', init='heuristic', max_iter=100, random_state=123,
frac_sample_size=0.01, p1=5, p2=4, p3=2,
d1='canberra', d2='sokal', d3='hamming',
robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20)
fast_kmedoids.fit(X=data_pd)
fast_kmedoids.labels_
array([1, 0, 0, ..., 0, 2, 0], dtype=int64)
FoldFastKmedoidsGGower
#
Implements the K-Fold Fast-K-medoids algorithm based on the Generalized Gower distance.
-------------------
Constructor method
-------------------
Parameters: (inputs)
-----------
n_clusters: the number of clusters.
method: the k-medoids clustering method. Must be in ['pam', 'alternate']. PAM is the classic one, more accurate but slower.
init: the k-medoids initialization method. Must be in ['heuristic', 'random']. Heuristic is the classic one, smarter burt slower.
max_iter: the maximum number of iterations run by k-medodis.
frac_sample_size: the sample size in proportional terms.
p1, p2, p3: number of quantitative, binary and multi-class variables in the considered data matrix, respectively. Must be a non negative integer.
d1: name of the distance to be computed for quantitative variables. Must be an string in ['euclidean', 'minkowski', 'canberra', 'mahalanobis', 'robust_mahalanobis'].
d2: name of the distance to be computed for binary variables. Must be an string in ['sokal', 'jaccard'].
d3: name of the distance to be computed for multi-class variables. Must be an string in ['matching'].
q: the parameter that defines the Minkowski distance. Must be a positive integer.
robust_method: the method to be used for computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
alpha : a real number in [0,1] that is used if `method` is 'trimmed' or 'winsorized'. Only needed when d1 = 'robust_mahalanobis'.
epsilon: parameter used by the Delvin algorithm that is used when computing the robust covariance matrix. Only needed when d1 = 'robust_mahalanobis'.
n_iters: maximum number of iterations used by the Delvin algorithm. Only needed when d1 = 'robust_mahalanobis'.
fast_VG: whether the geometric variability estimation will be full (False) or fast (True).
VG_sample_size: sample size to be used to make the estimation of the geometric variability.
VG_n_samples: number of samples to be used to make the estimation of the geometric variability.
random_state: the random seed used for the (random) sample elements.
y_type: the type of response variable. Must be in ['quantitative', 'binary', 'multiclass'].
n_splits: number of folds to be used.
shuffle: whether data is shuffled before applying KFold or not, must be in [True, False].
kfold_random_state: the random seed for KFold if shuffle = True.
-----------
Fit method:
-----------
Fits the fast k-medoids algorithm to `X` (and `y` if needed).
Parameters: (inputs)
-----------
X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
y: a pandas/polars series or a numpy array. Represents a response variable. Is not required.
weights: the sample weights. Only used if provided and d1 = 'robust_mahalanobis'.
---------------
Predict method:
---------------
Predicts clusters for `X` observation by assigning them to their nearest cluster (medoid) according to Generalized Gower distance.
Parameters: (inputs)
-----------
X: a pandas/polars data-frame or a numpy array. Represents a predictors matrix. Is required.
Example#
import pandas as pd
from FastKmedoids.models import FoldFastKmedoidsGGower
data_url = "https://raw.githubusercontent.com/FabioScielzoOrtiz/FastKmedoids-demo/refs/heads/main/data/madrid_houses_processed.csv"
quant_cols = ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors', 'buy_price']
binary_cols = ['is_renewal_needed', 'has_lift', 'is_exterior', 'has_parking']
multiclass_cols = ['energy_certificate', 'house_type']
p1 = len(quant_cols)
p2 = len(binary_cols)
p3 = len(multiclass_cols)
data = pd.read_csv(data_url)
kfold_fast_kmedoids = FoldFastKmedoidsGGower(n_clusters=3, method='pam', init='heuristic', max_iter=100, random_state=123,
frac_sample_size=0.1, n_splits=10, shuffle=True, kfold_random_state=123,
p1=p1, p2=p2, p3=p3, d1='robust_mahalanobis', d2='jaccard', d3='hamming',
robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20,
fast_VG=False, VG_sample_size=1000, VG_n_samples=5)
kfold_fast_kmedoids.fit(X=data)
kfold_fast_kmedoids.labels_
array([0, 1, 1, ..., 1, 1, 1])
kfold_fast_kmedoids = FoldFastKmedoidsGGower(n_clusters=3, method='pam', init='heuristic', max_iter=100, random_state=123,
frac_sample_size=0.1, n_splits=10, shuffle=True, kfold_random_state=123,
p1=p1, p2=p2, p3=p3, d1='euclidean', d2='sokal', d3='hamming',
robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20,
fast_VG=False, VG_sample_size=1000, VG_n_samples=5)
kfold_fast_kmedoids.fit(X=data_pd)
kfold_fast_kmedoids.labels_
array([1, 2, 2, ..., 2, 2, 2])
kfold_fast_kmedoids = FoldFastKmedoidsGGower(n_clusters=3, method='pam', init='heuristic', max_iter=100, random_state=123,
frac_sample_size=0.1, n_splits=10, shuffle=True, kfold_random_state=123,
p1=p1, p2=p2, p3=p3, d1='canberra', d2='sokal', d3='hamming',
robust_method='trimmed', alpha=0.05, epsilon=0.05, n_iters=20,
fast_VG=False, VG_sample_size=1000, VG_n_samples=5)
kfold_fast_kmedoids.fit(X=data_pd)
kfold_fast_kmedoids.labels_
array([2, 2, 2, ..., 0, 1, 2])