Scaling dataset
Scaling dataset columns
Scaling refers to transforming features so they have similar ranges or distributions. Common scaling methods include:
Min-Max Scaling (Normalization): Rescales features to a fixed range, usually [0, 1].
Standardization (Z-score Normalization): Transforms features to have a mean of 0 and a standard deviation of 1.
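To make the two methods concrete, here is a minimal Python sketch of the arithmetic behind each one. It is only an illustration of the formulas, not the implementation used by preprocess: the function names are hypothetical, and the choice of the population standard deviation (rather than the sample one) and the handling of constant columns are assumptions.

```python
from statistics import mean, pstdev

def minmax_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def zscore_scale(values):
    """Center values on 0 and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: same convention as above for a constant column
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [20, 25, 30, 35, 40]
scaled_minmax = minmax_scale(ages)
scaled_zscore = zscore_scale(ages)

print(scaled_minmax)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(scaled_zscore)  # values centered on 0 with unit standard deviation
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers, since a single extreme value sets the range; z-score scaling has no fixed bounds but is less affected by one extreme value.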
Why Scaling is Useful
Scaling is crucial for many machine learning algorithms, especially those sensitive to the magnitude of features:
Principal Component Analysis (PCA): PCA projects data onto directions of maximum variance. If features are on different scales, PCA will be biased toward features with larger magnitudes.
K-Nearest Neighbors (KNN) and K-Means Clustering: These rely on distance metrics. Unscaled features can dominate the distance computation.
Gradient Descent Optimization (used in Logistic Regression, Neural Networks, etc.): Feature scaling typically speeds up convergence because the loss function contours become more symmetrical, so a single learning rate suits all features.
Support Vector Machines (SVM): The margin is measured as a distance in feature space, so features with larger ranges dominate the margin and the choice of support vectors.
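The effect on distance-based methods can be illustrated with a small Python sketch. The three players and their values are invented for illustration (they are not taken from the fifa dataset), and the helper names are hypothetical. The sketch measures how much of the squared Euclidean distance between two players comes from age, before and after min-max scaling.

```python
def age_share(a, b):
    """Fraction of the squared Euclidean distance contributed by age (first feature)."""
    d_age = (a[0] - b[0]) ** 2
    d_wage = (a[1] - b[1]) ** 2
    return d_age / (d_age + d_wage)

def minmax_columns(rows):
    """Min-max scale each column of a list of equal-length tuples."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        scaled_cols.append([(v - lo) / (hi - lo) for v in col])
    return [tuple(row) for row in zip(*scaled_cols)]

# Hypothetical players as (age, wage_eur)
players = [
    (20, 50_000),
    (38, 51_000),  # very different age, similar wage
    (21, 60_000),  # similar age, different wage
]

raw_share = age_share(players[0], players[1])
scaled = minmax_columns(players)
scaled_share = age_share(scaled[0], scaled[1])

print(f"age share of distance, unscaled: {raw_share:.4%}")
print(f"age share of distance, scaled:   {scaled_share:.4%}")
```

Without scaling, the 18-year age gap between the first two players is almost invisible next to a 1,000 EUR wage difference; after scaling, it accounts for most of the distance between them.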
Using preprocess
Preprocess offers two ways to apply a scaling operation to your dataset:
- Selected numeric columns
- All numeric columns
On selected numeric columns
To scale a column, add an operation with op = "scale" and the chosen method: minmax or zscore.
Here is an example that applies minmax scaling to the numeric columns age and wage_eur of the fifa dataset.
[[preprocess.columns]]
name = "age"
operations = [
{op = "scale", method = "minmax"}
]
[[preprocess.columns]]
name = "wage_eur"
operations = [
{op = "scale", method = "minmax"}
]
You need to apply the operation to each column individually. This gives you full flexibility over which operations are applied to which column. For example, you may wish to apply the fillna operation to wage_eur in addition to the scale operation.
[[preprocess.columns]]
name = "wage_eur"
operations = [
{op = "fillna", method = "mean"},
{op = "scale", method = "minmax"}
]
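As a rough illustration of what chaining these two operations amounts to, the Python sketch below fills missing values with the column mean and then applies min-max scaling. It assumes that operations run in the order they are listed and that missing values are represented as None; both are assumptions made for this sketch, and the function names are hypothetical rather than part of preprocess.

```python
from statistics import mean

def fillna_mean(values):
    """Replace missing entries (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]

def minmax_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical wage_eur column with one missing value
wages = [1000, None, 3000, 5000]

filled = fillna_mean(wages)    # the gap is filled with the mean, 3000
scaled = minmax_scale(filled)  # then the complete column is rescaled

print(filled)  # [1000, 3000, 3000, 5000]
print(scaled)  # [0.0, 0.5, 0.5, 1.0]
```

In this sketch the order matters: calling minmax_scale on the column before filling would fail, because the minimum and maximum cannot be computed while a None is present.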
On all numeric columns
In some cases, you may prefer to apply the same operation to all numeric columns. To do this, apply the operation to the [preprocess.numerics] selector.
Here is an example:
[preprocess.numerics]
operations = [
{op = "scale", method = "minmax"}
]
In this case, you do not need to set a column name.
TOML also allows an alternative syntax for expressing this operation:
[[preprocess.numerics.operations]]
op = "scale"
method = "minmax"
Since [[preprocess.numerics.operations]] is a list of operations, you can use this syntax to define each operation individually. If you choose this syntax, repeat the [[preprocess.numerics.operations]] header once for each operation you want to add.
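One way to check that the two syntaxes are equivalent is to parse both with a TOML library and compare the results. The sketch below uses Python's standard tomllib module (available from Python 3.11; the fallback to the tomli backport is an assumption about older environments). It only shows how TOML itself reads the two forms, not how preprocess interprets them.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli backport installed on older Pythons

inline_form = """
[preprocess.numerics]
operations = [
  {op = "scale", method = "minmax"}
]
"""

table_form = """
[[preprocess.numerics.operations]]
op = "scale"
method = "minmax"
"""

# Two operations: the header is repeated once per operation
two_operations = """
[[preprocess.numerics.operations]]
op = "fillna"
method = "mean"

[[preprocess.numerics.operations]]
op = "scale"
method = "minmax"
"""

a = tomllib.loads(inline_form)
b = tomllib.loads(table_form)
c = tomllib.loads(two_operations)

print(a == b)  # True: both forms parse to the same structure
print(c["preprocess"]["numerics"]["operations"])
```

Both forms produce a list of tables under preprocess.numerics.operations, so the choice between them is a matter of readability.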
Last updated 31 May 2025, 21:32 +0200 .