Feature Vector Normalisation

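This post covers two techniques for normalising feature vectors: min-max normalisation and Z-score normalisation. Some machine learning algorithms need normalised inputs, typically those that are sensitive to the relative scale of the features.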

Min-max normalisation

Min-max normalisation is a simple kind of normalisation which maps the feature vector to have values between 0 and 1.

For a vector `X` the rescaled vector `X'` is given by:

`X' = (X - min(X)) / (max(X) - min(X))`

This formula is similar to dividing the vector by its max value, except that subtracting the min value in both the numerator and the denominator handles negative numbers correctly and makes the range start at 0.

Example:

`X = [80, 42, 91, 27, 92, 88, 2]`

`min(X) = 2, max(X) = 92`

`X -> [(80 - 2)/90,` ` (42 - 2)/90,` ` (91 - 2)/90,` ` (27 - 2)/90,` ` (92 - 2)/90,` ` (88 - 2)/90,` ` (2 - 2)/90]`

`X -> [0.867,` ` 0.444,` ` 0.989,` ` 0.278,` ` 1,` ` 0.956,` ` 0]`
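The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `min_max_normalise` is made up for illustration, and the guard against a constant vector (where `max(X) - min(X)` is 0) is an added assumption that the text above does not discuss.

```python
def min_max_normalise(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: treat a constant vector as an error, since the
        # formula would otherwise divide by zero.
        raise ValueError("min-max normalisation is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


X = [80, 42, 91, 27, 92, 88, 2]
scaled = min_max_normalise(X)
print([round(v, 3) for v in scaled])
# [0.867, 0.444, 0.989, 0.278, 1.0, 0.956, 0.0]
```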

Future values:

If you don't know all your data ahead of time then future observations may end up with more extreme values that exceed the min or max from your original population.

This would mean you'd end up with values outside the [0, 1] range.
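To make this concrete, here is a small sketch that reuses the min and max from the example vector to scale two later observations. The helper name `min_max_with_fixed_range` and the future values `101` and `-7` are hypothetical, chosen only to fall outside the original range.

```python
def min_max_with_fixed_range(value, lo, hi):
    """Scale one value using a min (lo) and max (hi) computed earlier."""
    return (value - lo) / (hi - lo)


X = [80, 42, 91, 27, 92, 88, 2]
lo, hi = min(X), max(X)  # 2 and 92, taken from the original data only

# Hypothetical later observations that are more extreme than anything in X.
for future in (101, -7):
    print(future, round(min_max_with_fixed_range(future, lo, hi), 3))
# 101 1.1
# -7 -0.1
```

Both results land outside [0, 1] because the scaling parameters were fixed before these values were seen.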

Z-score normalisation

Z-score normalisation is a feature scaling method where each value is converted to its Z-score.

A Z-score is the number of standard deviations a value is from the mean.

For a vector `X` the rescaled vector `Z` is given by:

`Z = (X - mu) / sigma`

Where `mu` is the mean of `X` and `sigma` is the standard deviation of `X`.

Example:

`X = [80, 40, 91, 27, 92, 88, 2]`

`sigma = 36.565, mu = 60` (here `sigma` is the sample standard deviation, computed with an n - 1 denominator)

`X -> [(80 - 60)/36.565,` ` (40 - 60)/36.565,` ` (91 - 60)/36.565,` ` (27 - 60)/36.565,` ` (92 - 60)/36.565,` ` (88 - 60)/36.565,` ` (2 - 60)/36.565]`

`X -> [0.547,` ` -0.547,` ` 0.848,` ` -0.903,` ` 0.875,` ` 0.766,` ` -1.586]`
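The same calculation can be sketched with Python's standard `statistics` module. The function name `z_scores` is made up for illustration. One choice is worth flagging: `statistics.stdev` is the sample standard deviation (n - 1 denominator), which is what gives 36.565 for this vector; `statistics.pstdev` (population standard deviation) would give roughly 33.85 instead, so which one to use is an assumption you should make deliberately.

```python
import statistics


def z_scores(values):
    """Convert each value to the number of standard deviations it lies from the mean."""
    mu = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator), matching sigma = 36.565 above.
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


X = [80, 40, 91, 27, 92, 88, 2]
print(statistics.mean(X))             # 60
print(round(statistics.stdev(X), 3))  # 36.565
print([round(z, 3) for z in z_scores(X)])
```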

Future values:

Since Z-scores have no lower or upper bound, we don't have to worry about future values falling outside a fixed range.

Population statistics vs future values:

For the Z-scores of future values to be valid, those values should come from a population with the same mean and standard deviation as the original one.
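In practice this usually means computing `mu` and `sigma` once from the original data and reusing them for every later observation. The sketch below assumes that workflow; the helper name `z_score` and the future values `75` and `130` are hypothetical.

```python
import statistics

# Statistics are computed once, from the original population only.
X = [80, 40, 91, 27, 92, 88, 2]
mu = statistics.mean(X)
sigma = statistics.stdev(X)  # sample standard deviation, as in the example above


def z_score(value, mu, sigma):
    """Z-score of a single value against previously stored statistics."""
    return (value - mu) / sigma


# Hypothetical later observations, scored against the stored mu and sigma.
future_values = [75, 130]
for v in future_values:
    print(v, round(z_score(v, mu, sigma), 3))
```

If later data drifts so that its own mean or standard deviation differs noticeably from the stored `mu` and `sigma`, these scores no longer mean "standard deviations from the mean" for the new data, which is the caveat described above.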