Most machine learning algorithms have a hard time dealing with features whose values span widely differing scales. As a result, it is important to scale the data before fitting and predicting.

The two most common methods are:

  • min-max scaling
  • standardization

Below, the various scaler estimators from sklearn are demonstrated. At the end of the page, the same transformations are made using function calls alone (without initializing an estimator).

Note:
When preparing data for a ML algorithm, fit the scaler to the training data only, then use that fitted scaler to transform both the training and validation data.
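
For illustration, a minimal sketch of that workflow (the feature matrix X and the split here are hypothetical stand-ins, not part of the example below):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.random((100, 3))                 # stand-in feature matrix
X_train, X_val = train_test_split(X, test_size=0.2)

scaler = StandardScaler()
scaler.fit(X_train)                            # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)         # reuse the same fitted scaler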

Generate data to scale

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Small values': np.random.random(3),
                   'Medium values': np.random.random(3)*10,
                   'Large values': np.random.random(3)*100})

df.head()
Out[1]:
Large values Medium values Small values
0 95.349624 8.042178 0.696948
1 61.598142 2.939906 0.206422
2 31.911581 7.073023 0.710572

Standard Scaler

Removes the mean and scales to ‘unit variance’ (i.e. every value is converted to its z-score: a value of 1 means the original value is one standard deviation above the mean of its column).

Parameters:

  • with_mean = True (centers values at 0 by default)
  • with_std = True (scales values to unit variance)
  • copy = True (creates a copy rather than scaling in place)
In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()   # Initialize scaler estimator

scaler.fit(df)   # Remember to fit the scaler to the training data only
Out[2]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [3]:
# Use the fitted scaler to transform the data
pd.DataFrame(scaler.transform(df),columns=df.columns)
Out[3]:
Large values Medium values Small values
0 1.250049 0.914729 0.677860
1 -0.052283 -1.391416 -1.413816
2 -1.197766 0.476687 0.735956
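
As a sanity check, the same z-scores can be reproduced by hand with the df from above. Note that StandardScaler uses the population standard deviation (ddof=0), while pandas defaults to ddof=1:

# Manual z-scores; should match the transform above
(df - df.mean()) / df.std(ddof=0)
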
In [4]:
# Retrieve parameters
scaler.get_params()
Out[4]:
{'copy': True, 'with_mean': True, 'with_std': True}
In [5]:
# Change parameters with set_params
scaler.set_params(with_mean=False)
Out[5]:
StandardScaler(copy=True, with_mean=False, with_std=True)

This scaler object also has a fit_transform method which performs the fit and transformation in one step. For convenience, the fits and transformations made below will use this method. (Since with_mean was set to False above, the values in the next output are only divided by their standard deviation, not centered at zero.)

In [6]:
pd.DataFrame(scaler.fit_transform(df),columns=df.columns)
Out[6]:
Large values Medium values Small values
0 3.679153 3.634937 2.971892
1 2.376821 1.328791 0.880216
2 1.231338 3.196894 3.029988

Min-Max Scaling

Scales data to a given range (specified with the feature_range parameter).

In [7]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,10))

pd.DataFrame(scaler.fit_transform(df),columns=df.columns)
Out[7]:
Large values Medium values Small values
0 10.000000 10.000000 9.729757
1 4.679615 0.000000 0.000000
2 0.000000 8.100544 10.000000
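
The same result can be produced by hand with the min-max formula, stretched to the (0, 10) range used above:

# Manual min-max scaling: (x - min) / (max - min), scaled to the target range
low, high = 0, 10
(df - df.min()) / (df.max() - df.min()) * (high - low) + low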

Normalizer

Normalizes each sample (row) to unit norm, using l1, l2, or ‘max’ normalization.

L1 Norm – sum of absolute values (least absolute errors)
L2 Norm – Euclidean norm (least squares)
Max – normalized by the maximum absolute value in each row

In [8]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')

pd.DataFrame(normalizer.fit_transform(df),columns=df.columns)
Out[8]:
Large values Medium values Small values
0 0.996435 0.084043 0.007283
1 0.998857 0.047673 0.003347
2 0.976076 0.216342 0.021734
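
Note that, unlike the scalers above, Normalizer works row-wise. The l2 result can be reproduced by dividing each row by its Euclidean norm:

# Manual l2 normalization: divide each row by its Euclidean norm
df.div(np.sqrt((df ** 2).sum(axis=1)), axis=0)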

MaxAbs Scaler

Scales each feature by its maximum absolute value (column-wise).

In [9]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()

pd.DataFrame(scaler.fit_transform(df),columns=df.columns)
Out[9]:
Large values Medium values Small values
0 1.000000 1.000000 0.980826
1 0.646024 0.365561 0.290502
2 0.334680 0.879491 1.000000
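
Equivalently, dividing each column by its maximum absolute value reproduces this output:

# Manual max-abs scaling: divide each column by its maximum absolute value
df / df.abs().max()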

Robust Scaler

Centers features on the median and scales them according to the specified quantile range (the interquartile range by default), making it ‘robust’ to outliers.

Parameters:

  • with_centering=True
  • with_scaling=True
  • quantile_range=(25.0, 75.0)
  • copy=True
In [10]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(quantile_range=(20,80))

pd.DataFrame(scaler.fit_transform(df),columns=df.columns)
Out[10]:
Large values Medium values Small values
0 0.886731 0.316576 0.000000
1 0.000000 -1.350091 -1.621626
2 -0.779936 0.000000 0.045041
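
A rough manual equivalent, centering each column on its median and scaling by the same (20, 80) quantile range:

# Manual robust scaling: (x - median) / (q_high - q_low), per column
q_low, q_high = df.quantile(0.20), df.quantile(0.80)
(df - df.median()) / (q_high - q_low)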

Scaling without initializing an estimator

If calling a function is preferable to initializing an estimator (as done above), sklearn provides equivalent functions which can be called directly from the preprocessing module. Note that these calls use default parameters unless otherwise specified, so robust_scale below (with its default quantile_range of (25.0, 75.0)) produces different values than the RobustScaler example above, which used (20, 80).

In [11]:
from sklearn import preprocessing

pd.DataFrame(preprocessing.scale(df),columns=df.columns)
Out[11]:
Large values Medium values Small values
0 1.250049 0.914729 0.677860
1 -0.052283 -1.391416 -1.413816
2 -1.197766 0.476687 0.735956
In [12]:
pd.DataFrame(preprocessing.normalize(df,norm='l2'),columns=df.columns)
Out[12]:
Large values Medium values Small values
0 0.996435 0.084043 0.007283
1 0.998857 0.047673 0.003347
2 0.976076 0.216342 0.021734
In [13]:
pd.DataFrame(preprocessing.robust_scale(df),columns=df.columns)
Out[13]:
Large values Medium values Small values
0 1.064077 0.379891 0.000000
1 0.000000 -1.620109 -1.945951
2 -0.935923 0.000000 0.054049
In [14]:
pd.DataFrame(preprocessing.maxabs_scale(df),columns=df.columns)
Out[14]:
Large values Medium values Small values
0 1.000000 1.000000 0.980826
1 0.646024 0.365561 0.290502
2 0.334680 0.879491 1.000000
In [15]:
pd.DataFrame(preprocessing.minmax_scale(df,feature_range=(0,2)),columns=df.columns)
Out[15]:
Large values Medium values Small values
0 2.000000 2.000000 1.945951
1 0.935923 0.000000 0.000000
2 0.000000 1.620109 2.000000