Problem:

Sometimes our datasets have missing values.

Most machine learning algorithms cannot handle missing values directly.

Solutions

Solution 1:
Drop each feature which contains missing values (drop the column)

Solution 2:
Drop each entry which contains missing values (drop the row)

Solution 3:
Imputation (fill in the missing values)
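
As a rough illustration, the three options can be sketched on a small made-up DataFrame. The name `toy` and its columns are hypothetical and are not part of the housing data used later:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "a" is complete, "b" has one missing value.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})

# Solution 1: drop every column that contains a missing value.
dropped_cols = toy.dropna(axis=1)

# Solution 2: drop every row that contains a missing value.
dropped_rows = toy.dropna(axis=0)

# Solution 3: impute, here by filling each column with its own mean.
imputed = toy.fillna(toy.mean())

print(dropped_cols.shape, dropped_rows.shape, imputed.shape)
```

Dropping loses either a whole feature or a whole observation, while imputation keeps the original shape of the data.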

Imputation:

Deal with missing data points by substituting new values.

Common strategy: replace each missing value in a feature with the mean or median of the feature (numerical features) or with its mode (categorical features).

Mean, Median, Mode Refresher

Mean:
Numerical average – the mean of [1,2,3,4] is (1+2+3+4)/4 = 2.5.

Median:
The middle value – the median of [1,3,10] is 3.

Mode:
Most frequent value – the mode of [1,3,3] is 3.
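
The refresher examples above can be checked with pandas. This is only a small sketch; the variable names are chosen here for illustration:

```python
import pandas as pd

# Mean: sum of the values divided by how many there are.
mean_value = pd.Series([1, 2, 3, 4]).mean()

# Median: the middle value of the sorted data.
median_value = pd.Series([1, 3, 10]).median()

# Mode: mode() returns a Series because several values can tie,
# so take the first entry.
mode_value = pd.Series([1, 3, 3]).mode()[0]

# With an even number of values, the median is the average
# of the two middle values.
even_median = pd.Series([1, 2, 3, 4]).median()

print(mean_value, median_value, mode_value, even_median)
```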


1. Get the data

Import pandas

In [1]:
import pandas as pd

Read data

In [2]:
df = pd.read_csv('train.csv')

Create subset of the data to work with

  • LotFrontage: Linear feet of street connected to property
  • FireplaceQu: Fireplace quality
  • GarageYrBlt: Year garage was built
  • BsmtCond: General condition of the basement
In [3]:
housing = df[['LotFrontage','FireplaceQu','GarageYrBlt','BsmtCond']].copy()
In [4]:
housing.head()
Out[4]:
   LotFrontage FireplaceQu  GarageYrBlt BsmtCond
0         65.0         NaN       2003.0       TA
1         80.0          TA       1976.0       TA
2         68.0          TA       2001.0       TA
3         60.0          Gd       1998.0       Gd
4         84.0          TA       2000.0       TA

2. Explore the missing data

Examine missing data

In [5]:
housing.isnull().sum()
Out[5]:
LotFrontage    259
FireplaceQu    690
GarageYrBlt     81
BsmtCond        37
dtype: int64
In [6]:
housing.isnull().sum()/len(housing)
Out[6]:
LotFrontage    0.177397
FireplaceQu    0.472603
GarageYrBlt    0.055479
BsmtCond       0.025342
dtype: float64
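
The fraction of missing values per column can also be computed in one step, because the mean of a boolean column is the share of `True` entries. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in both columns.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": ["a", "b", None, "d"],
})

# isnull() gives True/False per cell; the column mean is the missing fraction.
missing_fraction = toy.isnull().mean()

# Same result as dividing the missing counts by the number of rows.
same_result = toy.isnull().sum() / len(toy)

print(missing_fraction)
```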

Drop columns with more than 25% missing data (here only FireplaceQu, at about 47%)

In [7]:
housing.drop('FireplaceQu',inplace=True,axis=1)
In [8]:
housing.head(2)
Out[8]:
   LotFrontage  GarageYrBlt BsmtCond
0         65.0       2003.0       TA
1         80.0       1976.0       TA
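
Rather than naming the column by hand, the 25% rule can be applied programmatically. The following is a sketch on made-up data; `toy`, `threshold`, and the column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 75% missing
    "some_missing": [1.0, 2.0, 3.0, np.nan],          # 25% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

threshold = 0.25  # drop columns with MORE than this fraction missing

missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index
reduced = toy.drop(columns=cols_to_drop)

print(list(reduced.columns))
```

A column with exactly 25% missing values is kept, because the comparison is strictly greater than the threshold.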

3. Impute substitute values

Strategy 1: Impute Mean

In [9]:
garage_yr_mean = housing['GarageYrBlt'].mean()

garage_yr_mean
Out[9]:
1978.5061638868744
In [10]:
# Assign back to the column so housing itself is updated
housing['GarageYrBlt'] = housing['GarageYrBlt'].fillna(garage_yr_mean)

Strategy 2: Impute Median

In [11]:
frontage_median = housing['LotFrontage'].median()

frontage_median
Out[11]:
69.0
In [12]:
# Assign back to the column so housing itself is updated
housing['LotFrontage'] = housing['LotFrontage'].fillna(frontage_median)

Strategy 3: Impute Mode

In [13]:
housing['BsmtCond'].value_counts()
Out[13]:
TA    1311
Gd      65
Fa      45
Po       2
Name: BsmtCond, dtype: int64
In [14]:
bsmt_cond_mode = housing['BsmtCond'].value_counts().index[0]

bsmt_cond_mode
Out[14]:
'TA'
In [15]:
# Assign back to the column so housing itself is updated
housing['BsmtCond'] = housing['BsmtCond'].fillna(bsmt_cond_mode)
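
As an alternative to `value_counts().index[0]`, pandas also offers `Series.mode()`. A small sketch on a made-up Series (`cond` is hypothetical, not the real BsmtCond column):

```python
import pandas as pd

# Hypothetical categorical data with one missing entry.
cond = pd.Series(["TA", "Gd", "TA", None, "Fa", "TA"])

# mode() ignores missing values by default and returns a Series,
# since several values may tie; take the first entry.
cond_mode = cond.mode()[0]

# fillna returns a new Series with the gap filled.
cond_filled = cond.fillna(cond_mode)

print(cond_mode)
```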

Check for missing data

In [16]:
housing.isnull().sum()
Out[16]:
LotFrontage    0
GarageYrBlt    0
BsmtCond       0
dtype: int64
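
The three strategies can also be applied in a single call by passing `fillna` a dictionary that maps column names to fill values. This is a sketch on made-up data; `toy` and its column names are assumptions, not the housing columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numerical columns and one categorical column.
toy = pd.DataFrame({
    "year": [2000.0, np.nan, 2010.0],
    "frontage": [60.0, 70.0, np.nan],
    "cond": ["TA", None, "TA"],
})

# One fill value per column: mean, median, and mode respectively.
fill_values = {
    "year": toy["year"].mean(),
    "frontage": toy["frontage"].median(),
    "cond": toy["cond"].mode()[0],
}

# fillna with a dict fills each column with its own value.
filled = toy.fillna(fill_values)

print(filled.isnull().sum())
```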

Scikit-Learn Application:

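Scikit-learn also provides imputation functionality. The link below documents the Imputer class from the preprocessing module of older releases; since version 0.20 the same functionality is provided by SimpleImputer in the sklearn.impute module: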
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
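
A minimal sketch of median imputation with `SimpleImputer`, assuming scikit-learn 0.20 or later is installed. The array `X` is made-up toy data, not the housing data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numerical features, one gap in each.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# strategy can be "mean", "median", "most_frequent", or "constant".
imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# fit learns each column's median; transform fills the gaps with it.
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)
print(X_filled)
```

One advantage of this approach is that the fitted imputer stores the fill values in `statistics_`, so the same values can later be applied to new data with `imputer.transform`.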
