This tutorial walks you through submitting a .csv file of predictions to Kaggle for the first time.

Scoring and challenges:
If you simply run the code below, your score will be fairly poor. I have intentionally left lots of room for improvement regarding the model used (currently a simple decision tree classifier).

The idea of this tutorial is to get you started and let you decide how to improve your score. At the bottom of the tutorial are challenges that, if you complete them, should significantly improve your score.

Running the code:
I recommend running the code in a Jupyter Notebook. If you don’t know what Jupyter Notebook is, check out this tutorial, which will help you install it and start using it.

You can also download this entire page as a notebook from this folder on my GitHub.


Steps to complete this tutorial:

  1. Create a Kaggle account (https://www.kaggle.com/)
  2. Download Titanic dataset (https://www.kaggle.com/c/titanic/data)
    a. Download ‘train.csv’ and ‘test.csv’
    b. Place them in the same directory as this notebook
  3. Run every cell in this notebook (except the visualization cells)
  4. Submit CSV containing the predictions (a preview sketch of this step appears just after this list)
  5. Try to improve the prediction by using the challenge prompts that suit your level
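
For orientation, here is a minimal preview sketch of what step 4 typically looks like once you have a fitted classifier. The variable names (clf for the classifier, 'submission.csv' for the output file) are illustrative choices; PassengerId and Survived are the two columns Kaggle’s Titanic competition expects.

#Preview sketch of the submission step, assuming clf has been fit as in section 2
predictions = clf.predict(test[features])

submission = pd.DataFrame({'PassengerId': test['PassengerId'],
                           'Survived': predictions})

#Kaggle expects exactly the PassengerId and Survived columns, with no index column
submission.to_csv('submission.csv', index=False)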

1. Process the data

Load data

In [2]:
#Load data
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

#Drop features we are not going to use
train = train.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)
test = test.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)

#Look at the first 3 rows of our training data
train.head(3)
Out[2]:
   PassengerId  Survived  Pclass     Sex   Age
0            1         0       3    male  22.0
1            2         1       1  female  38.0
2            3         1       3  female  26.0

Our data has the following columns:

  • PassengerId – Each passenger’s id
  • Survived – Whether the passenger survived or not (1 – yes, 0 – no)
  • Pclass – The passenger class (1st class – 1, 2nd class – 2, 3rd class – 3)
  • Sex – Each passenger’s sex
  • Age – Each passenger’s age
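
Before preparing the data, it can help to check which columns contain missing values; the Age fill in the next cell exists because Age is incomplete. This quick check is optional and not part of the original steps:

#Count missing values in each column of the training data
train.isnull().sum()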

Prepare the data to be read by our algorithm

In [3]:
#Convert ['male','female'] to [1,0] so that our decision tree can be built
for df in [train,test]:
    df['Sex_binary'] = df['Sex'].map({'male': 1, 'female': 0})
    
#Fill in missing age values with 0 (presuming they are a baby if they do not have a listed age)
train['Age'] = train['Age'].fillna(0)
test['Age'] = test['Age'].fillna(0)

#Select feature column names and target variable we are going to use for training
features = ['Pclass','Age','Sex_binary']
target = 'Survived'

#Look at the first 3 rows (we have over 800 total rows) of our training data.
#This is the input our classifier will use to learn.
train[features].head(3)
Out[3]:
   Pclass   Age  Sex_binary
0       3  22.0           1
1       1  38.0           0
2       3  26.0           0
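
Filling missing ages with 0 is the simplest option, but it treats every passenger with an unknown age as a newborn. When you reach the challenge stage, one common alternative is median imputation; a minimal sketch (if you use it, replace the fillna(0) lines above rather than running both):

#Alternative: fill missing ages with the median age from the training set
median_age = train['Age'].median()
train['Age'] = train['Age'].fillna(median_age)
test['Age'] = test['Age'].fillna(median_age)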

Let’s look at the first 3 corresponding target variables. This is the record of whether each passenger survived (i.e. the first passenger, a 22-year-old male, did not survive, but the second passenger, a 38-year-old female, did survive).

Our classifier will use this to know what the output should be for each of the training instances.

In [4]:
#Display first 3 target variables
train[target].head(3).values
Out[4]:
array([0, 1, 1], dtype=int64)

2. Create and fit the decision tree

This tree is definitely going to overfit our data. When you get to the challenge stage, you can return here and tune hyperparameters in this cell. For example, you could reduce the maximum depth of the tree to 3 by setting max_depth=3 with the following command:

clf = DecisionTreeClassifier(max_depth=3)

To change multiple hyperparameters, separate the parameters with commas. For example, to change both the maximum depth and the minimum number of samples per leaf, fill in the parentheses with the following:

clf = DecisionTreeClassifier(max_depth=3,min_samples_leaf=2)

The other hyperparameters are listed in the output of the fitted classifier below. You can also access the full list by reading the documentation for decision tree classifiers, or by placing your cursor between the parentheses and pressing Shift-Tab.
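
If you would rather search over several combinations than guess one at a time, scikit-learn’s GridSearchCV will try every combination in a grid with cross-validation and keep the best one. This is optional, and the particular grid values below are illustrative rather than tuned:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

#Candidate hyperparameter values to try (illustrative choices, not tuned)
param_grid = {'max_depth': [3, 5, 7, None],
              'min_samples_leaf': [1, 2, 5]}

#Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(train[features], train[target])

print(search.best_params_)
print(search.best_score_)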

In [5]:
from sklearn.tree import DecisionTreeClassifier

#Create classifier object with default hyperparameters
clf = DecisionTreeClassifier()  

#Fit our classifier using the training features and the training target values
clf.fit(train[features],train[target]) 
Out[5]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
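
The fit above uses every training row, so accuracy measured on the training data will look unrealistically good. For a rough estimate of how the model might do on unseen data (and therefore roughly what to expect on the leaderboard), you can cross-validate. This cell is optional:

from sklearn.model_selection import cross_val_score

#Mean accuracy across 5 train/validation splits; a rough proxy for unseen data
scores = cross_val_score(clf, train[features], train[target], cv=5)
print(scores.mean())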

Visualize default tree (optional)

This is not a necessary step, but it shows you how complex the tree is when you don’t restrict it.

In [6]:
#Create decision tree ".dot" file

from sklearn.tree import export_graphviz
#class_names must be given in ascending order of the class labels (0 = did not survive, 1 = survived)
export_graphviz(clf, out_file='titanic_tree.dot', feature_names=features, rounded=True, filled=True,
                class_names=['Did not survive', 'Survived'])

Note: if you want to generate the tree png, you need to open a terminal (or command prompt) after running the cell above. Navigate to the directory where you have this notebook and then type the following command.

dot -Tpng titanic_tree.dot -o titanic_tree.png
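
If you do not have the Graphviz command-line tools installed, newer versions of scikit-learn (0.21 and later) include a plot_tree function that renders the tree with matplotlib instead; a minimal sketch:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree  #requires scikit-learn >= 0.21

#Render the fitted tree directly in the notebook; no Graphviz needed
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(clf, feature_names=features, class_names=['Did not survive', 'Survived'],
          filled=True, rounded=True, ax=ax)
plt.show()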

In [7]:
#Display decision tree

#Orange on a node or leaf means the tree thinks the person did not survive
#Blue on a node or leaf means the tree thinks the person did survive

#In Chrome, to zoom in press control +. To zoom out, press control -. If you are on a Mac, use Command.

from IPython.display import Image, display
display(Image('titanic_tree.png', width=1900, unconfined=True))