Scikit-learn ships with several small preloaded datasets for practicing data manipulation, regression, and classification. The available datasets and the commands used to load them are listed below.

Classification datasets:

  • iris (4 features – set of measurements of flowers – 3 possible flower species)
  • breast_cancer (features describing malignant and benign cell nuclei)
  • digits (hand-written digits stored as arrays of 64 numeric values representing 8×8 grayscale images)
  • wine (13 numeric features – 3 possible wine classes)

Regression datasets:

  • boston (13 numeric/categorical features – predict housing prices in Boston; deprecated in scikit-learn 1.0 and removed in 1.2)
  • diabetes (10 numeric features – used to predict disease progression)

Multivariate regression:

  • linnerud (3 numeric features – physical exercises – 3 numeric targets: weight, waist, pulse)

Loading a dataset (replace name with one of the dataset names above):

from sklearn.datasets import load_name

name = load_name()

In [1]:
from sklearn.datasets import load_iris

iris = load_iris()
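
When only the arrays are needed, the loaders also accept a return_X_y argument that returns the feature matrix and the target vector as a tuple instead of the container object. A minimal sketch, assuming a reasonably recent scikit-learn; the names X and y are just a common convention:

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the container object and returns (data, target)
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```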

Accessing dataset:

To see these options, type iris. and press Tab after loading the dataset.

Select any of the following:

iris.data
iris.DESCR
iris.feature_names
iris.target
iris.target_names
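
The object returned by the loader behaves like a dictionary whose keys are also exposed as attributes, so the same options can be listed without tab completion. A short sketch (the exact set of keys may differ between scikit-learn versions):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# List the available entries; the exact keys depend on the installed version
print(sorted(iris.keys()))

# Key access and attribute access return the same underlying array
print(iris["data"] is iris.data)  # True
```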

Examining dataset

In [2]:
import pandas as pd
In [3]:
iris_features_df = pd.DataFrame(data=iris.data,
                               columns=iris.feature_names)

iris_features_df.head(2)
Out[3]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
In [4]:
iris_target_df = pd.DataFrame(data=iris.target,
                               columns=["Species"])

iris_target_df.head(2)
Out[4]:
Species
0 0
1 0
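
As an alternative to building the DataFrames by hand, newer scikit-learn releases (0.23 and later, with pandas installed) accept as_frame=True. The sketch below assumes such a version; the variable names are illustrative only:

```python
from sklearn.datasets import load_iris

# as_frame=True asks the loader to return pandas objects
iris_frame = load_iris(as_frame=True)

# .frame holds the features and the target together in one DataFrame
combined_df = iris_frame.frame
print(combined_df.shape)  # (150, 5)
print(combined_df.columns[-1])  # target
```
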
In [5]:
list(iris.target_names)  # 0 - setosa, 1 - versicolor, 2 - virginica
Out[5]:
['setosa', 'versicolor', 'virginica']
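
Since target_names is a NumPy array, indexing it with the integer labels yields one species name per row. The following sketch uses this to add a readable Species column to the features DataFrame; iris_df is a name chosen here for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Index the array of names with the integer labels: 0 -> setosa, and so on
iris_df["Species"] = iris.target_names[iris.target]

print(iris_df["Species"].value_counts())
```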

Printing description of dataset

In [6]:
print(iris.DESCR)
Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Feature Names

Note: digits has no feature_names in the scikit-learn version used here, so it is left out of the loop below.

In [7]:
# load_boston was removed in scikit-learn 1.2; this import needs an older version
from sklearn.datasets import (load_breast_cancer, load_boston, load_diabetes,
                              load_linnerud, load_digits)
In [8]:
datasets = {'Iris': load_iris(), 'Breast Cancer': load_breast_cancer(), 'Boston': load_boston(),
            'Diabetes': load_diabetes(), 'Linnerud': load_linnerud()}

for name, dataset in datasets.items():
    print("\n** {} **".format(name))
    print(dataset.feature_names)
** Iris **
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

** Breast Cancer **
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

** Boston **
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

** Diabetes **
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

** Linnerud **
['Chins', 'Situps', 'Jumps']

Target Names

In [9]:
datasets = {'Iris': load_iris(), 'Breast Cancer': load_breast_cancer(),
            'Digits': load_digits(), 'Linnerud': load_linnerud()}

for name, dataset in datasets.items():
    print("\n** {} **".format(name))
    print(dataset.target_names)
** Iris **
['setosa' 'versicolor' 'virginica']

** Breast Cancer **
['malignant' 'benign']

** Digits **
[0 1 2 3 4 5 6 7 8 9]

** Linnerud **
['Weight', 'Waist', 'Pulse']
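
Not every dataset defines every attribute, which is why the loops above use different sets of datasets. One way to loop over any mix of datasets is to fall back to a default when an attribute is missing. This is a sketch, not the only approach; the loaders and summary dictionaries are names introduced here:

```python
from sklearn.datasets import load_iris, load_digits, load_diabetes

# Store the loader functions so each dataset is loaded only when needed
loaders = {"Iris": load_iris, "Digits": load_digits, "Diabetes": load_diabetes}

summary = {}
for name, loader in loaders.items():
    bunch = loader()
    # getattr with a default returns None instead of raising AttributeError
    summary[name] = getattr(bunch, "target_names", None)
    print("\n** {} **".format(name))
    print(summary[name])
```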

Shapes

In [10]:
datasets = {'Iris': load_iris(), 'Breast Cancer': load_breast_cancer(), 'Boston': load_boston(),
            'Digits': load_digits(), 'Diabetes': load_diabetes(), 'Linnerud': load_linnerud()}

for name, dataset in datasets.items():
    print("\n** {} **".format(name))
    print(dataset.data.shape)
** Iris **
(150, 4)

** Breast Cancer **
(569, 30)

** Boston **
(506, 13)

** Digits **
(1797, 64)

** Diabetes **
(442, 10)

** Linnerud **
(20, 3)
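
The target arrays have shapes of their own, and comparing them with the data shapes shows the difference between a single-output dataset and the multivariate linnerud dataset. A brief sketch using two of the datasets above; the shapes dictionary is a name introduced here:

```python
from sklearn.datasets import load_iris, load_linnerud

datasets = {"Iris": load_iris(), "Linnerud": load_linnerud()}

shapes = {}
for name, dataset in datasets.items():
    # Record (data shape, target shape) for each dataset
    shapes[name] = (dataset.data.shape, dataset.target.shape)
    print(name, dataset.data.shape, dataset.target.shape)

# Iris (150, 4) (150,)
# Linnerud (20, 3) (20, 3)
```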