K-Means Clustering
Iris Dataset

Before we implement the K-means algorithm, let’s find a dataset. The sklearn package ships with several built-in datasets and sample images. One of them is the Iris dataset.

The Iris dataset consists of measurements of sepals and petals of 3 different plant species:

  • Iris setosa
  • Iris versicolor
  • Iris virginica

The sepal is the part that encases and protects the flower when it is in the bud stage. A petal is a leaflike part that is often colorful.

From the sklearn library, import the datasets module:

from sklearn import datasets

To load the Iris dataset:

iris = datasets.load_iris()

The Iris dataset looks like:

[[ 5.1  3.5  1.4  0.2 ]
 [ 4.9  3.   1.4  0.2 ]
 [ 4.7  3.2  1.3  0.2 ]
 [ 4.6  3.1  1.5  0.2 ]
 . . .
 [ 5.9  3.   5.1  1.8 ]]

We call each piece of data a sample. For example, each flower is one sample.

Each characteristic we are interested in is a feature. For example, petal length is a feature of this dataset.

The features of the dataset are:

  • Column 0: Sepal length
  • Column 1: Sepal width
  • Column 2: Petal length
  • Column 3: Petal width
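As a minimal sketch of how samples and features map onto the array, the snippet below checks the shape of the data and slices out a single feature column. The variable names n_samples, n_features, and petal_length are illustrative choices, not part of the lesson.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Rows are samples (flowers), columns are features (measurements).
n_samples, n_features = iris.data.shape
print(n_samples, n_features)  # 150 4

# Feature names, listed in the same order as the columns of iris.data.
print(iris.feature_names)

# Pull out one feature for every sample: column 2 is petal length.
petal_length = iris.data[:, 2]
print(petal_length[:4])  # [1.4 1.4 1.3 1.5]
```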

The 3 species of Iris plants are what we are going to cluster later in this lesson.



Import the datasets module and load the Iris data.


Every dataset from sklearn comes with several pieces of information (not just the data), and all of them are stored in a similar fashion.
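One way to see what is stored alongside the data is to list the keys of the loaded object, which behaves like a dictionary. This is only a sketch, and the exact set of keys may vary between sklearn versions.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The loaded object is dictionary-like; its keys name the stored pieces.
print(iris.keys())

# Each piece can be reached either as a key or as an attribute.
same_object = iris["data"] is iris.data
print(same_object)  # True
```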

First, let’s take a look at the most important thing, the sample data:


Each row is a plant!
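A short sketch of this step, assuming the data was loaded into a variable named iris as shown earlier; first_plant is a name introduced here for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# iris.data is a NumPy array with one row per plant.
print(iris.data[:4])

# A single row holds the four measurements of one plant.
first_plant = iris.data[0]
print(first_plant)  # [5.1 3.5 1.4 0.2]
```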


Because the datasets in sklearn.datasets are meant for practice, they come with the answers (target values) under the target key.

Take a look at the target values:


The iris.target values give the ground truth for the Iris dataset. Ground truth, in this case, is the integer label identifying the species of each flower, which is what we are trying to learn.
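The sketch below prints the target values and relates each integer label to a species name. Using np.bincount to count the labels is an addition made here for illustration, not a step from the lesson.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# One integer label per sample, in the same order as the rows of iris.data.
print(iris.target)

# target_names maps each integer label (0, 1, 2) to a species name.
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Count how many samples carry each label.
counts = np.bincount(iris.target)
print(counts)  # [50 50 50]
```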


It is always a good idea to read the descriptions of the data:
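A minimal sketch of reading the description, assuming the dataset is loaded as iris; the text is stored as a plain string in the DESCR attribute.

```python
from sklearn import datasets

iris = datasets.load_iris()

# DESCR holds a free-text description of the dataset.
description = iris.DESCR
print(description)
```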


Use the printed description to answer:

  • When was the Iris dataset published?
  • What is the unit of measurement?