K-Means Clustering
Evaluation

At this point, we have clustered the Iris data into 3 different groups (using both plain Python and scikit-learn). But do the clusters correspond to the actual species? Let’s find out!

First, remember that the Iris dataset comes with target values:

target = iris.target

It looks like:

[ 0 0 0 0 0 ... 2 2 2]

According to the metadata:

  • All the 0's are Iris-setosa
  • All the 1's are Iris-versicolor
  • All the 2's are Iris-virginica

Let’s change these values into the corresponding species using the following code:

species = np.chararray(target.shape, itemsize=150)

for i in range(len(samples)):
    if target[i] == 0:
        species[i] = 'setosa'
    elif target[i] == 1:
        species[i] = 'versicolor'
    elif target[i] == 2:
        species[i] = 'virginica'
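
The loop above can also be written as a single NumPy indexing step. The sketch below is an alternative, not the lesson's method: it uses a short hand-written `target` array as a stand-in for `iris.target`, and `names` is a hypothetical lookup array introduced only for illustration.

```python
import numpy as np

# Stand-in for iris.target: a few integer class codes (assumed sample values).
target = np.array([0, 0, 1, 2, 2])

# Hypothetical lookup array: position k holds the species name for code k.
names = np.array(['setosa', 'versicolor', 'virginica'])

# Indexing with an integer array picks one name per sample.
species = names[target]

print(species)
```

Because the lookup happens in one step, there is no need to pre-allocate the array or choose an `itemsize`.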

Then we are going to use the Pandas library to perform a cross-tabulation.

A cross-tabulation counts how often each combination of two categorical variables occurs. Here, it shows how many samples of each species were assigned to each cluster label, a relationship that is hard to see from the raw arrays alone.
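
To make this concrete, here is a minimal sketch of `pd.crosstab()` on a tiny made-up dataset. The six `labels` and `species` values are invented for illustration and are not the Iris results.

```python
import pandas as pd

# Toy data (made up for illustration): a cluster label and a species per sample.
df = pd.DataFrame({
    'labels':  [0, 0, 1, 1, 1, 2],
    'species': ['setosa', 'setosa', 'virginica',
                'virginica', 'versicolor', 'versicolor'],
})

# Rows are the unique cluster labels, columns are the unique species,
# and each cell counts the samples with that (label, species) pair.
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

Pairs that never occur, such as label 2 with setosa in this toy data, appear in the table as 0.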

The result should look something like:

labels  setosa  versicolor  virginica
0           50           0          0
1            0           2         36
2            0          48         14

The first column has the cluster labels. The second to fourth columns show how many samples of each Iris species were assigned to each label.

By looking at this, you can conclude that:

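  • Iris-setosa was clustered with 100% accuracy: all 50 samples are in cluster 0.
  • Iris-versicolor was clustered with 96% accuracy: 48 of 50 samples are in cluster 2.
  • Iris-virginica didn’t do so well: only 36 of 50 samples (72%) are in cluster 1, and the other 14 were grouped with versicolor in cluster 2.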
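
The arithmetic behind these percentages can be sketched in a few lines of plain Python. This assumes that "accuracy" means the share of a species that landed in its most common cluster, which matches the 100% and 96% figures above. The `counts` and `shares` names are illustrative, and the numbers are copied from the table in the narrative.

```python
# counts[cluster_label][species] = number of samples, taken from the table above.
counts = {
    0: {'setosa': 50, 'versicolor': 0,  'virginica': 0},
    1: {'setosa': 0,  'versicolor': 2,  'virginica': 36},
    2: {'setosa': 0,  'versicolor': 48, 'virginica': 14},
}

shares = {}
for name in ['setosa', 'versicolor', 'virginica']:
    # How many samples of this species ended up in each cluster.
    per_cluster = [row[name] for row in counts.values()]
    # Percentage of the species found in its most common cluster.
    shares[name] = 100 * max(per_cluster) / sum(per_cluster)
    print(f'{name}: {shares[name]:.0f}%')
```

Note that the cluster numbers themselves are arbitrary, so only the grouping matters, not which label a species receives.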

Follow the instructions below to learn how to do a cross-tabulation.

Instructions

1.

pandas is already imported for you:

import pandas as pd

Add the code from the narrative to get the species array and finish the elif statements:

species = np.chararray(target.shape, itemsize=150)

for i in range(len(samples)):
    if target[i] == 0:
        species[i] = 'setosa'
    # finish elif
    # finish elif

2.

Then, below the for loop, create a DataFrame that pairs each cluster label with its species, and print it:

df = pd.DataFrame({'labels': labels, 'species': species})
print(df)

3.

Next, use the pandas crosstab() function to perform cross-tabulation:

ct = pd.crosstab(df['labels'], df['species'])
print(ct)

How accurate are the clusters?
