Iris Dataset Analysis - Plotting it. | Data Science With Julia


The Iris flower

I happened to go through the Iris data set with Julia and thought, why not blog about it here? There is a Julia package called RDatasets which provides nicely cleaned datasets for people starting out in machine learning. You can install it and start using it in Jupyter as shown:

using Pkg
Pkg.add("RDatasets")
using RDatasets

[32m[1m   Updating[22m[39m registry at `~/.julia/registries/General`
######################################################################### 100.0%
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`

Once done, let's look at the available data sets:

RDatasets.datasets()

763 rows × 5 columns

	Package	Dataset	Title	Rows	Columns
	String	String	String	Int64	Int64
1	COUNT	affairs	affairs	601	18
2	COUNT	azdrg112	azdrg112	1798	4
3	COUNT	azpro	azpro	3589	6
4	COUNT	badhealth	badhealth	1127	3
5	COUNT	fasttrakg	fasttrakg	15	9
6	COUNT	lbw	lbw	189	10
7	COUNT	lbwgrp	lbwgrp	6	7
8	COUNT	loomis	loomis	410	11
9	COUNT	mdvis	mdvis	2227	13
10	COUNT	medpar	medpar	1495	10
11	COUNT	rwm	rwm	27326	4
12	COUNT	rwm5yr	rwm5yr	19609	17
13	COUNT	ships	ships	40	7
14	COUNT	titanic	titanic	1316	4
15	COUNT	titanicgrp	titanicgrp	12	5
16	Ecdat	Accident	Ship Accidents	40	5
17	Ecdat	Airline	Cost for U.S. Airlines	90	6
18	Ecdat	Airq	Air Quality for Californian Metropolitan Areas	30	6
19	Ecdat	Benefits	Unemployement of Blue Collar Workers	4877	18
20	Ecdat	Bids	Bids Received By U.S. Firms	126	12
21	Ecdat	BudgetFood	Budget Share of Food for Spanish Households	23972	6
22	Ecdat	BudgetItaly	Budget Shares for Italian Households	1729	11
23	Ecdat	BudgetUK	Budget Shares of British Households	1519	10
24	Ecdat	Bwages	Wages in Belgium	1472	4
25	Ecdat	CPSch3	Earnings from the Current Population Survey	11130	3
26	Ecdat	Capm	Stock Market Data	516	5
27	Ecdat	Car	Stated Preferences for Car Choice	4654	70
28	Ecdat	Caschool	The California Test Score Data Set	420	17
29	Ecdat	Catsup	Choice of Brand for Catsup	2798	14
30	Ecdat	Cigar	Cigarette Consumption	1380	9
⋮	⋮	⋮	⋮	⋮	⋮

It’s a lot of data, so Jupyter truncates the output. Now let's load the iris data set into a variable called iris using the following command:

iris = dataset("datasets", "iris")

As you can see it displays the first 30 rows:

150 rows × 5 columns

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
11	5.4	3.7	1.5	0.2	setosa
12	4.8	3.4	1.6	0.2	setosa
13	4.8	3.0	1.4	0.1	setosa
14	4.3	3.0	1.1	0.1	setosa
15	5.8	4.0	1.2	0.2	setosa
16	5.7	4.4	1.5	0.4	setosa
17	5.4	3.9	1.3	0.4	setosa
18	5.1	3.5	1.4	0.3	setosa
19	5.7	3.8	1.7	0.3	setosa
20	5.1	3.8	1.5	0.3	setosa
21	5.4	3.4	1.7	0.2	setosa
22	5.1	3.7	1.5	0.4	setosa
23	4.6	3.6	1.0	0.2	setosa
24	5.1	3.3	1.7	0.5	setosa
25	4.8	3.4	1.9	0.2	setosa
26	5.0	3.0	1.6	0.2	setosa
27	5.0	3.4	1.6	0.4	setosa
28	5.2	3.5	1.5	0.2	setosa
29	5.2	3.4	1.4	0.2	setosa
30	4.7	3.2	1.6	0.2	setosa
⋮	⋮	⋮	⋮	⋮	⋮

As a data scientist, or an aspiring one, you must know what this data means. You need a clear idea of what an Iris flower is and what the columns mean; if you do not, try reading about it here on Wikipedia. Some people were patient enough to collect different species of iris flower, measure their petal length, sepal length, petal width and sepal width, and catalog the results. Who knows what they had in mind, but it is now benefiting machine learning learners enormously.

Now I am not sure whether I installed Julia’s DataFrames package myself, but iris is a DataFrame. DataFrames.jl is a Julia package for easy viewing and manipulation of tabular data. If the following command gives you an error, be sure to install DataFrames.jl.

Now let’s take a look at the column names of iris.

names(iris)

So all the columns shown below, except for Species, are measurements of the length or width of a flower part. The last column is the name of the species.

5-element Array{String,1}:
 "SepalLength"
 "SepalWidth"
 "PetalLength"
 "PetalWidth"
 "Species"

If you are not sure what petals and sepals are, I hope this image helps:


Parts of a flower

Now let’s take a look at the size of the iris dataset:

size(iris)

It contains 150 rows and 5 columns as shown below:

(150, 5)

Now let’s apply a function called describe() to iris; it gives some summary statistics for each of its columns

describe(iris)

as shown below:

5 rows × 7 columns

	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	SepalLength	5.84333	4.3	5.8	7.9	0	Float64
2	SepalWidth	3.05733	2.0	3.0	4.4	0	Float64
3	PetalLength	3.758	1.0	4.35	6.9	0	Float64
4	PetalWidth	1.19933	0.1	1.3	2.5	0	Float64
5	Species		setosa		virginica	0	CategoricalValue{String,UInt8}

I hope you know the meaning of the statistical terms \(mean\), \(median\), \(min\) and \(max\). Ideally this blog series should explain what they are, maybe in the future; for now please see other sources like Wikipedia for help.
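For a quick feel of what these four numbers are, here is a minimal sketch using only Julia's Statistics standard library. The vector sample is just the first five SepalLength values from the iris table, typed in by hand; the variable names are my own and not part of RDatasets or DataFrames.

```julia
using Statistics  # standard library: provides mean and median

# First five SepalLength values from the iris table, typed in by hand
sample = [5.1, 4.9, 4.7, 4.6, 5.0]

sample_mean   = mean(sample)     # sum of the values divided by their count
sample_median = median(sample)   # middle value once the data is sorted
sample_min    = minimum(sample)  # smallest value
sample_max    = maximum(sample)  # largest value

println("mean = ", sample_mean, ", median = ", sample_median,
        ", min = ", sample_min, ", max = ", sample_max)
```

describe() reports the same kind of numbers, only computed over all 150 rows of each column.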

If we want a fuller description of the iris data set, we need to pass the symbol :all. Once again, this blog series should say what a symbol is in Julia. But for now, passing :all will yield more statistical columns

describe(iris, :all)

as shown below:

5 rows × 13 columns (omitted printing of 4 columns)

	variable	mean	std	min	q25	median	q75	max	nunique
	Symbol	Union…	Union…	Any	Union…	Union…	Union…	Any	Union…
1	SepalLength	5.84333	0.828066	4.3	5.1	5.8	6.4	7.9
2	SepalWidth	3.05733	0.435866	2.0	2.8	3.0	3.3	4.4
3	PetalLength	3.758	1.7653	1.0	1.6	4.35	5.1	6.9
4	PetalWidth	1.19933	0.762238	0.1	0.3	1.3	1.8	2.5
5	Species			setosa				virginica	3

If you are wondering what \(q25\) and \(q75\) are, they are the 25th and 75th percentiles; refer to interquartile range. Maybe this blog series will one day explain it.
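Until then, here is a small sketch of the idea, assuming nothing beyond the Statistics standard library. The sample vector is again the first five SepalLength values typed in by hand, and the variable names are made up for illustration.

```julia
using Statistics  # standard library: provides quantile

# First five SepalLength values from the iris table, typed in by hand
sample = [5.1, 4.9, 4.7, 4.6, 5.0]

q25 = quantile(sample, 0.25)  # roughly a quarter of the data falls below this
q75 = quantile(sample, 0.75)  # roughly three quarters of the data fall below this
iqr = q75 - q25               # interquartile range: spread of the middle half

println("q25 = ", q25, ", q75 = ", q75, ", IQR = ", iqr)
```

A small interquartile range means the middle half of the values sit close together; a large one means they are spread out.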

We can take a look at the first 5 rows of the data frame by using a function called first() as shown below:

first(iris, 5)

5 rows × 5 columns

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa

Similarly, the last few rows can be viewed using a function named last(). Below we view the last five rows:

last(iris, 5)

5 rows × 5 columns

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	6.7	3.0	5.2	2.3	virginica
2	6.3	2.5	5.0	1.9	virginica
3	6.5	3.0	5.2	2.0	virginica
4	6.2	3.4	5.4	2.3	virginica
5	5.9	3.0	5.1	1.8	virginica

Now here is a way to extract just the petal lengths from iris.

iris[!, "PetalLength"]

150-element Array{Float64,1}:
 1.4
 1.4
 1.3
 1.5
 1.4
 1.7
 1.4
 1.5
 1.4
 1.5
 1.5
 1.6
 1.4
 ⋮
 4.8
 5.4
 5.6
 5.1
 5.1
 5.9
 5.7
 5.2
 5.0
 5.2
 5.4
 5.1

So let’s go on to plotting and see if there is a way to separate the species. For this we will use just 2 parameters, PetalLength and SepalLength. I installed Plots and started using it with these commands:

Pkg.add("Plots")
using Plots

[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`

Now all I do is gather SepalLength in a variable sepal_length, PetalLength in petal_length and Species in species. Why on earth did I gather Species?! Anyway. I use the scatter() function to plot it as shown below:

sepal_length = iris[!, "SepalLength"]
petal_length = iris[!, "PetalLength"]
species = iris[!, "Species"]
scatter(petal_length, sepal_length, color = "red")

Scatter plot of petal length (x) against sepal length (y), all points in red

While you were viewing the iris DataFrame, you must have seen only 1 species in your Jupyter notebook. It tends to show only the first 30 rows or so. Actually there are three different species tabulated in this data set. So in order to get the species names we can use this:

iris[!, "Species"]

This just pulls out the Species column. But we do not need 50 setosas followed by 50 of some other one; we just want to view the unique species names. To get the unique values from an array, we could use Set(), which keeps only the unique values. That’s what we do below:

species = Set(iris[!, "Species"])

Set{CategoricalArrays.CategoricalValue{String,UInt8}} with 3 elements:
  CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
  CategoricalArrays.CategoricalValue{String,UInt8} "setosa"
  CategoricalArrays.CategoricalValue{String,UInt8} "virginica"

In the code below, I thought of printing just the species names without all that clunky CategoricalArrays.CategoricalValue{String,UInt8} stuff, but I failed.

species_names = []

for x in species
    push!(species_names, x)
end

species_names

3-element Array{Any,1}:
 CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
 CategoricalArrays.CategoricalValue{String,UInt8} "setosa"
 CategoricalArrays.CategoricalValue{String,UInt8} "virginica"

Once again, my failed attempt to pretty print just a species name without the clunky CategoricalArrays.CategoricalValue{String,UInt8}:

species_names[1]

CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
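One way that might get rid of the clunky type prefix is to convert each value to a plain string before printing. The sketch below uses ordinary strings as a stand-in for the Species column so that it runs without any packages; unique() is a Base function that, unlike Set(), keeps the values in order of first appearance. For the real column I would expect string.(unique(iris[!, "Species"])) to give a plain Vector{String}, but treat that as an assumption to check against your installed CategoricalArrays version.

```julia
# Stand-in for the Species column (hypothetical short sample, not the real data)
species_column = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

# unique() drops repeats and preserves first-appearance order
species_names = unique(species_column)

# string.() converts every element to a String; for plain strings nothing changes
plain_names = string.(species_names)

for name in plain_names
    println(name)  # println prints the bare text, without type information
end
```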

It would be great if we could color our dots depending on the species. That might tell us something. So we need to split the dataset by species. For that we can use the package DataFramesMeta. We install it as shown below:

using Pkg
Pkg.add("DataFramesMeta")

[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Project.toml`
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`

We use it as shown below:

using DataFramesMeta

One of the species names is versicolor. The DataFramesMeta package gives us the ability to filter a DataFrame to our needs, so let’s separate out just the versicolor data.

iris_versicolor = @where(iris, :Species .== "versicolor")
first(iris_versicolor, 5)

In the code above, we use something called @where(). Well, is it a function? Not quite: a name that starts with an @ is a macro in Julia. (As far as I know, newer DataFramesMeta releases call this macro @subset instead.) As the first argument we pass the data frame that needs filtering, so the code becomes @where(iris). As the second argument we pass the condition. Now look at the condition:

:Species .== "versicolor"

For now just think of :Species as meaning iris[!, "Species"]. The colon : before Species means that :Species is a Symbol. More on Symbols in later blogs, maybe.
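In the meantime, here is a tiny sketch of what a Symbol is, using only Base Julia. The variable names s and t are made up for illustration.

```julia
# A Symbol is written with a leading colon, or built from a string
s = :Species
t = Symbol("Species")

println(typeof(s))        # shows the type of s, which is Symbol rather than String
println(s == t)           # both spell the same name, so they compare equal
println(String(s))        # convert the Symbol back to an ordinary string
println(s == "Species")   # a Symbol and a String are different kinds of value
```

Packages like DataFramesMeta use Symbols as lightweight names for columns, which is why the condition refers to :Species rather than to a string.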

Now look at this one: .==. Let’s fire up the Julia REPL and do this:

julia> ["setosa", "versicolor", "virginica", "setosa"] .== "versicolor"
4-element BitVector:
 0
 1
 0
 0

As you see above, when you compare an array with .==, it compares each and every element of the array. What matches becomes 1 (true) and what doesn’t becomes 0 (false). This is used by @where() to keep only the rows whose Species is versicolor. We store the result in a variable named iris_versicolor and print its first five rows using first(iris_versicolor, 5).

5 rows × 5 columns

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	7.0	3.2	4.7	1.4	versicolor
2	6.4	3.2	4.5	1.5	versicolor
3	6.9	3.1	4.9	1.5	versicolor
4	5.5	2.3	4.0	1.3	versicolor
5	6.5	2.8	4.6	1.5	versicolor
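The filtering idea can be tried with plain arrays, no packages needed. This is only a sketch of the mechanism, not how @where() is implemented internally; the two short vectors are hand-made stand-ins for the Species and PetalLength columns, with values picked from the iris tables in this post.

```julia
# Hand-made stand-ins for two columns of the data frame
species      = ["setosa", "versicolor", "virginica", "versicolor"]
petal_length = [1.4, 4.7, 6.0, 4.5]

# Element-wise comparison: one true/false entry per row
mask = species .== "versicolor"

# Indexing with the mask keeps only the rows where it is true
versicolor_petal_length = petal_length[mask]

println(mask)
println(versicolor_petal_length)
```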

In a similar fashion, we filter out setosa:

iris_setosa = @where(iris, :Species .== "setosa")
first(iris_setosa, 5)

5 rows × 5 columns

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa

and virginica:

iris_virginica = @where(iris, :Species .== "virginica")
first(iris_virginica, 5)

5 rows × 5 columns

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	6.3	3.3	6.0	2.5	virginica
2	5.8	2.7	5.1	1.9	virginica
3	7.1	3.0	5.9	2.1	virginica
4	6.3	2.9	5.6	1.8	virginica
5	6.5	3.0	5.8	2.2	virginica
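Since the three filters differ only in the species name, the same split could also be written as a loop. Here is a rough sketch of that idea with plain arrays and Base Julia only; the vectors are hand-made stand-ins for the real columns, and petal_length_by_species is a name I made up.

```julia
# Hand-made stand-ins for the Species and PetalLength columns
species      = ["setosa", "versicolor", "virginica", "versicolor", "setosa"]
petal_length = [1.4, 4.7, 6.0, 4.5, 1.4]

# Map each species name to the petal lengths of its rows
petal_length_by_species = Dict(
    name => petal_length[species .== name] for name in unique(species)
)

for (name, lengths) in petal_length_by_species
    println(name, " => ", lengths)
end
```

A Dict does not promise any particular order, so the species may print in a different order than they appear in the data.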

Now let’s plot versicolor PetalLength vs SepalLength in red dots:

petal_lengths = iris_versicolor[!, "PetalLength"]
sepal_lengths = iris_versicolor[!, "SepalLength"]

petal_length_sepal_length_plot = scatter(
    petal_lengths,
    sepal_lengths,
    color = "red",
    label = "versicolor"
)

Scatter plot of petal length (x) against sepal length (y) for versicolor, in red

Now we do the same for setosa in blue dots:

petal_lengths = iris_setosa[!, "PetalLength"]
sepal_lengths = iris_setosa[!, "SepalLength"]

scatter!(
    petal_length_sepal_length_plot,
    petal_lengths, sepal_lengths,
    color = "blue",
    label = "setosa"
)

Scatter plot with versicolor in red and setosa in blue

and virginica in green:

petal_lengths = iris_virginica[!, "PetalLength"]
sepal_lengths = iris_virginica[!, "SepalLength"]

petal_length_sepal_length_plot = scatter!(
    petal_length_sepal_length_plot,
    petal_lengths,
    sepal_lengths,
    color = "green",
    label = "virginica"
)

Scatter plot with versicolor in red, setosa in blue and virginica in green

Now let’s give the plot a title, label its axes and put the legend in the top left so that it does not obstruct our plot:

scatter!(
    petal_length_sepal_length_plot,
    title = "Iris Data Set",
    xlabel = "Petal Length (cm)",
    ylabel = "Sepal Length (cm)",
    legend = :topleft
)

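Final scatter plot titled "Iris Data Set", with Petal Length (cm) on the x axis, Sepal Length (cm) on the y axis and the legend in the top left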

You can get the Jupyter notebook for this blog here.