Iris Dataset Analysis - Plotting it.
[Image: The Iris flower]
I happened to go through the Iris data set with Julia and thought, why not blog about it here? There is a package called RDatasets in Julia which provides nicely cleaned data sets for people starting to learn machine learning. You can install it and start using it with Jupyter as shown:
using Pkg
Pkg.add("RDatasets")
using RDatasets
Updating registry at `~/.julia/registries/General`
######################################################################### 100.0%
Resolving package versions...
No Changes to `~/.julia/environments/v1.5/Project.toml`
No Changes to `~/.julia/environments/v1.5/Manifest.toml`
Once done, let's look at the available data sets:
RDatasets.datasets()
Row | Package | Dataset | Title | Rows | Columns |
---|---|---|---|---|---|
| | String | String | String | Int64 | Int64 |
1 | COUNT | affairs | affairs | 601 | 18 |
2 | COUNT | azdrg112 | azdrg112 | 1798 | 4 |
3 | COUNT | azpro | azpro | 3589 | 6 |
4 | COUNT | badhealth | badhealth | 1127 | 3 |
5 | COUNT | fasttrakg | fasttrakg | 15 | 9 |
6 | COUNT | lbw | lbw | 189 | 10 |
7 | COUNT | lbwgrp | lbwgrp | 6 | 7 |
8 | COUNT | loomis | loomis | 410 | 11 |
9 | COUNT | mdvis | mdvis | 2227 | 13 |
10 | COUNT | medpar | medpar | 1495 | 10 |
11 | COUNT | rwm | rwm | 27326 | 4 |
12 | COUNT | rwm5yr | rwm5yr | 19609 | 17 |
13 | COUNT | ships | ships | 40 | 7 |
14 | COUNT | titanic | titanic | 1316 | 4 |
15 | COUNT | titanicgrp | titanicgrp | 12 | 5 |
16 | Ecdat | Accident | Ship Accidents | 40 | 5 |
17 | Ecdat | Airline | Cost for U.S. Airlines | 90 | 6 |
18 | Ecdat | Airq | Air Quality for Californian Metropolitan Areas | 30 | 6 |
19 | Ecdat | Benefits | Unemployement of Blue Collar Workers | 4877 | 18 |
20 | Ecdat | Bids | Bids Received By U.S. Firms | 126 | 12 |
21 | Ecdat | BudgetFood | Budget Share of Food for Spanish Households | 23972 | 6 |
22 | Ecdat | BudgetItaly | Budget Shares for Italian Households | 1729 | 11 |
23 | Ecdat | BudgetUK | Budget Shares of British Households | 1519 | 10 |
24 | Ecdat | Bwages | Wages in Belgium | 1472 | 4 |
25 | Ecdat | CPSch3 | Earnings from the Current Population Survey | 11130 | 3 |
26 | Ecdat | Capm | Stock Market Data | 516 | 5 |
27 | Ecdat | Car | Stated Preferences for Car Choice | 4654 | 70 |
28 | Ecdat | Caschool | The California Test Score Data Set | 420 | 17 |
29 | Ecdat | Catsup | Choice of Brand for Catsup | 2798 | 14 |
30 | Ecdat | Cigar | Cigarette Consumption | 1380 | 9 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
It’s a lot of data, so Jupyter truncates the output. Now let’s load the iris data set into a variable called iris using the following command:
iris = dataset("datasets", "iris")
As you can see it displays the first 30 rows:
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
As a data scientist, or an aspiring one, you must know what this data means. You should have a clear idea of what an Iris flower is and what the columns mean; if you do not, try reading about it here on Wikipedia. Some people were patient enough to collect different species of iris flower, measure their petal length, sepal length, petal width and sepal width, and catalog the results. Who knows what they had in mind, but their work now benefits machine learning learners enormously.
Now, I am not sure whether I installed Julia’s DataFrames package separately, but iris is a DataFrame. DataFrames is a package created in Julia for easy visualization and manipulation of tabular data. If the following command gives you an error, be sure to install DataFrames.jl.
Now let’s take a look at the column names of iris.
names(iris)
All the columns shown below, except for Species, are measures of the length or width of a flower part. The last column is the name of the species.
5-element Array{String,1}:
"SepalLength"
"SepalWidth"
"PetalLength"
"PetalWidth"
"Species"
If you are not sure what petals and sepals are, I hope this image helps:
[Image: Parts of a flower]
Now let’s take a look at size of the iris dataset:
size(iris)
It contains 150 rows and 5 columns as shown below:
(150, 5)
Now let’s apply a function called describe() on iris. It gives some statistical values for each of iris’s columns
describe(iris)
as shown below:
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
5 | Species | | setosa | | virginica | 0 | CategoricalValue{String,UInt8} |
I hope you know the meaning of the statistical terms \(mean\), \(median\), \(min\) and \(max\). Ideally this blog series should explain what they are, maybe in the future; for now please see other sources like Wikipedia for help.
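For readers who would like to see these four statistics computed directly, here is a minimal sketch using Julia's standard Statistics library. The vector holds the first five sepal lengths from the table shown earlier; the variable names are only illustrative and are not part of the notebook.

```julia
using Statistics  # standard library; provides mean and median

# First five SepalLength values from the iris table
sepal_sample = [5.1, 4.9, 4.7, 4.6, 5.0]

sample_mean   = mean(sepal_sample)     # sum of the values divided by their count
sample_median = median(sepal_sample)   # middle value once the values are sorted
sample_min    = minimum(sepal_sample)  # smallest value
sample_max    = maximum(sepal_sample)  # largest value

println((sample_mean, sample_median, sample_min, sample_max))
```

describe() reports the same kind of numbers, only for every column of the data frame at once.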
If we want a fuller description of the iris data set, we need to pass the symbol :all. Once again, this blog series should say what a symbol is in Julia. But for now, passing :all will yield more statistical columns
describe(iris, :all)
as shown below:
Row | variable | mean | std | min | q25 | median | q75 | max | nunique |
---|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Union… | Any | Union… | Union… | Union… | Any | Union… |
1 | SepalLength | 5.84333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | |
2 | SepalWidth | 3.05733 | 0.435866 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 | |
3 | PetalLength | 3.758 | 1.7653 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | |
4 | PetalWidth | 1.19933 | 0.762238 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | |
5 | Species | | | setosa | | | | virginica | 3 |
If you are wondering what \(q25\) and \(q75\) are, they are the 25th and 75th percentiles; refer to the interquartile range. Maybe this blog series will one day explain it.
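As a rough illustration of the quartiles, the sketch below uses quantile() from Julia's standard Statistics library on five sepal lengths taken from the iris table. The variable names are made up for this example, and quantile() here uses its default interpolation rule.

```julia
using Statistics  # standard library; provides quantile

# First five SepalLength values from the iris table
sepal_sample = [5.1, 4.9, 4.7, 4.6, 5.0]

q25 = quantile(sepal_sample, 0.25)  # roughly a quarter of the values lie below this
q75 = quantile(sepal_sample, 0.75)  # roughly three quarters lie below this
iqr = q75 - q25                     # interquartile range: spread of the middle half

println((q25, q75, iqr))
```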
We can take a look at the first 5 rows of the data frame by using a function called first() as shown below:
first(iris, 5)
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Similarly, the last few rows can be viewed using a function named last(). Below we view the last five rows:
last(iris, 5)
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
2 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
3 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
4 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
5 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
Now here is a way to extract just the petal lengths from iris.
iris[!, "PetalLength"]
150-element Array{Float64,1}:
1.4
1.4
1.3
1.5
1.4
1.7
1.4
1.5
1.4
1.5
1.5
1.6
1.4
⋮
4.8
5.4
5.6
5.1
5.1
5.9
5.7
5.2
5.0
5.2
5.4
5.1
So let’s go on to plotting and see if there is a way to separate the species. For this we will use just 2 parameters, PetalLength and SepalLength. I installed Plots and started using it with these commands:
Pkg.add("Plots")
using Plots
Resolving package versions...
No Changes to `~/.julia/environments/v1.5/Project.toml`
No Changes to `~/.julia/environments/v1.5/Manifest.toml`
Now all I do is gather SepalLength in a variable sepal_length, PetalLength in petal_length and Species in species. Why on earth did I gather Species?! Anyway. I use the scatter() function to plot it as shown below:
sepal_length = iris[!, "SepalLength"]
petal_length = iris[!, "PetalLength"]
species = iris[!, "Species"]
scatter(petal_length, sepal_length, color = "red")
While you were viewing the iris data frame, you must have seen only one species in your Jupyter notebook, since it tends to show only the first 30 rows or so. Actually, three different species are tabulated in this data set. To get the species names, we can use this:
iris[!, "Species"]
This just pulls out the Species column. But we do not need 50 setosas followed by 50 of another species; we just want to view the unique species names. To get the unique values from an array, we can use Set(). That’s what we do below:
species = Set(iris[!, "Species"])
Set{CategoricalArrays.CategoricalValue{String,UInt8}} with 3 elements:
CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
CategoricalArrays.CategoricalValue{String,UInt8} "setosa"
CategoricalArrays.CategoricalValue{String,UInt8} "virginica"
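Notice that the Set above does not list the species in the order they appear in the data. As a small sketch of the difference, the example below compares Set() with Base's unique() on a hand-written array of plain strings, which stands in for the real Species column; the variable names are illustrative only.

```julia
# A short stand-in for the Species column, using plain strings
species_column = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]

species_set  = Set(species_column)     # unique values, with no guaranteed order
species_list = unique(species_column)  # unique values, in order of first appearance

println(species_list)
```

unique() returns an ordinary array, which is often handier than a Set when the order of the names matters, for example in a plot legend.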
In the code below, I thought of printing just the species names without all that clunky CategoricalArrays.CategoricalValue{String,UInt8} stuff, but I failed.
species_names = []
for x in species
    push!(species_names, x)
end
species_names
3-element Array{Any,1}:
CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
CategoricalArrays.CategoricalValue{String,UInt8} "setosa"
CategoricalArrays.CategoricalValue{String,UInt8} "virginica"
Once again, my failed attempt to pretty print just a species name without the clunky CategoricalArrays.CategoricalValue{String,UInt8}:
species_names[1]
CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
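One possible way around the clunky output, offered as a sketch and not as the only approach: converting each categorical value with string() should give back just its label as a plain String. This assumes RDatasets is installed as earlier in this post, and that string() on a categorical value returns only the label text.

```julia
using RDatasets  # assumed to be installed, as earlier in this post

iris = dataset("datasets", "iris")

# Assumption: string(s) yields only the label of a categorical value.
# unique() keeps the species in order of first appearance.
species_names = [string(s) for s in unique(iris[!, "Species"])]

println(species_names)
```

The result is an ordinary array of strings, so individual elements such as species_names[1] print without the type prefix.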
It would be great if we could color our dots depending on the species. That might tell us something. So we need to split the data set by species. For that we can use the package DataFramesMeta. We install it as shown below:
using Pkg
Pkg.add("DataFramesMeta")
Resolving package versions...
No Changes to `~/.julia/environments/v1.5/Project.toml`
No Changes to `~/.julia/environments/v1.5/Manifest.toml`
We use it as shown below:
using DataFramesMeta
One of the species names is versicolor. The DataFramesMeta package gives us the ability to filter a DataFrame to our needs, so let’s separate out just the versicolor data.
iris_versicolor = @where(iris, :Species .== "versicolor")
first(iris_versicolor, 5)
In the code above, we use something called @where(). Is it a function? Not quite: it starts with an @, which in Julia marks a macro. As the first argument we pass the data frame that needs filtering, so the code so far becomes @where(iris). As the second argument we pass the condition. Now look at the condition:
:Species .== "versicolor"
For now, just think of :Species as meaning iris[!, "Species"]. The colon : before Species means that :Species is a Symbol. More on Symbols in later blogs, maybe.
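Until then, here is a tiny sketch of what a Symbol looks like in plain Julia, using only Base functions. The variable name column is just for illustration.

```julia
column = :Species                      # a Symbol literal, written with a leading colon

println(typeof(column))                # the type of the value is Symbol
println(Symbol("Species") == column)   # a Symbol can also be built from a String
println(String(column))                # and converted back into a String
```

A Symbol is essentially a name, which is why packages like DataFramesMeta use it to refer to columns.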
Now look at this one, .==. Let’s fire up the Julia REPL and do this:
julia> ["setosa", "versicolor", "virginica", "setosa"] .== "versicolor"
4-element BitVector:
0
1
0
0
As you see above, when you compare an array using .==, it compares each and every element of the array. What matches becomes 1 and what doesn’t becomes 0. @where() uses this to keep only the rows whose Species is versicolor. We store the result in a variable named iris_versicolor and print its first five rows using first(iris_versicolor, 5).
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
2 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
3 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
4 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
5 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
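To see how an element-wise comparison can drive row selection, the idea can be tried on plain arrays without any packages. This is only a sketch of the principle, not of how @where() is implemented; the two arrays are hand-typed from a few rows of the iris tables in this post.

```julia
# Four hand-picked rows: species and the matching petal length
species      = ["setosa", "versicolor", "virginica", "versicolor"]
petal_length = [1.4, 4.7, 6.0, 4.5]

mask = species .== "versicolor"          # one true/false per element
versicolor_lengths = petal_length[mask]  # keep only positions where mask is true

println(versicolor_lengths)
```

Indexing with the mask keeps the second and fourth entries, which are the versicolor ones.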
In a similar fashion, we filter out setosa:
iris_setosa = @where(iris, :Species .== "setosa")
first(iris_setosa, 5)
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
and virginica:
iris_virginica = @where(iris, :Species .== "virginica")
first(iris_virginica, 5)
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
2 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
3 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
4 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
5 | 6.5 | 3.0 | 5.8 | 2.2 | virginica |
Now let’s plot versicolor PetalLength vs SepalLength in red dots:
petal_lengths = iris_versicolor[!, "PetalLength"]
sepal_lengths = iris_versicolor[!, "SepalLength"]
petal_length_sepal_length_plot = scatter(
    petal_lengths,
    sepal_lengths,
    color = "red",
    label = "versicolor"
)
Now we do the same for setosa in blue dots:
petal_lengths = iris_setosa[!, "PetalLength"]
sepal_lengths = iris_setosa[!, "SepalLength"]
scatter!(
    petal_length_sepal_length_plot,
    petal_lengths,
    sepal_lengths,
    color = "blue",
    label = "setosa"
)
and virginica in green:
petal_lengths = iris_virginica[!, "PetalLength"]
sepal_lengths = iris_virginica[!, "SepalLength"]
petal_length_sepal_length_plot = scatter!(
    petal_length_sepal_length_plot,
    petal_lengths,
    sepal_lengths,
    color = "green",
    label = "virginica"
)
Now let’s give the plot a title, label its axes, and put the legend in the top left so that it does not obstruct our plot:
scatter!(
    petal_length_sepal_length_plot,
    title = "Iris Data Set",
    xlabel = "Petal Length (cm)",
    ylabel = "Sepal Length (cm)",
    legend = :topleft
)
You can get the Jupyter notebook for this blog here.