The Iris flower

I happened to go through the Iris data set with Julia, and I thought: why not blog about it here? There is a Julia package called RDatasets that provides nicely cleaned data sets for people who are starting out in machine learning. You can install it and start using it in Jupyter as shown:

using Pkg
Pkg.add("RDatasets")
using RDatasets
   Updating registry at `~/.julia/registries/General`
######################################################################### 100.0%
  Resolving package versions...
No Changes to `~/.julia/environments/v1.5/Project.toml`
No Changes to `~/.julia/environments/v1.5/Manifest.toml`

Once done, let’s look at the available data sets:

RDatasets.datasets()

763 rows × 5 columns

Row  Package  Dataset      Title                                           Rows   Columns
     String   String       String                                          Int64  Int64
1    COUNT    affairs      affairs                                         601    18
2    COUNT    azdrg112     azdrg112                                        1798   4
3    COUNT    azpro        azpro                                           3589   6
4    COUNT    badhealth    badhealth                                       1127   3
5    COUNT    fasttrakg    fasttrakg                                       15     9
6    COUNT    lbw          lbw                                             189    10
7    COUNT    lbwgrp       lbwgrp                                          6      7
8    COUNT    loomis       loomis                                          410    11
9    COUNT    mdvis        mdvis                                           2227   13
10   COUNT    medpar       medpar                                          1495   10
11   COUNT    rwm          rwm                                             27326  4
12   COUNT    rwm5yr       rwm5yr                                          19609  17
13   COUNT    ships        ships                                           40     7
14   COUNT    titanic      titanic                                         1316   4
15   COUNT    titanicgrp   titanicgrp                                      12     5
16   Ecdat    Accident     Ship Accidents                                  40     5
17   Ecdat    Airline      Cost for U.S. Airlines                          90     6
18   Ecdat    Airq         Air Quality for Californian Metropolitan Areas  30     6
19   Ecdat    Benefits     Unemployement of Blue Collar Workers            4877   18
20   Ecdat    Bids         Bids Received By U.S. Firms                     126    12
21   Ecdat    BudgetFood   Budget Share of Food for Spanish Households     23972  6
22   Ecdat    BudgetItaly  Budget Shares for Italian Households            1729   11
23   Ecdat    BudgetUK     Budget Shares of British Households             1519   10
24   Ecdat    Bwages       Wages in Belgium                                1472   4
25   Ecdat    CPSch3       Earnings from the Current Population Survey     11130  3
26   Ecdat    Capm         Stock Market Data                               516    5
27   Ecdat    Car          Stated Preferences for Car Choice               4654   70
28   Ecdat    Caschool     The California Test Score Data Set              420    17
29   Ecdat    Catsup       Choice of Brand for Catsup                      2798   14
30   Ecdat    Cigar        Cigarette Consumption                           1380   9

It’s a lot of data, so Jupyter truncates the output. Now let’s load the iris data set into a variable called iris using the following command:

iris = dataset("datasets", "iris")

As you can see, Jupyter displays only the first 30 rows:

150 rows × 5 columns

Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     Float64      Float64     Float64      Float64     Cat…
1    5.1          3.5         1.4          0.2         setosa
2    4.9          3.0         1.4          0.2         setosa
3    4.7          3.2         1.3          0.2         setosa
4    4.6          3.1         1.5          0.2         setosa
5    5.0          3.6         1.4          0.2         setosa
6    5.4          3.9         1.7          0.4         setosa
7    4.6          3.4         1.4          0.3         setosa
8    5.0          3.4         1.5          0.2         setosa
9    4.4          2.9         1.4          0.2         setosa
10   4.9          3.1         1.5          0.1         setosa
11   5.4          3.7         1.5          0.2         setosa
12   4.8          3.4         1.6          0.2         setosa
13   4.8          3.0         1.4          0.1         setosa
14   4.3          3.0         1.1          0.1         setosa
15   5.8          4.0         1.2          0.2         setosa
16   5.7          4.4         1.5          0.4         setosa
17   5.4          3.9         1.3          0.4         setosa
18   5.1          3.5         1.4          0.3         setosa
19   5.7          3.8         1.7          0.3         setosa
20   5.1          3.8         1.5          0.3         setosa
21   5.4          3.4         1.7          0.2         setosa
22   5.1          3.7         1.5          0.4         setosa
23   4.6          3.6         1.0          0.2         setosa
24   5.1          3.3         1.7          0.5         setosa
25   4.8          3.4         1.9          0.2         setosa
26   5.0          3.0         1.6          0.2         setosa
27   5.0          3.4         1.6          0.4         setosa
28   5.2          3.5         1.5          0.2         setosa
29   5.2          3.4         1.4          0.2         setosa
30   4.7          3.2         1.6          0.2         setosa

As a data scientist, or an aspiring one, you must know what this data means. You should have a clear idea of what an Iris flower is and what the columns mean; if you do not, try reading about it on Wikipedia. Some people were patient enough to collect different species of iris flower, measure the petal length, sepal length, petal width and sepal width of each one, and catalog it all. Who knows what they had in mind, but today their work benefits machine learning learners enormously.

Now, I am not sure whether I installed Julia’s DataFrames package separately, but iris is a DataFrame. DataFrames.jl is a Julia package for easy viewing and manipulation of tabular data. If the following command gives you an error, be sure to install DataFrames.jl.

Now let’s take a look at column names of iris.

names(iris)

All the columns shown below, except Species, are measurements of the length or width of a flower part. The last column is the name of the species.

5-element Array{String,1}:
 "SepalLength"
 "SepalWidth"
 "PetalLength"
 "PetalWidth"
 "Species"

If you are not sure what petals and sepals are, I hope this image helps:

Parts of a flower

Now let’s take a look at size of the iris dataset:

size(iris)

It contains 150 rows and 5 columns as shown below:

(150, 5)

Now let’s apply a function called describe() to iris; it gives some summary statistics for each of its columns

describe(iris)

as shown below:

5 rows × 7 columns

Row  variable     mean     min     median  max        nmissing  eltype
     Symbol       Union…   Any     Union…  Any        Int64     DataType
1    SepalLength  5.84333  4.3     5.8     7.9        0         Float64
2    SepalWidth   3.05733  2.0     3.0     4.4        0         Float64
3    PetalLength  3.758    1.0     4.35    6.9        0         Float64
4    PetalWidth   1.19933  0.1     1.3     2.5        0         Float64
5    Species               setosa          virginica  0         CategoricalValue{String,UInt8}

I hope you know the meaning of the statistical terms \(mean\), \(median\), \(min\) and \(max\). Ideally this blog series should explain what they are; maybe in the future. For now, please see other sources like Wikipedia for help.
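As a quick refresher, here is a minimal sketch that computes the same four statistics by hand on the first five SepalLength values from the table printed earlier. It uses only Julia's Statistics standard library; the variable names are made up for this illustration.

```julia
using Statistics  # standard library, no installation needed

# First five SepalLength values from the iris table shown earlier
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]

sample_mean   = mean(sepal_lengths)     # sum of the values divided by their count
sample_median = median(sepal_lengths)   # middle value once the data is sorted
sample_min    = minimum(sepal_lengths)  # smallest value
sample_max    = maximum(sepal_lengths)  # largest value

println((sample_mean, sample_median, sample_min, sample_max))
```

describe() reports these same quantities, only computed over all 150 rows of each column.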

If we want a fuller description of the iris data set, we need to pass the symbol :all. Once again, this blog series should say what a symbol is in Julia. But for now, passing :all will yield more statistical columns

describe(iris, :all)

as shown below:

5 rows × 13 columns (omitted printing of 4 columns)

Row  variable     mean     std       min     q25     median  q75     max        nunique
     Symbol       Union…   Union…    Any     Union…  Union…  Union…  Any        Union…
1    SepalLength  5.84333  0.828066  4.3     5.1     5.8     6.4     7.9
2    SepalWidth   3.05733  0.435866  2.0     2.8     3.0     3.3     4.4
3    PetalLength  3.758    1.7653    1.0     1.6     4.35    5.1     6.9
4    PetalWidth   1.19933  0.762238  0.1     0.3     1.3     1.8     2.5
5    Species                         setosa                          virginica  3

If you are wondering what \(q25\) and \(q75\) are, they are the first and third quartiles; read up on the interquartile range. Maybe this blog series will explain it one day.
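To make the quartiles a little more concrete, here is a small sketch using quantile() from the Statistics standard library on the first five SepalLength values from the table above. The variable names are illustrative, and the exact numbers depend on Julia's default interpolation rule for quantiles.

```julia
using Statistics  # standard library

# First five SepalLength values from the iris table shown earlier
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]

q25 = quantile(sepal_lengths, 0.25)  # a quarter of the sorted data lies at or below this
q75 = quantile(sepal_lengths, 0.75)  # three quarters of the sorted data lie at or below this
interquartile_range = q75 - q25      # spread of the middle half of the data

println((q25, q75, interquartile_range))
```

The q25 and q75 columns printed by describe(iris, :all) are the same idea applied to all 150 values of each column.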

We can take a look at the first 5 rows of the data frame by using a function called first() as shown below:

first(iris, 5)

5 rows × 5 columns

Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     Float64      Float64     Float64      Float64     Cat…
1    5.1          3.5         1.4          0.2         setosa
2    4.9          3.0         1.4          0.2         setosa
3    4.7          3.2         1.3          0.2         setosa
4    4.6          3.1         1.5          0.2         setosa
5    5.0          3.6         1.4          0.2         setosa

Similarly, the last few rows can be viewed using a function named last(). Below we view the last five rows:

last(iris, 5)

5 rows × 5 columns

Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     Float64      Float64     Float64      Float64     Cat…
1    6.7          3.0         5.2          2.3         virginica
2    6.3          2.5         5.0          1.9         virginica
3    6.5          3.0         5.2          2.0         virginica
4    6.2          3.4         5.4          2.3         virginica
5    5.9          3.0         5.1          1.8         virginica

Now here is how to extract just the petal lengths from iris.

iris[!, "PetalLength"]
150-element Array{Float64,1}:
 1.4
 1.4
 1.3
 1.5
 1.4
 1.7
 1.4
 1.5
 1.4
 1.5
 1.5
 1.6
 1.4
 ⋮
 4.8
 5.4
 5.6
 5.1
 5.1
 5.9
 5.7
 5.2
 5.0
 5.2
 5.4
 5.1
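The ! in iris[!, "PetalLength"] deserves a short note. The sketch below assumes DataFrames.jl is installed and uses a tiny stand-in table (three rows copied from the iris output above) instead of the real data set; the variable names are made up for the illustration. With !, you get the very vector stored inside the data frame, while : gives you an independent copy.

```julia
using DataFrames  # assumed installed: Pkg.add("DataFrames")

# A tiny stand-in table built from the first three iris rows shown earlier
df = DataFrame(PetalLength = [1.4, 1.4, 1.3], Species = ["setosa", "setosa", "setosa"])

no_copy      = df[!, "PetalLength"]  # the column vector stored inside df
copied       = df[:, "PetalLength"]  # an independent copy of that column
also_no_copy = df.PetalLength        # property syntax, same object as df[!, ...]

copied[1]  = 99.0  # changes only the copy, df is untouched
no_copy[1] = 1.5   # changes the column inside df as well
```

For just reading values, as in this post, either form works; the difference matters once you start modifying the result.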

So let’s go on to plotting and see if there is a way to separate the species. For this we will use just 2 parameters, PetalLength and SepalLength. I installed Plots and started using it with these commands:

Pkg.add("Plots")
using Plots
  Resolving package versions...
No Changes to `~/.julia/environments/v1.5/Project.toml`
No Changes to `~/.julia/environments/v1.5/Manifest.toml`

Now all I do is gather SepalLength in a variable sepal_length, PetalLength in petal_length and Species in species. Why on earth did I gather Species?! Anyway. I use the scatter() function to plot it as shown below:

sepal_length = iris[!, "SepalLength"]
petal_length = iris[!, "PetalLength"]
species = iris[!, "Species"]
scatter(petal_length, sepal_length, color = "red")

[Figure: scatter plot of petal length (x axis) against sepal length (y axis), all points in red]

While you were viewing the iris data frame, you must have seen only 1 species in your Jupyter notebook, since it shows only the first 30 rows or so. Actually, three different species are tabulated in this data set. To get the species names we can use this:

iris[!, "Species"]

This just pulls out the Species column. But we do not need 50 setosas followed by 50 of some other species; we just want to view the unique species names. To get the unique values from an array, we could use Set(), which keeps only distinct values. That’s what we do below:

species = Set(iris[!, "Species"])
Set{CategoricalArrays.CategoricalValue{String,UInt8}} with 3 elements:
  CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
  CategoricalArrays.CategoricalValue{String,UInt8} "setosa"
  CategoricalArrays.CategoricalValue{String,UInt8} "virginica"
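Notice that the Set above does not list the species in the order they appear in the table, because a Set has no defined order. If order matters, Julia's built-in unique() is an alternative worth knowing. Here is a small sketch on a plain array of strings (a shortened, hypothetical stand-in for the Species column):

```julia
# A shortened stand-in for the Species column
species_column = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

as_set    = Set(species_column)     # distinct values, no guaranteed order
as_vector = unique(species_column)  # distinct values, in order of first appearance

println(as_vector)
```

Both approaches drop the duplicates; unique() simply hands back a vector, which is often easier to index and loop over.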

In the code below, I tried to print just the species names without all that clunky CategoricalArrays.CategoricalValue{String,UInt8} stuff, but I failed.

species_names = []

for x in species
    push!(species_names, x)
end

species_names
3-element Array{Any,1}:
 CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
 CategoricalArrays.CategoricalValue{String,UInt8} "setosa"
 CategoricalArrays.CategoricalValue{String,UInt8} "virginica"

Once again, a failed attempt to pretty print just one species name without the clunky CategoricalArrays.CategoricalValue{String,UInt8}:

species_names[1]
CategoricalArrays.CategoricalValue{String,UInt8} "versicolor"
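One way that should get rid of the clunky type prefix is to convert each categorical value to an ordinary String with string(). The sketch below assumes the CategoricalArrays package can be loaded directly (it may need its own Pkg.add("CategoricalArrays")) and uses a small hand-made categorical array as a stand-in for the Species column.

```julia
using CategoricalArrays  # assumed installed: Pkg.add("CategoricalArrays")

# A small stand-in for iris[!, "Species"]
species_column = categorical(["setosa", "setosa", "versicolor", "virginica"])

# string() turns each categorical value into a plain String
species_names = [string(s) for s in unique(species_column)]
first_name = string(species_column[1])

for name in species_names
    println(name)  # prints only the name, without the type prefix
end
```

The same comprehension should work on the Set built earlier, for example [string(s) for s in species].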

It would be great if we could color our dots depending on the species. That might tell us something. So we need to split the data set by species. For that we can use the package DataFramesMeta. We install it as shown below:

using Pkg
Pkg.add("DataFramesMeta")
  Resolving package versions...
No Changes to `~/.julia/environments/v1.5/Project.toml`
No Changes to `~/.julia/environments/v1.5/Manifest.toml`

We use it as shown below:

using DataFramesMeta

One of the species names is versicolor. The DataFramesMeta package gives us the ability to filter a DataFrame to our needs, so let’s separate out just the versicolor data.

iris_versicolor = @where(iris, :Species .== "versicolor")
first(iris_versicolor, 5)

In the code above, we use @where(). It looks like a function, but in Julia a name that starts with an @ is a macro: it receives the code you write and rewrites it before it runs. As the first argument we pass the data frame that needs filtering, so the code becomes @where(iris). As the second argument we pass the condition. Now look at the condition:

:Species .== "versicolor"

For now, just think of :Species as meaning iris[!, "Species"]. The colon : before Species makes :Species a Symbol. Maybe more on Symbols in later posts.
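Until then, here is a minimal sketch of the two ideas that @where relies on: a Symbol is just a name, and a macro (anything called with an @) gets to see the code it is given before that code runs. The macro @code_as_text is a hypothetical one written only for this illustration; it is not part of DataFramesMeta.

```julia
# A Symbol is a name; the colon syntax and the constructor give the same thing
name_from_literal = :Species
name_from_string  = Symbol("Species")

# A macro receives the expression itself, not its value.
# This hypothetical macro simply returns that expression as text.
macro code_as_text(ex)
    return string(ex)
end

condition_text = @code_as_text(:Species .== "versicolor")
println(condition_text)
```

Because a macro sees the expression rather than its result, something like @where is free to replace :Species with the matching column of the data frame before the comparison is evaluated.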

Now look at the .== operator. Let’s fire up the Julia REPL and do this:

julia> ["setosa", "versicolor", "virginica", "setosa"] .== "versicolor"
4-element BitVector:
 0
 1
 0
 0

As you see above, when you compare an array using .==, every element of the array is compared separately. An element that matches becomes 1 (true) and one that doesn’t becomes 0 (false). @where() uses this to keep only the rows whose Species is versicolor. We store the result in a variable named iris_versicolor and print its first five rows using first(iris_versicolor, 5).

5 rows × 5 columns

Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     Float64      Float64     Float64      Float64     Cat…
1    7.0          3.2         4.7          1.4         versicolor
2    6.4          3.2         4.5          1.5         versicolor
3    6.9          3.1         4.9          1.5         versicolor
4    5.5          2.3         4.0          1.3         versicolor
5    6.5          2.8         4.6          1.5         versicolor
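The element-wise comparison described above can also be tried on plain arrays, with no data frame involved. In this sketch the arrays are small hand-made stand-ins (the petal lengths are taken from rows printed in this post), and the resulting mask of true and false values is used to pick out the matching elements, which is roughly the idea behind the filtering.

```julia
# Hand-made stand-ins for two columns
species       = ["setosa", "versicolor", "virginica", "setosa"]
petal_lengths = [1.4, 4.7, 6.0, 1.3]

mask  = species .== "versicolor"  # with the dot: one true/false per element
whole = species == "versicolor"   # without the dot: compares the whole array to the string

# Indexing with the mask keeps only the elements where the mask is true
versicolor_lengths = petal_lengths[mask]

println(versicolor_lengths)
```

Forgetting the dot is an easy mistake: the comparison then yields a single false instead of a mask.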

In a similar fashion, we pick out setosa:

iris_setosa = @where(iris, :Species .== "setosa")
first(iris_setosa, 5)

5 rows × 5 columns

Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     Float64      Float64     Float64      Float64     Cat…
1    5.1          3.5         1.4          0.2         setosa
2    4.9          3.0         1.4          0.2         setosa
3    4.7          3.2         1.3          0.2         setosa
4    4.6          3.1         1.5          0.2         setosa
5    5.0          3.6         1.4          0.2         setosa

and virginica:

iris_virginica = @where(iris, :Species .== "virginica")
first(iris_virginica, 5)

5 rows × 5 columns

Row  SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     Float64      Float64     Float64      Float64     Cat…
1    6.3          3.3         6.0          2.5         virginica
2    5.8          2.7         5.1          1.9         virginica
3    7.1          3.0         5.9          2.1         virginica
4    6.3          2.9         5.6          1.8         virginica
5    6.5          3.0         5.8          2.2         virginica
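If @where does not work for you, note that, as far as I know, later DataFramesMeta releases renamed it to @subset. The same kind of filtering can also be done with DataFrames.jl alone. The sketch below assumes a reasonably recent DataFrames.jl (1.0 or later, for subset and ByRow) and uses a small hand-made table rather than the real iris data.

```julia
using DataFrames  # assumed installed: Pkg.add("DataFrames")

# A small stand-in table with values taken from the rows printed in this post
df = DataFrame(
    PetalLength = [1.4, 4.7, 6.0, 4.5],
    Species = ["setosa", "versicolor", "virginica", "versicolor"],
)

# Two ways to keep only the rows whose Species equals "versicolor"
versicolor_a = filter(:Species => ==("versicolor"), df)
versicolor_b = subset(df, :Species => ByRow(==("versicolor")))

# Or split the table on every species in one go
by_species = groupby(df, :Species)
```

groupby() is handy when you want all the species at once instead of writing one filter per species.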

Now let’s plot versicolor PetalLength vs SepalLength as red dots:

petal_lengths = iris_versicolor[!, "PetalLength"]
sepal_lengths = iris_versicolor[!, "SepalLength"]

petal_length_sepal_length_plot = scatter(
    petal_lengths,
    sepal_lengths,
    color= "red",
    label = "versicolor"
)

[Figure: scatter plot with the versicolor points in red]

Now we do the same for setosa in blue dots:

petal_lengths = iris_setosa[!, "PetalLength"]
sepal_lengths = iris_setosa[!, "SepalLength"]

scatter!(
    petal_length_sepal_length_plot,
    petal_lengths, sepal_lengths,
    color = "blue",
    label = "setosa"
)

[Figure: scatter plot with versicolor in red and setosa in blue]

and virginica in green:

petal_lengths = iris_virginica[!, "PetalLength"]
sepal_lengths = iris_virginica[!, "SepalLength"]

petal_length_sepal_length_plot = scatter!(
    petal_length_sepal_length_plot,
    petal_lengths,
    sepal_lengths,
    color= "green",
    label = "virginica"
)

[Figure: scatter plot with versicolor in red, setosa in blue and virginica in green]

Now let’s give the plot a title, label its axes and put the legend in the top left so that it does not obstruct our plot:

scatter!(
    petal_length_sepal_length_plot,
    title = "Iris Data Set",
    xlabel = "Petal Length (cm)",
    ylabel = "Sepal Length (cm)",
    legend = :topleft
)

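[Figure: the same scatter plot with the title "Iris Data Set", axis labels in cm and the legend in the top left]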
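As a possible shortcut, Plots also has a group keyword that splits the points into one series per distinct value, so the whole figure can be drawn in a single call without filtering first. The sketch below assumes Plots is installed and uses six hand-copied rows (two per species, taken from the tables above) instead of the full data set. The colors come from the default palette, so they will not be the red, blue and green chosen above.

```julia
using Plots  # assumed installed, as earlier in this post

# Two rows per species, copied from the tables printed above
petal_lengths = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
sepal_lengths = [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]
species = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

grouped_plot = scatter(
    petal_lengths,
    sepal_lengths,
    group = species,  # one series, and one legend entry, per species
    title = "Iris Data Set",
    xlabel = "Petal Length (cm)",
    ylabel = "Sepal Length (cm)",
    legend = :topleft,
)
```

With the real data, the equivalent call would pass iris[!, "PetalLength"], iris[!, "SepalLength"] and group = iris[!, "Species"].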

You can get the Jupyter notebook for this post here.