K Nearest Neighbors in action
In our last blog K Nearest Neighbors (KNN), we saw the theory behind that algorithm and how a computer classifies things. Here in this blog we are going to code for it. To do so, for the sake of learning, we are going to generate our own data.
Scenario
Let’s assume that we wanted to build a system where given size of a orange the system should determine if the orange should be sent to luxury hotels, or if it needs to be sent to retail as edible fruit or if it needs to sent for juice factories.
Let’s assume you are in the orange sorting facility and are measuring fruit sizes in bins that are labeled one of these. Since this is not real now, we have written a function to simulate it, where given orange size it will label it.
function category(orange_size)
if orange_size > 6
return "Luxury Hotels"
end
if orange_size > 4
return "Edible Fruit"
end
"Juice"
end
Output:
category (generic function with 1 method)
Now let’s test this function, what is the output when the diameter of the orange is very small, say 1.5 inches:
category(1.5)
Output:
"Juice"
Its Juice, now what an orange with diameter of 4.5 goes for?
category(4.5)
Output:
"Edible Fruit"
It goes as an edible fruit in retail. Now let’s check for an orange that has diameter 6.2 inches.
category(6.2)
Output:
"Luxury Hotels"
It goes to luxury hotels.
Generating fake data
Now since are not in a real orange sorting facility, we will write a function to fake our measurements. Below is a function named fake_data_generator
, which generates fake data for us.
function fake_data_generator(number_of_points = 50)
output = []
for i in 1:number_of_points
orange_size = rand(2:0.1:8)
push!(output, (orange_size, category(orange_size)))
end
output
end
Output:
fake_data_generator (generic function with 2 methods)
I won’t be going into the details of this function, as our primary motive now is to learn about K Nearest Neighbors than to learn to generate fake data.
Now let’s generate fake data:
orange_sizes = fake_data_generator(50)
Output:
50-element Vector{Any}:
(3.4, "Juice")
(7.9, "Luxury Hotels")
(5.7, "Edible Fruit")
(7.4, "Luxury Hotels")
(7.3, "Luxury Hotels")
(5.9, "Edible Fruit")
(5.4, "Edible Fruit")
(6.5, "Luxury Hotels")
(8.0, "Luxury Hotels")
(5.3, "Edible Fruit")
(2.6, "Juice")
(4.3, "Edible Fruit")
(3.0, "Juice")
⋮
(2.8, "Juice")
(6.9, "Luxury Hotels")
(7.3, "Luxury Hotels")
(2.1, "Juice")
(4.8, "Edible Fruit")
(5.0, "Edible Fruit")
(7.5, "Luxury Hotels")
(5.6, "Edible Fruit")
(5.1, "Edible Fruit")
(5.9, "Edible Fruit")
(7.6, "Luxury Hotels")
(2.8, "Juice")
Basic functions
To make KNN work, we need to write some basic functions. First let’s write a function to count number occourances of things in an array.
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
counts[element] = get(counts, element, 0) + 1
end
counts
end
Output:
counter (generic function with 1 method)
In the above function we pass an array array_of_elements
to it. We have a dictionary called counts:
function counter(array_of_elements)
counts = Dict()
end
Next, for every element
in array_of_elements
:
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
end
end
We get the counts of the element
using get(counts, element)
, or set it to zero if not found get(counts, element, 0)
and add one to increment it get(counts, element, 0) + 1
as shown in the line counts[element] = get(counts, element, 0) + 1
below:
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
counts[element] = get(counts, element, 0) + 1
end
end
Finally we return the counts
dictionary:
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
counts[element] = get(counts, element, 0) + 1
end
counts
end
Let’s test our function:
counts = counter(["Juice", "Edible Fruit", "Juice", "Luxury Hotels"])
Output:
Dict{Any, Any} with 3 entries:
"Edible Fruit" => 1
"Juice" => 2
"Luxury Hotels" => 1
Seems to work.
Next we should see which count is the highest. For that we write a function highest_vote
as shown below:
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
if value > max_val
max_val = value
max_vote = key
end
end
max_vote
end
Output:
highest_vote (generic function with 1 method)
We code highest_vote
as follows. First we have max_val
which records maximum vote counts, we set it to zero and we have a variable max_vote
which records which label has the max_val
, which we set it to an empty string as shown below:
function highest_vote(counts)
max_val = 0
max_vote = ""
end
Next for each and every key value pair in counts
, which should be a Dict
:
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
end
end
When the value
exceeds max_val
, we note the value
into max_val
and note the key
in max_vote
as shown below:
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
if value > max_val
max_val = value
max_vote = key
end
end
end
So max_vote
at any point contains the key
that has has the largest value
that was iterated till that point in time. All we need to do now is to return the key
that contains the largest value
, this key is stored in max_vote
so we return it:
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
if value > max_val
max_val = value
max_vote = key
end
end
max_vote
end
Now let’s test this function with counts
which contains the following dictionary:
{
"Edible Fruit" => 1
"Juice" => 2
"Luxury Hotels" => 1
}
highest_vote(counts)
Output:
"Juice"
Seems to work!
Plotting our sample data
Let’s plot our generated data, for that we create a Dict
named key_values
, see the code below, and for split the sizes according to what we call is sell
. Sell is nothing but the name of the label which defines where the orange of a particular size must be sold.
key_values = Dict(
"Luxury Hotels" => [],
"Juice" => [],
"Edible Fruit" => []
)
for (size, sell) in orange_sizes
push!(key_values[sell], size)
end
key_values
Output:
Dict{String, Vector{Any}} with 3 entries:
"Edible Fruit" => [5.7, 5.9, 5.4, 5.3, 4.3, 4.5, 4.8, 5.9, 5.8, 5.5, 4.3, 5.…
"Juice" => [3.4, 2.6, 3.0, 4.0, 3.8, 2.8, 3.3, 3.9, 2.9, 3.8, 2.2, 2.…
"Luxury Hotels" => [7.9, 7.4, 7.3, 6.5, 8.0, 7.1, 7.0, 6.6, 6.3, 7.7, 6.9, 7.…
If you see the above output, we have successfully split our orange sizes into three bins.
Let’s plot it. We will plot the Juice sizes in blue, Edible Fruit sizes in red, and fruits that needs to be sent to Luxury hotels in orange as shown below:
using Plots
label = "Juice"
y = key_values[label]
x = fill(5, length(y))
p = scatter!(x, y, xlims=(4, 6), ylims=(0, 10), label = label,
title = "Orange size and sale", ylabel = "Size in cms",
color = "blue"
)
label = "Edible Fruit"
y = key_values[label]
x = fill(5, length(y))
scatter!(p, x, y, label = label, color = "red")
label = "Luxury Hotels"
y = key_values[label]
x = fill(5, length(y))
scatter!(p, x, y, label = label, color = "Orange")
Output:
K Nearest Neighbours
Now is the time to implement K Nearest Neighbors. for that we will write a error or distance function as shown:
Δ(a, b) = abs(a - b)
Output:
Δ (generic function with 1 method)
Let’s test it out:
Δ(3, 10)
Output:
7
Let’s say we have a orange of diameter 5 inches, and we want to determine whaere it needs to be sold or into what bin it should be classified.
orange_size = 5
Output:
5
We define a variable errors_and_sells
, that is a collection of distance between orange_size
to every orange size we have cataloged, along with the sell label, weather it should goto Luxury hotels or so on. So we want to collect it like array of tuples like this:
[
(error_1, label_1), (error_2, label_2), ....., (error_m, label_m),
]
And that’s what the following code does, see how we split orange_sizes
to size
and sell
, and push the error / distance from orange_size
to size
, along with the sell
into errors_and_sells
using the following code:
errors_and_sells = []
for (size, sell) in orange_sizes
push!(errors_and_sells, (Δ(orange_size, size), sell))
end
Le’s inspect the errors_and_sells
:
errors_and_sells
Output:
50-element Vector{Any}:
(1.6, "Juice")
(2.9000000000000004, "Luxury Hotels")
(0.7000000000000002, "Edible Fruit")
(2.4000000000000004, "Luxury Hotels")
(2.3, "Luxury Hotels")
(0.9000000000000004, "Edible Fruit")
(0.40000000000000036, "Edible Fruit")
(1.5, "Luxury Hotels")
(3.0, "Luxury Hotels")
(0.2999999999999998, "Edible Fruit")
(2.4, "Juice")
(0.7000000000000002, "Edible Fruit")
(2.0, "Juice")
⋮
(2.2, "Juice")
(1.9000000000000004, "Luxury Hotels")
(2.3, "Luxury Hotels")
(2.9, "Juice")
(0.20000000000000018, "Edible Fruit")
(0.0, "Edible Fruit")
(2.5, "Luxury Hotels")
(0.5999999999999996, "Edible Fruit")
(0.09999999999999964, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(2.5999999999999996, "Luxury Hotels")
(2.2, "Juice")
Now let’s define out K value, we take it to be 20 here, we sort errors_and_sells
and take first K values as shown below:
k = 20
nearest_errors_and_sells = sort(errors_and_sells)[1:k]
Output:
20-element Vector{Any}:
(0.0, "Edible Fruit")
(0.09999999999999964, "Edible Fruit")
(0.20000000000000018, "Edible Fruit")
(0.20000000000000018, "Edible Fruit")
(0.20000000000000018, "Edible Fruit")
(0.2999999999999998, "Edible Fruit")
(0.40000000000000036, "Edible Fruit")
(0.40000000000000036, "Edible Fruit")
(0.5, "Edible Fruit")
(0.5, "Edible Fruit")
(0.5999999999999996, "Edible Fruit")
(0.7000000000000002, "Edible Fruit")
(0.7000000000000002, "Edible Fruit")
(0.7000000000000002, "Edible Fruit")
(0.7999999999999998, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(1.0, "Juice")
Note that its mostly occupied by the label Edible Fruit, as a human we know the answer, for the computer to pick it we must count the sell’s or it’s labels. So we collect all the labels in a variable named nearest_sells
:
nearest_sells = [sell for (_error, sell) in nearest_errors_and_sells]
Output:
20-element Vector{String}:
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Juice"
We count it:
counted_nearest = counter(nearest_sells)
Output:
Dict{Any, Any} with 2 entries:
"Edible Fruit" => 19
"Juice" => 1
And we take the highest value:
highest_vote(counted_nearest)
Output:
"Edible Fruit"
Play along with different orange_size
, change the value of K and see what happens. Make K to 50, then does the algorithm work right? If yes why and if no why?
Get the Jupyter notebook for this blog here https://gitlab.com/data-science-with-julia/code/-/blob/master/knn.ipynb