# K Nearest Neighbors in action

In our last blog K Nearest Neighbors (KNN), we saw the theory behind that algorithm and how a computer classifies things. Here in this blog we are going to code for it. To do so, for the sake of learning, we are going to generate our own data.

## Scenario

Let’s assume that we wanted to build a system where given size of a orange the system should determine if the orange should be sent to luxury hotels, or if it needs to be sent to retail as edible fruit or if it needs to sent for juice factories.

Let’s assume you are in the orange sorting facility and are measuring fruit sizes in bins that are labeled one of these. Since this is not real now, we have written a function to simulate it, where given orange size it will label it.

```
function category(orange_size)
if orange_size > 6
return "Luxury Hotels"
end
if orange_size > 4
return "Edible Fruit"
end
"Juice"
end
```

Output:

```
category (generic function with 1 method)
```

Now let’s test this function, what is the output when the diameter of the orange is very small, say 1.5 inches:

```
category(1.5)
```

Output:

```
"Juice"
```

Its Juice, now what an orange with diameter of 4.5 goes for?

```
category(4.5)
```

Output:

```
"Edible Fruit"
```

It goes as an edible fruit in retail. Now let’s check for an orange that has diameter 6.2 inches.

```
category(6.2)
```

Output:

```
"Luxury Hotels"
```

It goes to luxury hotels.

## Generating fake data

Now since are not in a real orange sorting facility, we will write a function to fake our measurements. Below is a function named `fake_data_generator`

, which generates fake data for us.

```
function fake_data_generator(number_of_points = 50)
output = []
for i in 1:number_of_points
orange_size = rand(2:0.1:8)
push!(output, (orange_size, category(orange_size)))
end
output
end
```

Output:

```
fake_data_generator (generic function with 2 methods)
```

I won’t be going into the details of this function, as our primary motive now is to learn about K Nearest Neighbors than to learn to generate fake data.

Now let’s generate fake data:

```
orange_sizes = fake_data_generator(50)
```

Output:

```
50-element Vector{Any}:
(3.4, "Juice")
(7.9, "Luxury Hotels")
(5.7, "Edible Fruit")
(7.4, "Luxury Hotels")
(7.3, "Luxury Hotels")
(5.9, "Edible Fruit")
(5.4, "Edible Fruit")
(6.5, "Luxury Hotels")
(8.0, "Luxury Hotels")
(5.3, "Edible Fruit")
(2.6, "Juice")
(4.3, "Edible Fruit")
(3.0, "Juice")
⋮
(2.8, "Juice")
(6.9, "Luxury Hotels")
(7.3, "Luxury Hotels")
(2.1, "Juice")
(4.8, "Edible Fruit")
(5.0, "Edible Fruit")
(7.5, "Luxury Hotels")
(5.6, "Edible Fruit")
(5.1, "Edible Fruit")
(5.9, "Edible Fruit")
(7.6, "Luxury Hotels")
(2.8, "Juice")
```

## Basic functions

To make KNN work, we need to write some basic functions. First let’s write a function to count number occourances of things in an array.

```
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
counts[element] = get(counts, element, 0) + 1
end
counts
end
```

Output:

```
counter (generic function with 1 method)
```

In the above function we pass an array `array_of_elements`

to it. We have a dictionary called counts:

```
function counter(array_of_elements)
counts = Dict()
end
```

Next, for every `element`

in `array_of_elements`

:

```
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
end
end
```

We get the counts of the `element`

using `get(counts, element)`

, or set it to zero if not found `get(counts, element, 0)`

and add one to increment it `get(counts, element, 0) + 1`

as shown in the line `counts[element] = get(counts, element, 0) + 1`

below:

```
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
counts[element] = get(counts, element, 0) + 1
end
end
```

Finally we return the `counts`

dictionary:

```
function counter(array_of_elements)
counts = Dict()
for element in array_of_elements
counts[element] = get(counts, element, 0) + 1
end
counts
end
```

Let’s test our function:

```
counts = counter(["Juice", "Edible Fruit", "Juice", "Luxury Hotels"])
```

Output:

```
Dict{Any, Any} with 3 entries:
"Edible Fruit" => 1
"Juice" => 2
"Luxury Hotels" => 1
```

Seems to work.

Next we should see which count is the highest. For that we write a function `highest_vote`

as shown below:

```
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
if value > max_val
max_val = value
max_vote = key
end
end
max_vote
end
```

Output:

```
highest_vote (generic function with 1 method)
```

We code `highest_vote`

as follows. First we have `max_val`

which records maximum vote counts, we set it to zero and we have a variable `max_vote`

which records which label has the `max_val`

, which we set it to an empty string as shown below:

```
function highest_vote(counts)
max_val = 0
max_vote = ""
end
```

Next for each and every key value pair in `counts`

, which should be a `Dict`

:

```
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
end
end
```

When the `value`

exceeds `max_val`

, we note the `value`

into `max_val`

and note the `key`

in `max_vote`

as shown below:

```
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
if value > max_val
max_val = value
max_vote = key
end
end
end
```

So `max_vote`

at any point contains the `key`

that has has the largest `value`

that was iterated till that point in time. All we need to do now is to return the `key`

that contains the largest `value`

, this key is stored in `max_vote`

so we return it:

```
function highest_vote(counts)
max_val = 0
max_vote = ""
for (key, value) in counts
if value > max_val
max_val = value
max_vote = key
end
end
max_vote
end
```

Now let’s test this function with `counts`

which contains the following dictionary:

```
{
"Edible Fruit" => 1
"Juice" => 2
"Luxury Hotels" => 1
}
```

```
highest_vote(counts)
```

Output:

```
"Juice"
```

Seems to work!

## Plotting our sample data

Let’s plot our generated data, for that we create a `Dict`

named `key_values`

, see the code below, and for split the sizes according to what we call is `sell`

. Sell is nothing but the name of the label which defines where the orange of a particular size must be sold.

```
key_values = Dict(
"Luxury Hotels" => [],
"Juice" => [],
"Edible Fruit" => []
)
for (size, sell) in orange_sizes
push!(key_values[sell], size)
end
key_values
```

Output:

```
Dict{String, Vector{Any}} with 3 entries:
"Edible Fruit" => [5.7, 5.9, 5.4, 5.3, 4.3, 4.5, 4.8, 5.9, 5.8, 5.5, 4.3, 5.…
"Juice" => [3.4, 2.6, 3.0, 4.0, 3.8, 2.8, 3.3, 3.9, 2.9, 3.8, 2.2, 2.…
"Luxury Hotels" => [7.9, 7.4, 7.3, 6.5, 8.0, 7.1, 7.0, 6.6, 6.3, 7.7, 6.9, 7.…
```

If you see the above output, we have successfully split our orange sizes into three bins.

Let’s plot it. We will plot the Juice sizes in blue, Edible Fruit sizes in red, and fruits that needs to be sent to Luxury hotels in orange as shown below:

```
using Plots
label = "Juice"
y = key_values[label]
x = fill(5, length(y))
p = scatter!(x, y, xlims=(4, 6), ylims=(0, 10), label = label,
title = "Orange size and sale", ylabel = "Size in cms",
color = "blue"
)
label = "Edible Fruit"
y = key_values[label]
x = fill(5, length(y))
scatter!(p, x, y, label = label, color = "red")
label = "Luxury Hotels"
y = key_values[label]
x = fill(5, length(y))
scatter!(p, x, y, label = label, color = "Orange")
```

Output:

## K Nearest Neighbours

Now is the time to implement K Nearest Neighbors. for that we will write a error or distance function as shown:

```
Δ(a, b) = abs(a - b)
```

Output:

```
Δ (generic function with 1 method)
```

Let’s test it out:

```
Δ(3, 10)
```

Output:

```
7
```

Let’s say we have a orange of diameter 5 inches, and we want to determine whaere it needs to be sold or into what bin it should be classified.

```
orange_size = 5
```

Output:

```
5
```

We define a variable `errors_and_sells`

, that is a collection of distance between `orange_size`

to every orange size we have cataloged, along with the sell label, weather it should goto Luxury hotels or so on. So we want to collect it like array of tuples like this:

```
[
(error_1, label_1), (error_2, label_2), ....., (error_m, label_m),
]
```

And that’s what the following code does, see how we split `orange_sizes`

to `size`

and `sell`

, and push the error / distance from `orange_size`

to `size`

, along with the `sell`

into `errors_and_sells`

using the following code:

```
errors_and_sells = []
for (size, sell) in orange_sizes
push!(errors_and_sells, (Δ(orange_size, size), sell))
end
```

Le’s inspect the `errors_and_sells`

:

```
errors_and_sells
```

Output:

```
50-element Vector{Any}:
(1.6, "Juice")
(2.9000000000000004, "Luxury Hotels")
(0.7000000000000002, "Edible Fruit")
(2.4000000000000004, "Luxury Hotels")
(2.3, "Luxury Hotels")
(0.9000000000000004, "Edible Fruit")
(0.40000000000000036, "Edible Fruit")
(1.5, "Luxury Hotels")
(3.0, "Luxury Hotels")
(0.2999999999999998, "Edible Fruit")
(2.4, "Juice")
(0.7000000000000002, "Edible Fruit")
(2.0, "Juice")
⋮
(2.2, "Juice")
(1.9000000000000004, "Luxury Hotels")
(2.3, "Luxury Hotels")
(2.9, "Juice")
(0.20000000000000018, "Edible Fruit")
(0.0, "Edible Fruit")
(2.5, "Luxury Hotels")
(0.5999999999999996, "Edible Fruit")
(0.09999999999999964, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(2.5999999999999996, "Luxury Hotels")
(2.2, "Juice")
```

Now let’s define out K value, we take it to be 20 here, we sort `errors_and_sells`

and take first K values as shown below:

```
k = 20
nearest_errors_and_sells = sort(errors_and_sells)[1:k]
```

Output:

```
20-element Vector{Any}:
(0.0, "Edible Fruit")
(0.09999999999999964, "Edible Fruit")
(0.20000000000000018, "Edible Fruit")
(0.20000000000000018, "Edible Fruit")
(0.20000000000000018, "Edible Fruit")
(0.2999999999999998, "Edible Fruit")
(0.40000000000000036, "Edible Fruit")
(0.40000000000000036, "Edible Fruit")
(0.5, "Edible Fruit")
(0.5, "Edible Fruit")
(0.5999999999999996, "Edible Fruit")
(0.7000000000000002, "Edible Fruit")
(0.7000000000000002, "Edible Fruit")
(0.7000000000000002, "Edible Fruit")
(0.7999999999999998, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(0.9000000000000004, "Edible Fruit")
(1.0, "Juice")
```

Note that its mostly occupied by the label Edible Fruit, as a human we know the answer, for the computer to pick it we must count the sell’s or it’s labels. So we collect all the labels in a variable named `nearest_sells`

:

```
nearest_sells = [sell for (_error, sell) in nearest_errors_and_sells]
```

Output:

```
20-element Vector{String}:
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Edible Fruit"
"Juice"
```

We count it:

```
counted_nearest = counter(nearest_sells)
```

Output:

```
Dict{Any, Any} with 2 entries:
"Edible Fruit" => 19
"Juice" => 1
```

And we take the highest value:

```
highest_vote(counted_nearest)
```

Output:

```
"Edible Fruit"
```

Play along with different `orange_size`

, change the value of K and see what happens. Make K to 50, then does the algorithm work right? If yes why and if no why?

Get the Jupyter notebook for this blog here https://gitlab.com/data-science-with-julia/code/-/blob/master/knn.ipynb