K Nearest Neighbors in action

In our last blog K Nearest Neighbors (KNN), we saw the theory behind that algorithm and how a computer classifies things. Here in this blog we are going to code for it. To do so, for the sake of learning, we are going to generate our own data.

Scenario

Let’s assume that we wanted to build a system where given size of a orange the system should determine if the orange should be sent to luxury hotels, or if it needs to be sent to retail as edible fruit or if it needs to sent for juice factories.

Let’s assume you are in the orange sorting facility and are measuring fruit sizes in bins that are labeled one of these. Since this is not real now, we have written a function to simulate it, where given orange size it will label it.

function category(orange_size)
    if orange_size > 6
        return "Luxury Hotels"
    end

    if orange_size > 4
        return "Edible Fruit"
    end

    "Juice"
end

Output:

category (generic function with 1 method)

Now let’s test this function, what is the output when the diameter of the orange is very small, say 1.5 inches:

category(1.5)

Output:

"Juice"

Its Juice, now what an orange with diameter of 4.5 goes for?

category(4.5)

Output:

"Edible Fruit"

It goes as an edible fruit in retail. Now let’s check for an orange that has diameter 6.2 inches.

category(6.2)

Output:

"Luxury Hotels"

It goes to luxury hotels.

Generating fake data

Now since are not in a real orange sorting facility, we will write a function to fake our measurements. Below is a function named fake_data_generator, which generates fake data for us.

function fake_data_generator(number_of_points = 50)
    output = []

    for i in 1:number_of_points
        orange_size = rand(2:0.1:8)
        push!(output, (orange_size, category(orange_size)))
    end

    output
end

Output:

fake_data_generator (generic function with 2 methods)

I won’t be going into the details of this function, as our primary motive now is to learn about K Nearest Neighbors than to learn to generate fake data.

Now let’s generate fake data:

orange_sizes = fake_data_generator(50)

Output:

50-element Vector{Any}:
 (3.4, "Juice")
 (7.9, "Luxury Hotels")
 (5.7, "Edible Fruit")
 (7.4, "Luxury Hotels")
 (7.3, "Luxury Hotels")
 (5.9, "Edible Fruit")
 (5.4, "Edible Fruit")
 (6.5, "Luxury Hotels")
 (8.0, "Luxury Hotels")
 (5.3, "Edible Fruit")
 (2.6, "Juice")
 (4.3, "Edible Fruit")
 (3.0, "Juice")
 ⋮
 (2.8, "Juice")
 (6.9, "Luxury Hotels")
 (7.3, "Luxury Hotels")
 (2.1, "Juice")
 (4.8, "Edible Fruit")
 (5.0, "Edible Fruit")
 (7.5, "Luxury Hotels")
 (5.6, "Edible Fruit")
 (5.1, "Edible Fruit")
 (5.9, "Edible Fruit")
 (7.6, "Luxury Hotels")
 (2.8, "Juice")

Basic functions

To make KNN work, we need to write some basic functions. First let’s write a function to count number occourances of things in an array.

function counter(array_of_elements)
    counts = Dict()

    for element in array_of_elements
        counts[element] = get(counts, element, 0) + 1
    end

    counts
end

Output:

counter (generic function with 1 method)

In the above function we pass an array array_of_elements to it. We have a dictionary called counts:

function counter(array_of_elements)
    counts = Dict()
end

Next, for every element in array_of_elements:

function counter(array_of_elements)
    counts = Dict()

    for element in array_of_elements
    end
end

We get the counts of the element using get(counts, element), or set it to zero if not found get(counts, element, 0) and add one to increment it get(counts, element, 0) + 1 as shown in the line counts[element] = get(counts, element, 0) + 1 below:

function counter(array_of_elements)
    counts = Dict()

    for element in array_of_elements
        counts[element] = get(counts, element, 0) + 1
    end
end

Finally we return the counts dictionary:

function counter(array_of_elements)
    counts = Dict()

    for element in array_of_elements
        counts[element] = get(counts, element, 0) + 1
    end

    counts
end

Let’s test our function:

counts = counter(["Juice", "Edible Fruit", "Juice", "Luxury Hotels"])

Output:

Dict{Any, Any} with 3 entries:
  "Edible Fruit"  => 1
  "Juice"         => 2
  "Luxury Hotels" => 1

Seems to work.

Next we should see which count is the highest. For that we write a function highest_vote as shown below:

function highest_vote(counts)
    max_val = 0
    max_vote = ""

    for (key, value) in counts
        if value > max_val
            max_val = value
            max_vote = key
        end
    end

    max_vote
end

Output:

highest_vote (generic function with 1 method)

We code highest_vote as follows. First we have max_val which records maximum vote counts, we set it to zero and we have a variable max_vote which records which label has the max_val, which we set it to an empty string as shown below:

function highest_vote(counts)
    max_val = 0
    max_vote = ""
end

Next for each and every key value pair in counts, which should be a Dict:

function highest_vote(counts)
    max_val = 0
    max_vote = ""

    for (key, value) in counts
    end
end

When the value exceeds max_val, we note the value into max_val and note the key in max_vote as shown below:

function highest_vote(counts)
    max_val = 0
    max_vote = ""

    for (key, value) in counts
        if value > max_val
            max_val = value
            max_vote = key
        end
    end
end

So max_vote at any point contains the key that has has the largest value that was iterated till that point in time. All we need to do now is to return the key that contains the largest value, this key is stored in max_vote so we return it:

function highest_vote(counts)
    max_val = 0
    max_vote = ""

    for (key, value) in counts
        if value > max_val
            max_val = value
            max_vote = key
        end
    end

    max_vote
end

Now let’s test this function with counts which contains the following dictionary:

{
  "Edible Fruit"  => 1
  "Juice"         => 2
  "Luxury Hotels" => 1
}

highest_vote(counts)

Output:

"Juice"

Seems to work!

Plotting our sample data

Let’s plot our generated data, for that we create a Dict named key_values, see the code below, and for split the sizes according to what we call is sell. Sell is nothing but the name of the label which defines where the orange of a particular size must be sold.

key_values = Dict(
    "Luxury Hotels" => [],
    "Juice" => [],
    "Edible Fruit" => []
)

for (size, sell) in orange_sizes
    push!(key_values[sell], size)
end

key_values

Output:

Dict{String, Vector{Any}} with 3 entries:
  "Edible Fruit"  => [5.7, 5.9, 5.4, 5.3, 4.3, 4.5, 4.8, 5.9, 5.8, 5.5, 4.3, 5.…
  "Juice"         => [3.4, 2.6, 3.0, 4.0, 3.8, 2.8, 3.3, 3.9, 2.9, 3.8, 2.2, 2.…
  "Luxury Hotels" => [7.9, 7.4, 7.3, 6.5, 8.0, 7.1, 7.0, 6.6, 6.3, 7.7, 6.9, 7.…

If you see the above output, we have successfully split our orange sizes into three bins.

Let’s plot it. We will plot the Juice sizes in blue, Edible Fruit sizes in red, and fruits that needs to be sent to Luxury hotels in orange as shown below:

using Plots

label = "Juice"
y = key_values[label]
x = fill(5, length(y))
p = scatter!(x, y, xlims=(4, 6), ylims=(0, 10), label = label,
    title = "Orange size and sale", ylabel = "Size in cms",
    color = "blue"
)

label = "Edible Fruit"
y = key_values[label]
x = fill(5, length(y))
scatter!(p, x, y, label = label, color = "red")

label = "Luxury Hotels"
y = key_values[label]
x = fill(5, length(y))
scatter!(p, x, y, label = label, color = "Orange")

Output:

svg

K Nearest Neighbours

Now is the time to implement K Nearest Neighbors. for that we will write a error or distance function as shown:

Δ(a, b) = abs(a - b)

Output:

Δ (generic function with 1 method)

Let’s test it out:

Δ(3, 10)

Output:

Let’s say we have a orange of diameter 5 inches, and we want to determine whaere it needs to be sold or into what bin it should be classified.

orange_size = 5

Output:

We define a variable errors_and_sells, that is a collection of distance between orange_size to every orange size we have cataloged, along with the sell label, weather it should goto Luxury hotels or so on. So we want to collect it like array of tuples like this:

[
  (error_1, label_1), (error_2, label_2), ....., (error_m, label_m),
]

And that’s what the following code does, see how we split orange_sizes to size and sell, and push the error / distance from orange_size to size, along with the sell into errors_and_sells using the following code:

errors_and_sells = []

for (size, sell) in orange_sizes
    push!(errors_and_sells, (Δ(orange_size, size), sell))
end

Le’s inspect the errors_and_sells:

errors_and_sells

Output:

50-element Vector{Any}:
 (1.6, "Juice")
 (2.9000000000000004, "Luxury Hotels")
 (0.7000000000000002, "Edible Fruit")
 (2.4000000000000004, "Luxury Hotels")
 (2.3, "Luxury Hotels")
 (0.9000000000000004, "Edible Fruit")
 (0.40000000000000036, "Edible Fruit")
 (1.5, "Luxury Hotels")
 (3.0, "Luxury Hotels")
 (0.2999999999999998, "Edible Fruit")
 (2.4, "Juice")
 (0.7000000000000002, "Edible Fruit")
 (2.0, "Juice")
 ⋮
 (2.2, "Juice")
 (1.9000000000000004, "Luxury Hotels")
 (2.3, "Luxury Hotels")
 (2.9, "Juice")
 (0.20000000000000018, "Edible Fruit")
 (0.0, "Edible Fruit")
 (2.5, "Luxury Hotels")
 (0.5999999999999996, "Edible Fruit")
 (0.09999999999999964, "Edible Fruit")
 (0.9000000000000004, "Edible Fruit")
 (2.5999999999999996, "Luxury Hotels")
 (2.2, "Juice")

Now let’s define out K value, we take it to be 20 here, we sort errors_and_sells and take first K values as shown below:

k = 20
nearest_errors_and_sells = sort(errors_and_sells)[1:k]

Output:

20-element Vector{Any}:
 (0.0, "Edible Fruit")
 (0.09999999999999964, "Edible Fruit")
 (0.20000000000000018, "Edible Fruit")
 (0.20000000000000018, "Edible Fruit")
 (0.20000000000000018, "Edible Fruit")
 (0.2999999999999998, "Edible Fruit")
 (0.40000000000000036, "Edible Fruit")
 (0.40000000000000036, "Edible Fruit")
 (0.5, "Edible Fruit")
 (0.5, "Edible Fruit")
 (0.5999999999999996, "Edible Fruit")
 (0.7000000000000002, "Edible Fruit")
 (0.7000000000000002, "Edible Fruit")
 (0.7000000000000002, "Edible Fruit")
 (0.7999999999999998, "Edible Fruit")
 (0.9000000000000004, "Edible Fruit")
 (0.9000000000000004, "Edible Fruit")
 (0.9000000000000004, "Edible Fruit")
 (0.9000000000000004, "Edible Fruit")
 (1.0, "Juice")

Note that its mostly occupied by the label Edible Fruit, as a human we know the answer, for the computer to pick it we must count the sell’s or it’s labels. So we collect all the labels in a variable named nearest_sells:

nearest_sells = [sell for (_error, sell) in nearest_errors_and_sells]

Output:

20-element Vector{String}:
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Edible Fruit"
 "Juice"

We count it:

counted_nearest = counter(nearest_sells)

Output:

Dict{Any, Any} with 2 entries:
  "Edible Fruit" => 19
  "Juice"        => 1

And we take the highest value:

highest_vote(counted_nearest)

Output:

"Edible Fruit"

Play along with different orange_size, change the value of K and see what happens. Make K to 50, then does the algorithm work right? If yes why and if no why?

Get the Jupyter notebook for this blog here https://gitlab.com/data-science-with-julia/code/-/blob/master/knn.ipynb