Data Drift
In this blog, I want to explain what the term Data Drift means. Let me explain it using some plots. First, let's add Plots
to our Jupyter notebook.
using Plots
We usually work five days a week, so let's gather those days in a variable called days.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
Output:
5-element Vector{String}:
"Mon"
"Tue"
"Wed"
"Thu"
"Fri"
Now let's try to create a variable called 2020_travel_time, where we note the average travel time of our commute to work on each of the five working days:
2020_travel_time = [60, 45, 35, 40, 30]
Output:
syntax: "2020" is not a valid function argument name around In[3]:1
Stacktrace:
[1] top-level scope
@ In[3]:1
[2] eval
@ ./boot.jl:360 [inlined]
[3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1094
As you can see, it throws an error: in Julia, variable names can't start with a number. So let's rename the variable to travel_time_2020, as shown below.
travel_time_2020 = [60, 45, 35, 40, 30]
Output:
5-element Vector{Int64}:
60
45
35
40
30
Let’s plot days
vs travel_time_2020
as shown below.
plot_2020 = bar(
days,
travel_time_2020,
ylabel = "Travel Time (minutes)",
xlabel= "Week Day",
title = "2020 Time Wated in Travel",
label = "Travel Time"
)
Output: (bar chart of the 2020 travel times for each weekday)
We import the Statistics package and find the mean of the travel times.
using Statistics
average_2020 = mean(travel_time_2020)
Output:
42.0
So we travel about 42 minutes a day; let's say this is the morning commute time. For the sake of learning, let's assume this data was obtained by averaging thousands of rides on a ride-sharing platform where we work as data scientists. Let's also say that our ML (Machine Learning) algorithms have found that it's better to ask for a rating from users whose travel time exceeds an upper limit or falls below a lower limit.
average_line = fill(average_2020, length(days))
upper_limit = fill((average_2020 + 10), length(days))
lower_limit = fill((average_2020 - 10), length(days))
plot!(plot_2020, days, average_line, linewidth = 5, label = "Average")
plot!(plot_2020, days, upper_limit, linewidth = 3, label = "Upper Limit")
plot!(plot_2020, days, lower_limit, linewidth = 3, label = "Lower Limit")
Output: (2020 bar chart with the average, upper-limit, and lower-limit lines overlaid)
Something might be wrong if the car takes too long to reach its destination, and the driver may have driven rashly if the ride is too quick.
So in the graph above, the upper limit is about 52 minutes of travel and the lower limit is about 32; when the travel time breaches these limits, you tell the app to ask for feedback about the cab.
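To make this rule concrete, here is a minimal sketch of how such a check could look in Julia. The helper name should_ask_feedback and its one-line logic are my own illustration, not the actual logic of a real ride-sharing app.
# Hypothetical helper: request feedback only when the travel time
# falls outside the learned limits.
should_ask_feedback(travel_time, lower, upper) = travel_time < lower || travel_time > upper
# With 2020's limits of 32 and 52 minutes, a 60-minute ride triggers a request.
should_ask_feedback(60, average_2020 - 10, average_2020 + 10)   # returns true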
Now your city's population has increased, but all offices still open at the same time in 2021. Because of that, we have new travel times for 2021, as shown below, and the upper and lower limits have definitely changed with them.
travel_time_2021 = [75, 60, 50, 45, 35]
average_2021 = mean(travel_time_2021)
average_line = fill(average_2021, length(days))
upper_limit = fill((average_2021 + 10), length(days))
lower_limit = fill((average_2021 - 10), length(days))
Output:
5-element Vector{Float64}:
43.0
43.0
43.0
43.0
43.0
Let’s plot it.
plot_2021 = bar(
days,
travel_time_2021,
ylabel = "Travel Time (minutes)",
xlabel= "Week Day",
title = "2021 Time Wated in Travel",
label = "Travel Time"
)
plot!(plot_2021, days, average_line, linewidth = 3, label = "Average")
plot!(plot_2021, days, upper_limit, linewidth = 3, label = "Upper Limit")
plot!(plot_2021, days, lower_limit, linewidth = 3, label = "Lower Limit")
Output: (2021 bar chart with the average, upper-limit, and lower-limit lines overlaid)
As you can see from the image above, the average travel time is now about 53 minutes, and the upper and lower limits are about 63 and 43 minutes respectively. If you still tell your app to ask for feedback when the travel time breaches the old upper and lower limits, you will be getting the wrong feedback. 2020's upper limit is about 52 minutes, and if you keep asking for feedback from people whose rides took around that long, the ride will have felt perfectly normal to them, so they won't give the critical feedback that is essential for rating a cab. This could severely affect the quality of the app you are building.
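To see the effect of the stale limits numerically, we can reuse the hypothetical should_ask_feedback helper sketched earlier and check which 2021 rides would trigger a feedback request under each set of limits.
# Rides flagged by the stale 2020 limits: a 60-minute ride, which is now normal, still triggers feedback.
filter(t -> should_ask_feedback(t, average_2020 - 10, average_2020 + 10), travel_time_2021)   # [75, 60]
# Rides flagged by the refreshed 2021 limits: the suspiciously fast 35-minute ride is caught instead.
filter(t -> should_ask_feedback(t, average_2021 - 10, average_2021 + 10), travel_time_2021)   # [75, 35]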
This change in the data is called data drift. It means that your ML algorithm should be retrained periodically to mitigate these issues.
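One simple way to catch drift like this in code is to compare the statistics of the incoming data with the statistics the model was trained on. Below is a minimal sketch assuming a crude rule of thumb; the 20% threshold is an arbitrary choice for illustration, and in practice you would run a proper statistical test on the full ride data.
# Crude drift check: flag retraining when the new mean travel time has shifted
# by more than 20% relative to the mean the old limits were based on.
drift_ratio = abs(average_2021 - average_2020) / average_2020   # ≈ 0.26
needs_retraining = drift_ratio > 0.2                            # true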
Data science is not just about getting data and using it to train some algorithm; it's about understanding the data and extracting valuable information from it. Data drift is something a data scientist should monitor regularly and take corrective action on accordingly.
You can get the Jupyter notebook for this blog here https://gitlab.com/data-science-with-julia/code/-/blob/master/Data%20Drift.ipynb.