By Pablo C.
Introduction
Inspired by this Netflix post, I decided to write a post about anomaly detection using R.
There are several nice packages to achieve this goal; the one we're going to review is AnomalyDetection.
Download the full (and tiny) R code of this post here.
Normal Vs. Abnormal
An abnormal element, or outlier, is one that does not follow the behaviour of the majority.
Data has noise, much like a radio with a poor signal, where you end up hearing some background noise.
- The orange section could be noise in the data, since it oscillates around a value without showing a defined pattern; in other words, white noise.
- Are the red circles noise, or are they peaks from an underlying pattern?
A good algorithm can detect abnormal points while accounting for the inherent noise and setting it aside. The AnomalyDetectionTs function in the AnomalyDetection package can perform this task quite well.
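To make the noise-versus-peak distinction concrete, here is a minimal sketch in base R. It is not the algorithm used by AnomalyDetection; it swaps in a simple median/MAD rule on simulated data. The series, the spike positions and the threshold of 4 are all made up for illustration.

```r
set.seed(42)

# Simulated series: white noise oscillating around 100 (hypothetical data)
n <- 200
views <- 100 + rnorm(n, sd = 5)

# Inject three artificial peaks well above the noise level
spike_at <- c(50, 120, 180)
views[spike_at] <- views[spike_at] + 60

# Robust centre and spread, so the peaks themselves barely affect them
center <- median(views)
spread <- mad(views)

# Flag points far above the centre, measured in units of the spread
is_peak <- (views - center) / spread > 4
flagged <- which(is_peak)
print(flagged)
```

The noise stays close to the centre, so only the injected peaks should stand out. Real page views also carry trend and seasonality, which this simple rule ignores.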
Hands on anomaly detection!
In this example, the data comes from the well-known Wikipedia, which offers an API to download, from R, the daily page views for any {term + language}.
In this case, we've got page views for the term fifa, language en, from 2013-02-22 up to today.
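The download step might look like the sketch below. The post does not say which package it uses, so this assumes the wikipediatrend package and its wp_trend() function; fetch_page_views is a hypothetical helper name, and the column names of the result depend on the package version.

```r
# Hypothetical helper; assumes the wikipediatrend package is installed.
fetch_page_views <- function(term, lang, from) {
  if (!requireNamespace("wikipediatrend", quietly = TRUE)) {
    stop("This sketch assumes the wikipediatrend package is installed.")
  }
  # Daily page views for one term and language, from `from` up to today
  wikipediatrend::wp_trend(page = term, lang = lang, from = from, to = Sys.Date())
}

# Example call (needs network access, so it is left commented out):
# fifa_views <- fetch_page_views("fifa", lang = "en", from = "2013-02-22")
```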
After applying the algorithm, we can plot the original time series plus the abnormal points, that is, the dates on which the page views were over the expected value.
About the algorithm
The parameters of the algorithm are max_anoms=0.01 (to flag at most 1% of the data points as outliers in the final result) and direction="pos" to detect anomalies above (not below) the expected value.
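A call with these parameters could be sketched as follows. It assumes the AnomalyDetection package is installed (for example from GitHub with devtools::install_github("twitter/AnomalyDetection")). detect_positive_anomalies is a hypothetical wrapper name, and e_value and plot are extra arguments added here as assumptions so the result carries the expected values and a plot.

```r
# Hypothetical wrapper around AnomalyDetectionTs; assumes the
# AnomalyDetection package is installed.
detect_positive_anomalies <- function(page_views) {
  # page_views: data frame with timestamps in the first column
  # and the daily page-view counts in the second
  AnomalyDetection::AnomalyDetectionTs(
    page_views,
    max_anoms = 0.01,   # flag at most 1% of the observations
    direction = "pos",  # only points above the expected value
    e_value   = TRUE,   # also return the expected value per anomaly
    plot      = TRUE    # build the plot of series plus anomalies
  )
}

# Example use, given a suitable data frame `fifa_views`:
# res <- detect_positive_anomalies(fifa_views)
# res$anoms   # the detected anomalies
# res$plot    # the original series with anomalies marked
```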
As a result, 8 anomalous dates were detected. Additionally, the algorithm returns what the expected value would have been, and an extra calculation expresses the difference as a percentage, perc_diff.
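The exact formula behind perc_diff is not shown here, so the sketch below assumes it is the relative difference between the observed and the expected value. The numbers are made up for illustration and are not results from the fifa data; the column names anoms and expected_value are assumed to match the algorithm's output.

```r
# Made-up anomaly table standing in for the algorithm's output
anoms <- data.frame(
  timestamp      = as.Date(c("2020-01-01", "2020-01-02")),
  anoms          = c(300, 180),  # observed page views (hypothetical)
  expected_value = c(100, 120)   # expected page views (hypothetical)
)

# Assumed definition: how far above the expected value, in percent
anoms$perc_diff <- round(
  100 * (anoms$anoms - anoms$expected_value) / anoms$expected_value
)
print(anoms)
```

Under this definition, a value of 200 means the page views were three times the expected value on that date.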
Source:: r-bloggers.com