When data is the lifeblood of your business, outliers are the pulse. The unexpected deviations in your data can indicate business successes and thus new opportunities: did a promotional discount cause a spike in first-time customers? If so, can you build on that success and make them come back? Outliers can also signal serious problems: a sudden surge in online sales volume without a simultaneous jump in online revenue very often indicates a price glitch.
Not surprisingly, outlier detection is emerging as the next big thing in business intelligence (BI), especially in organizations that monitor millions of metrics in real time. Old-school analytics dashboards and static thresholds mire your team in alert storms and need constant readjusting.
The 3 flavors of outliers
Automated outlier detection systems avoid these problems by using machine learning to accurately detect three different types of outliers:
Global outliers are rare data values that differ drastically from the rest of the data they’re found in. You winning the lottery would be a global outlier, since the odds of a particular person winning the jackpot are often over a million to one.
Contextual outliers are data points that are considered outliers in the context in which they appear, even though the same values wouldn’t be considered outliers in other contexts within the same dataset. You buying a lottery ticket at 11:00 at night would be considered a contextual outlier if you routinely purchase a ticket every day on your way home from work much earlier in the evening.
Finally, there are collective outliers: groups of data points that, taken together as a subset, deviate significantly from the rest of the dataset, even if no single point looks unusual on its own. It’s probably easiest to think of collective outliers as statistically unlikely “clumps” of data points within a larger dataset. If the last three winners of the jackpot were neighbors of yours, that would be a collective outlier, since we’d statistically expect a wider geographic distribution of the winners.
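To make the three flavors concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the function names, the sample numbers, and the thresholds are all assumptions chosen for this example. It uses the modified z-score (built on the median and the median absolute deviation), a common robust test, with the conventional cutoff of 3.5.

```python
from statistics import median

def robust_z(values):
    """Modified z-scores based on the median and median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # no spread to measure against
    return [0.6745 * (v - med) / mad for v in values]

def global_outliers(values, threshold=3.5):
    """Indices of values that are extreme relative to the whole dataset."""
    return [i for i, z in enumerate(robust_z(values)) if abs(z) > threshold]

def contextual_outliers(points, threshold=3.5):
    """points is a list of (context, value) pairs.
    Returns indices of values that are extreme for their own context."""
    by_context = {}
    for i, (context, value) in enumerate(points):
        by_context.setdefault(context, []).append((i, value))
    flagged = []
    for members in by_context.values():
        values = [v for _, v in members]
        flagged.extend(members[j][0] for j in global_outliers(values, threshold))
    return sorted(flagged)

def collective_outliers(values, mild=1.5, min_run=3):
    """(start, end) index pairs for runs of at least min_run consecutive
    points that are all mildly high, or all mildly low."""
    signs = [(z > mild) - (z < -mild) for z in robust_z(values)]  # +1, -1 or 0
    runs, start = [], 0
    for i in range(1, len(signs) + 1):
        if i == len(signs) or signs[i] != signs[start]:
            if signs[start] != 0 and i - start >= min_run:
                runs.append((start, i - 1))
            start = i
    return runs

# Hypothetical hourly order counts.
print(global_outliers([10, 12, 11, 9, 10, 11, 250, 10, 12]))  # [6]

# 100 orders is normal by day but unusual at night.
orders = [("day", 100), ("day", 104), ("day", 98), ("day", 101), ("day", 97),
          ("night", 10), ("night", 12), ("night", 9), ("night", 11), ("night", 100)]
print(contextual_outliers(orders))  # [9]

# No single value is extreme, but four mildly high values in a row form a clump.
clump = [10, 11, 9, 10, 12, 14, 14, 15, 14, 10, 9, 11, 10, 11, 10]
print(collective_outliers(clump))  # [(5, 8)]
```

Note that the night-time value of 100 is not flagged when the same numbers are checked without their context, which is exactly what makes it a contextual outlier rather than a global one.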
Human beings are naturally good at spotting outliers, provided we’re given enough data to form a clear mental model of what’s normal for a dataset. Graphs and other ways of visualizing data make this even easier for us. If we dig deep into a plot of time series data, we could probably spot all three types, even if it took us a while. It’s impractical, however, for businesses to use manual outlier detection for hundreds, thousands or millions of metrics.
Outlier detection and matters of the heart
It turns out that neural networks are also good at detecting outliers and, in some areas, are already better than humans in terms of recall (what fraction of the real outliers are identified) and precision (what fraction of the data points identified as outliers actually were outliers).
One example is the recently developed 34-layer convolutional neural network created by Stanford researchers, which outperforms board-certified cardiologists at heart arrhythmia detection. This neural net was trained on a massive dataset of ECG recordings taken from people who wore a specific wearable heart monitor. This is one context where abstract statistical terms like precision and recall can mean the difference between life and death, especially for those who may not have access to a cardiologist.
Part of the team’s success was due to the fact that they had a dataset of ECG recordings far larger than that of anyone else who has attempted to use computers to accurately detect heart arrhythmia. They had this huge dataset because the research team partnered with the vendor of the wearable heart monitor, who was able to collect real data from its customers. Machine learning needs lots of data to learn from, just like we all needed to spend our first few years of life listening to and babbling at adults before we could speak in complete sentences.
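Precision and recall are simple ratios once you know which points were flagged and which were truly outliers. The short Python sketch below shows the arithmetic; the function name and the sample indices are made up for illustration, and returning 0.0 when nothing was flagged (or nothing was truly an outlier) is just one convention among several.

```python
def precision_recall(flagged, actual):
    """Precision and recall for a set of flagged points against the true outliers."""
    flagged, actual = set(flagged), set(actual)
    true_positives = len(flagged & actual)
    # Precision: of everything we flagged, how much was really an outlier?
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: of all the real outliers, how many did we catch?
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: four points flagged, three real outliers, two in common.
precision, recall = precision_recall(flagged=[3, 7, 12, 20], actual=[3, 7, 15])
print(precision)  # 0.5, half of the alerts were false positives
print(recall)     # about 0.67, one real outlier slipped through
```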
Just like the heart arrhythmia example, real-time outlier detection vendor Anodot uses lots of data and neural nets to achieve high precision, recall and conciseness in real time at the scale of millions of metrics. It does so by tailoring the specific statistical outlier tests to the distribution actually seen in the data, and by learning the “normal” behavior of a metric over time. Any data point which the statistical test identifies as outside of the normal range is labeled an outlier. This robust approach is able to identify all three types of outliers in any time series data.
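The general idea of learning a metric’s “normal” range and flagging whatever falls outside it can be sketched in a few lines of Python. To be clear, this is not Anodot’s algorithm: it is a deliberately simple stand-in that assumes a rolling mean plus or minus k standard deviations is a reasonable band, which only suits roughly bell-shaped data without seasonality. The function name, window size, and sample series are assumptions for illustration.

```python
from collections import deque
from statistics import mean, stdev

def rolling_outliers(series, window=24, k=3.0):
    """Indices of points outside mean +/- k * stdev of the previous `window` normal points."""
    history = deque(maxlen=window)  # the learned "normal" baseline
    flagged = []
    for i, value in enumerate(series):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if abs(value - mu) > k * sigma:
                flagged.append(i)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return flagged

# Hypothetical metric with one spike at index 8.
series = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101, 99]
print(rolling_outliers(series, window=5))  # [8]
```

Because the baseline keeps moving with the data, the band adapts as the metric drifts, which is the main advantage over a static threshold. A production system would also need to handle seasonality, trends, and non-normal distributions, which this sketch ignores.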
When it comes to business metrics, precision means catching only the legitimate and significant outliers and not flooding analysts with false positives. Recall means not letting any legitimate outliers slip through, like a price glitch which causes you to lose hundreds of dollars of revenue on every sale.
And hemorrhaging revenue is the last thing your company needs. Machine learning can pump new insights into your business, bulking up your bottom line.