Datasets may contain unexpected values, patterns, or outliers. One must balance strict models, which raise more false positives (false alarms), against lenient models, which allow more false negatives (missed anomalies). There are three primary types of anomalies (a short synthetic sketch follows the list below):

  1. Point anomalies: an individual datum that is anomalous (an outlier) w.r.t. the rest of the data. Ex: a single database query with unusually high latency
  2. Collective anomalies: a subset of data points that is anomalous as a group w.r.t. the rest of the data. Ex: the same average temperature for many days in a row; an irregular heartbeat
  3. Contextual anomalies: data that is anomalous in a given context but not otherwise. Ex: a spike in network traffic at an odd time of day
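
As a rough illustration, a hypothetical synthetic signal (values invented for illustration only) can contain all three types at once:
import numpy as np

rng = np.random.default_rng(0)
# simulated hourly temperatures over four days: a daily cycle plus noise
temps = 20 + 5*np.sin(np.linspace(0, 8*np.pi, 96)) + rng.normal(0, 0.5, 96)

temps[10] = 45.0       # point anomaly: one extreme reading
temps[40:52] = 20.0    # collective anomaly: an unnaturally constant stretch
temps[75] += 15.0      # contextual anomaly: a midday-sized spike at 3 a.m.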

The central idea in detecting such outliers is to identify data points that deviate from expected statistics:

  • Percentiles: identify data that falls outside a specified percentile range:
import numpy as np
percentile_threshold = (1, 99)

percentiles = np.percentile(data, percentile_threshold)
anomalies = np.where((data < percentiles[0]) | (data > percentiles[1]))  # element-wise |, not `or`
  • Interquartile range (IQR): identify data that fall outside the first-to-third-quartile range (25-75%), widened by a multiple of the IQR
import numpy as np
q1, q3 = np.percentile(data, (25, 75))
iqr = q3 - q1

thresh = 1.5  # conventional multiplier; larger values flag fewer points
iqr_threshold = (q1 - thresh*iqr, q3 + thresh*iqr)
anomalies = np.where((data < iqr_threshold[0]) | (data > iqr_threshold[1]))

The above methods make a strong assumption of stationarity, namely that the data always comes from some fixed distribution whose tails behave consistently. A more robust way to detect anomalies is to group data points based on the similarity of their features, which implicitly estimates the current distribution of the data:

  • Clustering: group data based on the Euclidean (or other) distance between feature vectors (K-means) or on a measure of local density w.r.t. a distance threshold (DBSCAN):
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

kmeans = KMeans(n_clusters=k).fit(data)  # k: chosen number of clusters
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(data)

# distance from each point to its nearest centroid; flag points beyond the 95th percentile
distances = kmeans.transform(data)
nearest_distances = np.min(distances, axis=1)

thresh = 95
threshold = np.percentile(nearest_distances, thresh)

kmeans_anomalies = np.where(nearest_distances > threshold)
dbscan_anomalies = np.where(dbscan.labels_ == -1)

The problem with the above methods is that they require online updates of the densities and/or clusters, which can be computationally expensive when data streams in at thousands of points per second or minute. An alternative that generalizes better is to measure each point's distance from a prototypical example by constructing a decision region whose boundary separates normal from anomalous data. This is based on the idea of confidence or margin:

  • One class SVM (OCSVM): learns a boundary around the normal (majority) class in feature space by separating the data from the origin with maximum margin; since it models only the normal class, it is robust to class imbalance. The key hyperparameter, nu, is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
import numpy as np
from sklearn.svm import OneClassSVM

svm = OneClassSVM(gamma='auto').fit(data)
scores = svm.score_samples(data)  # higher score = more typical of the training data

thresh = 5  # e.g. flag the lowest 5% of scores as anomalies
threshold = np.percentile(scores, thresh)
anomalies = np.where(scores < threshold)
  • Autoencoders: compute confidence scores / probability of anomaly from a learned latent representation. The hope is that the latent representation has a stationary distribution, obtained by learning a conditional-independence structure w.r.t. the non-stationary features of the data.
import torch
import torch.nn as nn

class AnomalyDetector(nn.Module):
	def __init__(self, in_dim, latent_dim):
		super().__init__()
		self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
		self.decoder = nn.Sequential(nn.Linear(latent_dim, in_dim), nn.ReLU())
		self.confidence_layer = nn.Sequential(nn.Linear(latent_dim, 2), nn.Softmax(dim=1))

	def forward(self, x):
		z = self.encoder(x)
		x_hat = self.decoder(z)
		confidence = self.confidence_layer(z)
		return x_hat, z, confidence

The key challenge in the representation-learning approach is the training procedure: how should the confidence_layer parameters be updated?
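
One possible scheme (a sketch only, not a prescribed method) is to train the encoder/decoder with a reconstruction loss and derive pseudo-labels for the confidence layer from the reconstruction error; the 5% cutoff, train_loader, in_dim, and latent_dim below are hypothetical placeholders:
import torch
import torch.nn as nn

model = AnomalyDetector(in_dim, latent_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
recon_loss_fn = nn.MSELoss()
conf_loss_fn = nn.NLLLoss()  # confidence_layer already outputs softmax probabilities

for x in train_loader:  # hypothetical DataLoader yielding feature batches
	x_hat, z, confidence = model(x)
	recon_loss = recon_loss_fn(x_hat, x)

	# pseudo-labels: mark the worst-reconstructed 5% of the batch as anomalous (class 1)
	per_sample_error = ((x_hat - x) ** 2).mean(dim=1)
	cutoff = torch.quantile(per_sample_error, 0.95)
	pseudo_labels = (per_sample_error > cutoff).long()

	conf_loss = conf_loss_fn(torch.log(confidence + 1e-8), pseudo_labels)
	loss = recon_loss + conf_loss

	optimizer.zero_grad()
	loss.backward()
	optimizer.step()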