Introduction: Understanding Anomaly Detection
Anomaly detection is the identification of rare items, events, or observations that significantly deviate from the expected pattern in a dataset. These anomalies might indicate critical incidents such as bank fraud, medical problems, structural defects, or errors in text. Effective anomaly detection is crucial across industries including cybersecurity, manufacturing, healthcare, finance, and IoT, where detecting unusual patterns can prevent losses, improve quality control, and potentially save lives.
Core Concepts & Principles
Types of Anomalies
- Point Anomalies: Individual data points that deviate significantly from the norm (e.g., a single $50,000 transaction on an account that averages $100)
- Contextual Anomalies: Data points that are anomalous only in a specific context (e.g., 30°C is normal in summer but anomalous in winter)
- Collective Anomalies: Collections of related data points that are anomalous as a group, even when each point looks normal on its own (e.g., a sustained burst of otherwise ordinary network requests)
Detection Approaches
- Supervised: Uses labeled data (normal and anomalous)
- Semi-supervised: Trained on normal data only
- Unsupervised: Requires no labels; identifies anomalies directly from the structure of unlabeled data
Key Metrics
- True Positives (TP): Correctly identified anomalies
- False Positives (FP): Normal data incorrectly flagged as anomalies
- True Negatives (TN): Correctly identified normal data
- False Negatives (FN): Missed anomalies
- Precision: TP/(TP+FP) – Proportion of flagged points that are actual anomalies
- Recall: TP/(TP+FN) – Proportion of actual anomalies detected
- F1 Score: 2×(Precision×Recall)/(Precision+Recall) – Balanced measure
- AUC-ROC: Area under receiver operating characteristic curve
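To make the relationships among these metrics concrete, here is a minimal sketch computing precision, recall, and F1 from hypothetical binary label arrays (`y_true`, `y_pred`, where 1 marks an anomaly):

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # correctly flagged anomalies
    fp = np.sum((y_pred == 1) & (y_true == 0))  # normal points flagged by mistake
    fn = np.sum((y_pred == 0) & (y_true == 1))  # anomalies that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```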
Anomaly Detection Methods
Statistical Methods
Z-Score: Identifies points that deviate from the mean by more than a specified number of standard deviations
```python
import numpy as np

def z_score(data, threshold=3):
    mean = np.mean(data)
    std = np.std(data)
    z_scores = [(y - mean) / std for y in data]
    return [i for i, z in enumerate(z_scores) if abs(z) > threshold]
```
Modified Z-Score: Uses median and MAD (Median Absolute Deviation) for robustness
```python
import numpy as np

def modified_z_score(data, threshold=3.5):
    median = np.median(data)
    mad = np.median([abs(x - median) for x in data])  # median absolute deviation
    # 0.6745 scales the MAD to be consistent with the standard deviation
    modified_z = [0.6745 * (y - median) / mad for y in data]
    return [i for i, z in enumerate(modified_z) if abs(z) > threshold]
```
GESD (Generalized ESD): Detects multiple outliers in univariate data
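GESD is the one statistical method above without a snippet; here is a minimal sketch following the standard formulation (critical values from the t-distribution), where `max_outliers` and `alpha` are assumed parameter names:

```python
import numpy as np
from scipy import stats

def gesd(data, max_outliers=10, alpha=0.05):
    # max_outliers should be well below len(data)
    x = np.asarray(data, dtype=float)
    n = len(x)
    remaining = list(range(n))
    removed, num_outliers = [], 0
    for i in range(1, max_outliers + 1):
        vals = x[remaining]
        # test statistic: largest absolute deviation from the mean, in std units
        dev = np.abs(vals - vals.mean())
        idx = int(np.argmax(dev))
        r_i = dev[idx] / vals.std(ddof=1)
        # critical value lambda_i from the t-distribution
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        removed.append(remaining.pop(idx))
        if r_i > lam:
            num_outliers = i  # outlier count = largest i with R_i > lambda_i
    return removed[:num_outliers]
```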
Tukey’s Method: Uses quartiles to identify outliers (IQR method)
```python
import numpy as np

def tukey_method(data, k=1.5):
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [i for i, x in enumerate(data) if x < lower_bound or x > upper_bound]
```
Proximity-Based Methods
K-Nearest Neighbors (KNN): Anomalous points lie farther from their nearest neighbors than normal points do
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_detection(data, n_neighbors=5):
    # note: each point counts itself as its first neighbor (distance 0)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
    distances, _ = nbrs.kneighbors(data)
    avg_distances = distances.mean(axis=1)
    threshold = np.percentile(avg_distances, 95)  # top 5% as anomalies
    return [i for i, dist in enumerate(avg_distances) if dist > threshold]
```
Local Outlier Factor (LOF): Compares the local density of a point with that of its neighbors
```python
from sklearn.neighbors import LocalOutlierFactor

def lof_anomaly_detection(data, n_neighbors=20):
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    predictions = lof.fit_predict(data)  # -1 marks outliers
    return [i for i, pred in enumerate(predictions) if pred == -1]
```
DBSCAN: Density-based clustering; points not in any cluster are anomalies
```python
from sklearn.cluster import DBSCAN

def dbscan_anomaly_detection(data, eps=0.5, min_samples=5):
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    predictions = dbscan.fit_predict(data)  # noise points are labeled -1
    return [i for i, pred in enumerate(predictions) if pred == -1]
```
Clustering-Based Methods
K-Means: Points far from cluster centroids are anomalies
```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def kmeans_anomaly_detection(data, n_clusters=5, threshold=2.0):
    kmeans = KMeans(n_clusters=n_clusters).fit(data)
    # distance from each point to its nearest centroid
    distances = np.min(cdist(data, kmeans.cluster_centers_), axis=1)
    threshold_value = np.mean(distances) + threshold * np.std(distances)
    return [i for i, dist in enumerate(distances) if dist > threshold_value]
```
Gaussian Mixture Models (GMM): Points with low likelihood under the fitted mixture are anomalies
```python
from sklearn.mixture import GaussianMixture

def gmm_anomaly_detection(data, n_components=5, threshold=-10):
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(data)
    log_probs = gmm.score_samples(data)
    return [i for i, score in enumerate(log_probs) if score < threshold]
```
Classification-Based Methods
One-Class SVM: Learns boundary around normal data
```python
from sklearn.svm import OneClassSVM

def one_class_svm(data, nu=0.1):
    model = OneClassSVM(nu=nu, kernel="rbf")
    model.fit(data)
    predictions = model.predict(data)
    return [i for i, pred in enumerate(predictions) if pred == -1]
```
Isolation Forest: Isolates anomalies through random partitioning
```python
from sklearn.ensemble import IsolationForest

def isolation_forest(data, contamination=0.1):
    model = IsolationForest(contamination=contamination)
    model.fit(data)
    predictions = model.predict(data)
    return [i for i, pred in enumerate(predictions) if pred == -1]
```
Deep Learning Methods
Autoencoders: High reconstruction error indicates anomalies
```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

def autoencoder_anomaly_detection(data, threshold=None):
    input_dim = data.shape[1]
    input_layer = Input(shape=(input_dim,))
    encoded = Dense(int(input_dim / 2), activation='relu')(input_layer)
    decoded = Dense(input_dim, activation='linear')(encoded)
    autoencoder = Model(inputs=input_layer, outputs=decoded)
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit(data, data, epochs=50, batch_size=32, verbose=0)
    reconstructions = autoencoder.predict(data)
    mse = np.mean(np.power(data - reconstructions, 2), axis=1)
    if threshold is None:
        threshold = np.percentile(mse, 95)  # top 5% as anomalies
    return [i for i, error in enumerate(mse) if error > threshold]
```
LSTM Autoencoders: For sequential/time-series anomaly detection
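No snippet is given for this variant; the following is a minimal Keras sketch under assumed architecture choices (layer width, epochs), flagging sequences whose reconstruction error is large:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

def lstm_autoencoder_anomaly_detection(sequences, threshold=None):
    # sequences: array of shape (n_samples, timesteps, n_features)
    timesteps, n_features = sequences.shape[1], sequences.shape[2]
    model = Sequential([
        LSTM(32, activation='tanh', input_shape=(timesteps, n_features)),
        RepeatVector(timesteps),              # repeat the encoding for the decoder
        LSTM(32, activation='tanh', return_sequences=True),
        TimeDistributed(Dense(n_features)),   # reconstruct each timestep
    ])
    model.compile(optimizer='adam', loss='mse')
    model.fit(sequences, sequences, epochs=30, batch_size=32, verbose=0)
    reconstructions = model.predict(sequences)
    mse = np.mean((sequences - reconstructions) ** 2, axis=(1, 2))
    if threshold is None:
        threshold = np.percentile(mse, 95)  # top 5% as anomalies
    return [i for i, err in enumerate(mse) if err > threshold]
```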
Variational Autoencoders (VAEs): Probabilistic version of autoencoders
Generative Adversarial Networks (GANs): Discriminator identifies anomalies
Time Series Methods
- ARIMA: Flags observations outside prediction intervals (sketched after this list)
- Exponential Smoothing: Detects deviations from smoothed values
- Prophet: Automated time series decomposition and forecasting
- Change Point Detection: Identifies shifts in time series distributions
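For the ARIMA entry, a minimal sketch using statsmodels (the order and alpha values are assumed defaults; `series` is a 1-D numpy array), flagging points that fall outside the in-sample prediction interval:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_anomaly_detection(series, order=(1, 1, 1), alpha=0.05):
    res = ARIMA(series, order=order).fit()
    pred = res.get_prediction(start=0, end=len(series) - 1)
    ci = np.asarray(pred.conf_int(alpha=alpha))  # columns: lower, upper bound
    # early in-sample intervals are very wide, so the first few points rarely flag
    return [i for i, x in enumerate(series) if x < ci[i, 0] or x > ci[i, 1]]
```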
Method Selection Guide
Method | Data Type | Labeled Data Needed | Scalability | Handles High Dimensions | Good For |
---|---|---|---|---|---|
Z-Score | Univariate | No | High | N/A | Simple, quick detection |
Tukey’s Method | Univariate | No | High | N/A | Robust to non-normal data |
KNN | Multivariate | No | Low | No | Small datasets, clear proximity patterns |
LOF | Multivariate | No | Medium | No | Local density variations |
DBSCAN | Multivariate | No | Medium | No | Varied density clusters |
K-Means | Multivariate | No | High | Somewhat | Well-separated data |
One-Class SVM | Multivariate | Semi | Medium | Somewhat | Complex decision boundaries |
Isolation Forest | Multivariate | No | High | Yes | High-dimensional data |
Autoencoders | Multivariate | Semi | Medium | Yes | Complex, high-dimensional patterns |
LSTM-AE | Sequential | Semi | Medium | Yes | Time series, sequence data |
ARIMA | Time Series | No | Medium | No | Structured time series |
Anomaly Detection Workflow
Problem Definition
- Define what constitutes an anomaly in your domain
- Determine detection goals (prevention, investigation, alerting)
Data Preparation
- Feature selection/engineering
- Handling missing values and outliers
- Normalization/scaling
- Time alignment (for time series)
Method Selection
- Based on data type, availability of labels, dimensionality
- Consider computational constraints
- Select appropriate methods from the selection guide
Parameter Tuning
- Set thresholds for anomaly scores
- Optimize model parameters (grid search, cross-validation)
- Balance precision and recall based on business needs
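When a labeled validation set is available, one way to balance precision and recall is to sweep candidate thresholds and keep the one that maximizes F1; a minimal sketch with scikit-learn (hypothetical `y_true` labels and anomaly `scores`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    # precision/recall are computed at every candidate threshold
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    # the last precision/recall pair has no associated threshold, hence f1[:-1]
    return thresholds[np.argmax(f1[:-1])]
```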
Evaluation
- Use metrics: precision, recall, F1-score, AUC-ROC
- Assess computational efficiency
- Validate with domain experts
Deployment
- Implement detection in production systems
- Set up monitoring and alerting
- Establish feedback loop for continuous improvement
Common Challenges & Solutions
Challenge | Solution |
---|---|
High false positive rate | Adjust threshold, combine multiple methods, incorporate context |
Lack of labeled data | Use unsupervised or semi-supervised methods, active learning |
High dimensionality | Feature selection, dimensionality reduction (PCA, t-SNE), use forest-based methods |
Class imbalance | Synthetic minority oversampling (SMOTE, sketched below), cost-sensitive learning |
Concept drift | Online learning, periodic retraining, drift detection methods |
Seasonal/cyclical patterns | Decomposition, specialized time series methods |
Multimodal data | Ensemble methods, model stacking, specialized models per data type |
Interpretability needs | Rule-based methods, feature importance analysis, LIME/SHAP explanations |
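For the class-imbalance row, a minimal sketch assuming the imbalanced-learn package and a labeled dataset (`X` and `y` are hypothetical; 1 marks an anomaly):

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def train_with_smote(X, y):
    # oversample the minority (anomaly) class before supervised training
    X_res, y_res = SMOTE().fit_resample(X, y)
    clf = RandomForestClassifier()
    clf.fit(X_res, y_res)
    return clf
```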
Best Practices & Tips
Data Preparation
- Normalize features to prevent scale biases
- Handle seasonality in time series before detection (see the STL sketch after this list)
- Remove known outliers from training data in semi-supervised approaches
- Create domain-specific features that might highlight anomalies
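For the seasonality point, a minimal sketch using STL decomposition from statsmodels (the `period` and `threshold` values are assumptions), flagging large residuals once trend and seasonality are removed:

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_residual_anomalies(series, period=7, threshold=3.0):
    # decompose into trend + seasonal + residual, then z-score the residual
    result = STL(series, period=period).fit()
    resid = result.resid
    z = (resid - np.mean(resid)) / np.std(resid)
    return [i for i, v in enumerate(z) if abs(v) > threshold]
```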
Method Selection
- Start simple: try statistical methods before complex ML approaches
- Ensemble multiple methods for more robust detection (a minimal voting sketch follows this list)
- Consider domain requirements for real-time vs. batch processing
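As a sketch of the ensembling tip, a simple majority vote over the index lists returned by several of the detectors defined earlier (the `min_votes` cutoff is an assumption):

```python
import numpy as np

def ensemble_vote(index_lists, n_points, min_votes=2):
    # index_lists: anomaly indices returned by each detector
    votes = np.zeros(n_points, dtype=int)
    for idx in index_lists:
        votes[list(idx)] += 1
    return [i for i, v in enumerate(votes) if v >= min_votes]

# e.g., combine the detectors defined above:
# anomalies = ensemble_vote([z_score(x), tukey_method(x), isolation_forest(X)], len(x))
```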
Threshold Setting
- Set thresholds based on business impact, not statistical significance alone
- Consider adaptive thresholds that evolve with the data (sketched after this list)
- Use separate thresholds for different segments/contexts
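A minimal sketch of an adaptive threshold for the second tip: compare each new score against a rolling quantile of recent history (the window size and quantile are assumptions):

```python
import numpy as np

def adaptive_threshold_flags(scores, window=100, quantile=0.99):
    # flag a score when it exceeds the rolling quantile of the preceding window
    flags = []
    for i, s in enumerate(scores):
        history = scores[max(0, i - window):i]
        if len(history) >= 10 and s > np.quantile(history, quantile):
            flags.append(i)
    return flags
```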
Evaluation
- Focus on recall for critical applications (fraud, security)
- Prioritize precision when false alarms are costly
- Always validate with domain experts
Production Implementation
- Implement gradual rollout with human oversight
- Set up monitoring for the detector itself
- Create feedback mechanisms to capture false positives/negatives
Resources for Further Learning
Books
- “Outlier Analysis” by Charu Aggarwal
- “Anomaly Detection Principles and Algorithms” by Mehrotra et al.
- “Python Data Science Handbook” by Jake VanderPlas
Online Courses
- Coursera: “Anomaly Detection in Time Series Data with Keras”
- edX: “Data Science: Machine Learning”
- Udemy: “Machine Learning A-Z™: Hands-On Python & R”
Libraries & Tools
- Python: Scikit-learn, PyOD, ADTK (Anomaly Detection Toolkit)
- R: anomalize, AnomalyDetection, outliers
- Commercial: AWS SageMaker, Datadog, Anodot, Microsoft Azure Anomaly Detector
Research Papers
- “A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms” by Goldstein & Uchida
- “LOF: Identifying Density-Based Local Outliers” by Breunig et al.
- “Isolation Forest” by Liu, Ting & Zhou
Communities & Forums
- Kaggle Competitions (search for anomaly detection)
- Stack Overflow tags: anomaly-detection, outlier-detection
- KDnuggets articles on anomaly detection