Comprehensive Anomaly Detection Cheatsheet

Introduction: Understanding Anomaly Detection

Anomaly detection is the identification of rare items, events, or observations that significantly deviate from the expected pattern in a dataset. These anomalies might indicate critical incidents such as bank fraud, medical problems, structural defects, or errors in text. Effective anomaly detection is crucial across industries including cybersecurity, manufacturing, healthcare, finance, and IoT, where detecting unusual patterns can prevent losses, improve quality control, and potentially save lives.

Core Concepts & Principles

Types of Anomalies

  • Point Anomalies: Individual data points that deviate significantly from the norm
  • Contextual Anomalies: Data points that are anomalous in a specific context
  • Collective Anomalies: Collections of related data points that are anomalous as a group

Detection Approaches

  • Supervised: Uses labeled data (normal and anomalous)
  • Semi-supervised: Trained on normal data only
  • Unsupervised: Requires no labels; identifies anomalies from the structure of the data itself

Key Metrics

  • True Positives (TP): Correctly identified anomalies
  • False Positives (FP): Normal data incorrectly flagged as anomalies
  • True Negatives (TN): Correctly identified normal data
  • False Negatives (FN): Missed anomalies
  • Precision: TP/(TP+FP) – Proportion of flagged points that are actual anomalies
  • Recall: TP/(TP+FN) – Proportion of actual anomalies that are detected
  • F1 Score: 2×(Precision×Recall)/(Precision+Recall) – Balanced measure
  • AUC-ROC: Area under receiver operating characteristic curve
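
A minimal sketch of computing these metrics with scikit-learn. The arrays y_true, y_pred, and scores are illustrative placeholders (1 = anomaly, 0 = normal), not from the original cheatsheet:

    import numpy as np
    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])                  # ground-truth labels
    y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 1])                  # detector output
    scores = np.array([0.1, 0.7, 0.9, 0.2, 0.4, 0.1, 0.3, 0.8])  # continuous anomaly scores

    print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
    print("F1 score: ", f1_score(y_true, y_pred))
    print("AUC-ROC:  ", roc_auc_score(y_true, scores))    # uses scores, not hard labels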

Anomaly Detection Methods

Statistical Methods

  • Z-Score: Identifies points that deviate from mean by more than a specified number of standard deviations

    import numpy as np  # numpy is assumed available in the snippets throughout this cheatsheet

    def z_score(data, threshold=3):
        # Flag points more than `threshold` standard deviations from the mean
        mean = np.mean(data)
        std = np.std(data)
        z_scores = [(y - mean) / std for y in data]
        return [i for i, z in enumerate(z_scores) if abs(z) > threshold]
    
  • Modified Z-Score: Uses median and MAD (Median Absolute Deviation) for robustness

    def modified_z_score(data, threshold=3.5):
        # Median and MAD are robust to the very outliers we are trying to find;
        # 0.6745 scales the MAD to be comparable to a standard deviation
        median = np.median(data)
        mad = np.median([abs(x - median) for x in data])
        modified_z = [0.6745 * (y - median) / mad for y in data]
        return [i for i, z in enumerate(modified_z) if abs(z) > threshold]
    
  • GESD (Generalized ESD): Detects multiple outliers in univariate data
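
    A minimal sketch of the generalized ESD test (Rosner, 1983), assuming roughly normal
    data; the function name and the max_outliers upper bound are illustrative choices,
    not part of the original cheatsheet:

    import numpy as np
    from scipy import stats

    def gesd(data, max_outliers=10, alpha=0.05):
        # Test up to max_outliers candidate outliers, most extreme first
        x = np.asarray(data, dtype=float)
        n = len(x)
        idx = np.arange(n)
        candidates, exceeds = [], []
        for i in range(1, max_outliers + 1):
            mean, std = x.mean(), x.std(ddof=1)
            dev = np.abs(x - mean)
            j = dev.argmax()
            R = dev[j] / std                       # test statistic R_i
            p = 1 - alpha / (2 * (n - i + 1))
            t = stats.t.ppf(p, n - i - 1)
            lam = ((n - i) * t) / np.sqrt((n - i - 1 + t**2) * (n - i + 1))  # critical value
            exceeds.append(R > lam)
            candidates.append(idx[j])
            x, idx = np.delete(x, j), np.delete(idx, j)
        # number of outliers = largest i for which R_i exceeds its critical value
        n_out = max((i + 1 for i, flag in enumerate(exceeds) if flag), default=0)
        return sorted(candidates[:n_out])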

  • Tukey’s Method: Uses quartiles to identify outliers (IQR method)

    def tukey_method(data, k=1.5):
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        iqr = q3 - q1
        lower_bound = q1 - k * iqr
        upper_bound = q3 + k * iqr
        return [i for i, x in enumerate(data) if x < lower_bound or x > upper_bound]
    

Proximity-Based Methods

  • K-Nearest Neighbors (KNN): Anomalous points have greater distances to neighbors

    from sklearn.neighbors import NearestNeighbors
    
    def knn_anomaly_detection(data, n_neighbors=5):
        nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
        distances, _ = nbrs.kneighbors(data)
        avg_distances = distances.mean(axis=1)
        threshold = np.percentile(avg_distances, 95)  # Top 5% as anomalies
        return [i for i, dist in enumerate(avg_distances) if dist > threshold]
    
  • Local Outlier Factor (LOF): Compares local density of point with neighbors

    from sklearn.neighbors import LocalOutlierFactor
    
    def lof_anomaly_detection(data, n_neighbors=20):
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        predictions = lof.fit_predict(data)
        return [i for i, pred in enumerate(predictions) if pred == -1]
    
  • DBSCAN: Density-based clustering; points not in any cluster are anomalies

    from sklearn.cluster import DBSCAN
    
    def dbscan_anomaly_detection(data, eps=0.5, min_samples=5):
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        predictions = dbscan.fit_predict(data)
        return [i for i, pred in enumerate(predictions) if pred == -1]
    

Clustering-Based Methods

  • K-Means: Points far from cluster centroids are anomalies

    from sklearn.cluster import KMeans
    from scipy.spatial.distance import cdist
    
    def kmeans_anomaly_detection(data, n_clusters=5, threshold=2.0):
        kmeans = KMeans(n_clusters=n_clusters).fit(data)
        distances = np.min(cdist(data, kmeans.cluster_centers_), axis=1)
        threshold_value = np.mean(distances) + threshold * np.std(distances)
        return [i for i, dist in enumerate(distances) if dist > threshold_value]
    
  • Gaussian Mixture Models (GMM): Low likelihood points are anomalies

    from sklearn.mixture import GaussianMixture
    
    def gmm_anomaly_detection(data, n_components=5, threshold=-10):
        # Points with log-likelihood below `threshold` are flagged; in practice a low
        # percentile of log_probs is often a more robust cutoff than a fixed value
        gmm = GaussianMixture(n_components=n_components)
        gmm.fit(data)
        log_probs = gmm.score_samples(data)
        return [i for i, score in enumerate(log_probs) if score < threshold]
    

Classification-Based Methods

  • One-Class SVM: Learns boundary around normal data

    from sklearn.svm import OneClassSVM
    
    def one_class_svm(data, nu=0.1):
        model = OneClassSVM(nu=nu, kernel="rbf")
        model.fit(data)
        predictions = model.predict(data)
        return [i for i, pred in enumerate(predictions) if pred == -1]
    
  • Isolation Forest: Isolates anomalies through random partitioning

    from sklearn.ensemble import IsolationForest
    
    def isolation_forest(data, contamination=0.1):
        model = IsolationForest(contamination=contamination)
        model.fit(data)
        predictions = model.predict(data)
        return [i for i, pred in enumerate(predictions) if pred == -1]
    

Deep Learning Methods

  • Autoencoders: High reconstruction error indicates anomalies

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Input, Dense
    
    def autoencoder_anomaly_detection(data, threshold=None):
        input_dim = data.shape[1]
        input_layer = Input(shape=(input_dim,))
        encoded = Dense(int(input_dim/2), activation='relu')(input_layer)
        decoded = Dense(input_dim, activation='linear')(encoded)
        
        autoencoder = Model(inputs=input_layer, outputs=decoded)
        autoencoder.compile(optimizer='adam', loss='mse')
        autoencoder.fit(data, data, epochs=50, batch_size=32, verbose=0)
        
        reconstructions = autoencoder.predict(data)
        mse = np.mean(np.power(data - reconstructions, 2), axis=1)
        
        if threshold is None:
            threshold = np.percentile(mse, 95)  # Top 5% as anomalies
            
        return [i for i, error in enumerate(mse) if error > threshold]
    
  • LSTM Autoencoders: For sequential/time-series anomaly detection
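
    A minimal Keras sketch of an LSTM autoencoder, assuming `windows` is a 3-D array of
    shape (samples, timesteps, features) built by sliding a window over the series; the
    function name, layer sizes, and 95th-percentile cutoff are illustrative assumptions:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

    def lstm_autoencoder_anomaly_detection(windows, epochs=30):
        timesteps, n_features = windows.shape[1], windows.shape[2]
        model = Sequential([
            LSTM(32, activation='tanh', input_shape=(timesteps, n_features)),
            RepeatVector(timesteps),                    # repeat encoding for the decoder
            LSTM(32, activation='tanh', return_sequences=True),
            TimeDistributed(Dense(n_features)),         # reconstruct each timestep
        ])
        model.compile(optimizer='adam', loss='mse')
        model.fit(windows, windows, epochs=epochs, batch_size=32, verbose=0)

        recon = model.predict(windows)
        errors = np.mean((windows - recon) ** 2, axis=(1, 2))   # per-window reconstruction MSE
        threshold = np.percentile(errors, 95)                    # top 5% as anomalies
        return [i for i, e in enumerate(errors) if e > threshold]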

  • Variational Autoencoders (VAEs): Probabilistic version of autoencoders

  • Generative Adversarial Networks (GANs): Discriminator identifies anomalies

Time Series Methods

  • ARIMA: Flags observations outside prediction intervals (see the sketch after this list)
  • Exponential Smoothing: Detects deviations from smoothed values
  • Prophet: Automated time series decomposition and forecasting
  • Change Point Detection: Identifies shifts in time series distributions
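
A minimal sketch of the ARIMA approach above using statsmodels: fit a model, take in-sample prediction intervals, and flag observations that fall outside them. The order (1, 1, 1) and alpha are illustrative and should be chosen via the usual ARIMA identification steps:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def arima_anomaly_detection(series, order=(1, 1, 1), alpha=0.05):
        # Flag observations outside the (1 - alpha) in-sample prediction interval
        fit = ARIMA(series, order=order).fit()
        pred = fit.get_prediction()
        conf = np.asarray(pred.conf_int(alpha=alpha))   # columns: lower, upper bounds
        lower, upper = conf[:, 0], conf[:, 1]
        values = np.asarray(series)
        return [i for i, v in enumerate(values) if v < lower[i] or v > upper[i]]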

Method Selection Guide

Method           | Data Type    | Labeled Data Needed | Scalability | Handles High Dimensions | Good For
-----------------|--------------|---------------------|-------------|-------------------------|------------------------------------------
Z-Score          | Univariate   | No                  | High        | N/A                     | Simple, quick detection
Tukey’s Method   | Univariate   | No                  | High        | N/A                     | Robust to non-normal data
KNN              | Multivariate | No                  | Low         | No                      | Small datasets, clear proximity patterns
LOF              | Multivariate | No                  | Medium      | No                      | Local density variations
DBSCAN           | Multivariate | No                  | Medium      | No                      | Varied density clusters
K-Means          | Multivariate | No                  | High        | Somewhat                | Well-separated data
One-Class SVM    | Multivariate | Semi                | Medium      | Somewhat                | Complex decision boundaries
Isolation Forest | Multivariate | No                  | High        | Yes                     | High-dimensional data
Autoencoders     | Multivariate | Semi                | Medium      | Yes                     | Complex, high-dimensional patterns
LSTM-AE          | Sequential   | Semi                | Medium      | Yes                     | Time series, sequence data
ARIMA            | Time Series  | No                  | Medium      | No                      | Structured time series

Anomaly Detection Workflow

  1. Problem Definition

    • Define what constitutes an anomaly in your domain
    • Determine detection goals (prevention, investigation, alerting)
  2. Data Preparation

    • Feature selection/engineering
    • Handling missing values and outliers
    • Normalization/scaling
    • Time alignment (for time series)
  3. Method Selection

    • Based on data type, availability of labels, dimensionality
    • Consider computational constraints
    • Select appropriate methods from the selection guide
  4. Parameter Tuning

    • Set thresholds for anomaly scores (a threshold-tuning sketch follows this workflow)
    • Optimize model parameters (grid search, cross-validation)
    • Balance precision and recall based on business needs
  5. Evaluation

    • Use metrics: precision, recall, F1-score, AUC-ROC
    • Assess computational efficiency
    • Validate with domain experts
  6. Deployment

    • Implement detection in production systems
    • Set up monitoring and alerting
    • Establish feedback loop for continuous improvement
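
To make steps 4 and 5 concrete, here is a minimal sketch of tuning a score threshold on a labeled validation set; the function name and the choice of F1 as the objective are illustrative assumptions, and in practice the objective should reflect the business cost of false positives vs. false negatives:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def tune_threshold(scores, y_val):
        # scores: higher = more anomalous; y_val: 1 = anomaly, 0 = normal
        precision, recall, thresholds = precision_recall_curve(y_val, scores)
        f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
        best = np.argmax(f1[:-1])        # last precision/recall pair has no threshold
        return thresholds[best]          # score cutoff that maximizes F1 on validation data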

Common Challenges & Solutions

Challenge                  | Solution
---------------------------|--------------------------------------------------------------------------------------
High false positive rate   | Adjust threshold, combine multiple methods, incorporate context
Lack of labeled data       | Use unsupervised or semi-supervised methods, active learning
High dimensionality        | Feature selection, dimensionality reduction (PCA, t-SNE), use forest-based methods
Class imbalance            | Synthetic minority oversampling (SMOTE), cost-sensitive learning
Concept drift              | Online learning, periodic retraining, drift detection methods
Seasonal/cyclical patterns | Decomposition, specialized time series methods
Multimodal data            | Ensemble methods, model stacking, specialized models per data type
Interpretability needs     | Rule-based methods, feature importance analysis, LIME/SHAP explanations

Best Practices & Tips

Data Preparation

  • Normalize features to prevent scale biases
  • Handle seasonality in time series before detection
  • Remove known outliers from training data in semi-supervised approaches
  • Create domain-specific features that might highlight anomalies

Method Selection

  • Start simple: try statistical methods before complex ML approaches
  • Ensemble multiple methods for more robust detection (a minimal sketch follows this list)
  • Consider domain requirements for real-time vs. batch processing
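
One simple way to ensemble detectors is to rank-normalize each detector's anomaly scores and average them. This sketch combines Isolation Forest and LOF; the function name, n_neighbors, and the quantile cutoff are illustrative assumptions:

    import numpy as np
    from scipy.stats import rankdata
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    def ensemble_anomaly_detection(data, contamination=0.05):
        # Negate so that higher = more anomalous for both detectors
        if_scores = -IsolationForest(contamination=contamination).fit(data).score_samples(data)
        lof_scores = -LocalOutlierFactor(n_neighbors=20).fit(data).negative_outlier_factor_
        # Average rank-normalized scores so the two scales are comparable
        combined = (rankdata(if_scores) + rankdata(lof_scores)) / (2 * len(data))
        threshold = np.quantile(combined, 1 - contamination)
        return [i for i, s in enumerate(combined) if s > threshold]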

Threshold Setting

  • Set thresholds based on business impact, not statistical significance
  • Consider adaptive thresholds that evolve with the data (a minimal sketch follows this list)
  • Use separate thresholds for different segments/contexts
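
A minimal sketch of an adaptive threshold: flag points whose anomaly score exceeds a rolling quantile of recent scores, so the cutoff drifts with the score distribution. The function name, window size, and quantile are illustrative assumptions:

    import pandas as pd

    def adaptive_threshold_flags(scores, window=500, quantile=0.99):
        # Rolling quantile of past scores; shift(1) excludes the current point
        s = pd.Series(scores)
        threshold = s.rolling(window, min_periods=50).quantile(quantile).shift(1)
        return list(s[s > threshold].index)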

Evaluation

  • Focus on recall for critical applications (fraud, security)
  • Prioritize precision when false alarms are costly
  • Always validate with domain experts

Production Implementation

  • Implement gradual rollout with human oversight
  • Set up monitoring for the detector itself
  • Create feedback mechanisms to capture false positives/negatives

Resources for Further Learning

Books

  • “Outlier Analysis” by Charu Aggarwal
  • “Anomaly Detection Principles and Algorithms” by Mehrotra et al.
  • “Python Data Science Handbook” by Jake VanderPlas

Online Courses

  • Coursera: “Anomaly Detection in Time Series Data with Keras”
  • edX: “Data Science: Machine Learning”
  • Udemy: “Machine Learning A-Z™: Hands-On Python & R”

Libraries & Tools

  • Python: Scikit-learn, PyOD, ADTK (Anomaly Detection Toolkit)
  • R: anomalize, AnomalyDetection, outliers
  • Commercial: AWS SageMaker, Datadog, Anodot, Microsoft Azure Anomaly Detector

Research Papers

  • “A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms” by Goldstein & Uchida
  • “LOF: Identifying Density-Based Local Outliers” by Breunig et al.
  • “Isolation Forest” by Liu, Ting & Zhou

Communities & Forums

  • Kaggle Competitions (search for anomaly detection)
  • Stack Overflow tags: anomaly-detection, outlier-detection
  • KDnuggets articles on anomaly detection