Big Data Technologies Cheatsheet: The Complete Reference Guide

Core Big Data Frameworks & Platforms

| Technology | Description | Key Features | Best For |
|---|---|---|---|
| Apache Hadoop | Open-source distributed computing framework | • HDFS (storage) • YARN (resource management) • MapReduce (processing model) • Fault tolerance | Batch processing of large datasets; foundation for data lakes |
| Apache Spark | Fast, in-memory data processing engine | • Up to 100x faster than Hadoop MapReduce on some in-memory workloads • Unified API for batch/stream processing • Built-in libraries for ML, graph processing, SQL | Interactive queries, machine learning, real-time analytics |
| Cloudera CDP | Enterprise data platform | • Integrated security & governance • Hybrid cloud architecture • End-to-end data lifecycle | Enterprise-scale deployments with compliance requirements |
| Databricks | Unified analytics platform | • Optimized Spark runtime • Notebook interface • MLflow integration • Delta Lake support | Collaborative data science, ML workflows, data engineering |
| Google BigQuery | Serverless data warehouse | • Separation of storage and compute • ML integration • Real-time analytics | Ad-hoc analysis of massive datasets without infrastructure management |
| Snowflake | Cloud data platform | • Multi-cluster architecture • Separation of storage/compute • Semi-structured data support | Enterprise data warehousing, data sharing, data applications |
| Amazon EMR | Managed Hadoop framework | • Easy cluster provisioning • Integration with AWS services • Support for Spark, Hive, Presto | Cloud-based big data processing on AWS |

Distributed Storage Technologies

| Technology | Type | Description | Best For |
|---|---|---|---|
| HDFS | Distributed file system | Hadoop's storage layer with data replication | Storing large files for batch processing |
| Amazon S3 | Object storage | Highly durable cloud storage service | Cost-effective storage of any amount of data |
| Apache HBase | NoSQL wide-column database | Random, real-time read/write access on HDFS | Real-time read/write access to sparse data |
| Apache Cassandra | NoSQL distributed database | Linear scalability with no single point of failure | Distributed applications requiring high availability |
| MongoDB | Document database | Flexible schema with BSON document storage | Applications with complex, evolving data models |
| Google Bigtable | NoSQL database service | Low latency and high throughput | Time-series data, IoT, analytics storage |
| Apache Parquet | Columnar file format | Efficient compression and encoding schemes | Analytical workloads with complex queries |
| Apache ORC | Columnar file format | Optimized for Hive with advanced compression | Reducing storage footprint for Hadoop workloads |
| Apache Iceberg | Table format | Schema evolution, atomic operations | Managing large analytic datasets |
| Delta Lake | Storage layer | ACID transactions on data lakes | Reliable data lakes with schema enforcement |

Data Processing Engines

| Technology | Processing Model | Key Features | Ideal For |
|---|---|---|---|
| MapReduce | Batch | • Simple programming model • High fault tolerance • Built for very large datasets | Batch jobs with linear data processing flow |
| Apache Spark Core | Batch & micro-batch | • In-memory processing • DAG execution engine • Lazy evaluation | General-purpose data processing |
| Apache Tez | DAG-based | • Dynamic DAG optimization • Container reuse • Resource-aware scheduling | Complex multi-stage data processing workflows |
| Apache Flink | Stream & batch | • True streaming (not micro-batch) • Event time processing • Exactly-once semantics | Applications requiring real-time processing with low latency |
| Presto/Trino | Distributed SQL | • MPP (massively parallel processing) • In-memory processing • Multiple data source connectors | Interactive querying across multiple data sources |
| Databricks Photon | Vectorized engine | • Vectorized execution • Native code generation • Compatible with Spark APIs | Performance-critical Spark workloads |
| Apache Beam | Unified (batch & stream) | • Runner agnostic • Unified programming model • Windowing abstractions | Portable data processing pipelines |
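
To make the "simple programming model" listed for MapReduce concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) as a word count. It uses only the Python standard library and is not Hadoop code; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_lines = ["big data big clusters", "data lakes"]
counts = word_count(sample_lines)
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data over the network; this sketch only shows the shape of the model.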

Stream Processing Technologies

| Technology | Description | Key Capabilities | Use Cases |
|---|---|---|---|
| Apache Kafka | Distributed event streaming platform | • High throughput • Scalable storage • Client libraries | Log aggregation, messaging, activity tracking |
| Kafka Streams | Client library for Kafka | • Exactly-once processing • No separate cluster required • Local state stores | Stream processing directly on Kafka |
| Apache Flink | Stream processing framework | • Stateful stream processing • Event time processing • High throughput/low latency | Complex event processing, anomaly detection |
| Spark Structured Streaming | Stream processing on Spark | • Incremental query execution • End-to-end exactly-once • Unified batch/streaming API | Integrating streaming with batch workloads |
| AWS Kinesis | Managed streaming service | • Automatic scaling • Integration with AWS services • Real-time analytics | Real-time dashboards, analytics on streaming data |
| Google Dataflow | Managed stream/batch processing | • Autoscaling • Unified batch/stream • Apache Beam implementation | Processing data pipelines with varying traffic |
| Azure Event Hubs | Cloud messaging service | • Big data streaming • Millions of events per second • Time retention | Telemetry intake, live dashboarding |
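
"Event time processing" in the table above means results are grouped by the timestamp carried in each event rather than by arrival order. The sketch below illustrates that idea with fixed (tumbling) windows in plain Python. It is not the Flink or Spark API, it works on a finite list, and it ignores watermarks and late-data handling; the function name `tumbling_window_counts` and the sample events are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping event-time windows.

    events: iterable of (event_time_seconds, key) pairs, possibly out of order.
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Assign the event to the window that contains its own timestamp,
        # regardless of where it appears in the input sequence.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Events arrive out of order: the click at t=9 shows up after the one at t=12.
events = [(3, "click"), (12, "click"), (7, "view"), (9, "click"), (14, "view")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
```

Because the window is derived from the event's timestamp, the late-arriving click at t=9 still lands in the 0 to 10 second window.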

Query & Analysis Tools

| Technology | Type | Key Features | When To Use |
|---|---|---|---|
| Apache Hive | SQL-on-Hadoop | • HiveQL (SQL-like) • Schema-on-read • Metadata store | Batch SQL queries on large datasets |
| Apache Impala | MPP SQL engine | • Low latency queries • HDFS/HBase integration • Native execution | Interactive SQL queries on Hadoop |
| Apache Druid | OLAP database | • Column-oriented storage • Real-time ingestion • Fast aggregations | Time series analytics, dashboards |
| ClickHouse | Column-oriented DBMS | • High performance • Real-time data updates • Linear scalability | Real-time analytics with high ingestion rates |
| Apache Kylin | OLAP engine | • MOLAP cube architecture • SQL interface • Sub-second query latency | Business intelligence on huge datasets |
| Dremio | Data lake engine | • Data reflections (acceleration) • Self-service semantics • Multi-source queries | Accelerating queries across data lake sources |
| Spark SQL | SQL engine for Spark | • DataFrame API • Catalyst optimizer • Native integration with Spark | SQL analytics within Spark applications |
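
"Schema-on-read", listed for Hive above, means raw data is stored as-is and a schema is applied only when the data is queried. The following standard-library sketch shows the idea on an in-memory CSV string; it is not Hive code. The names `RAW_DATA`, `SCHEMA`, and `read_with_schema` are illustrative, and turning unparseable values into `None` is an assumption chosen for this example.

```python
import csv
import io

# Raw data is stored without validation; the second row has a bad amount.
RAW_DATA = "1,alice,34.5\n2,bob,not_a_number\n3,carol,12.0\n"

# The schema lives apart from the data and is applied at read time.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse raw rows using the schema; values that fail to parse become None."""
    for row in csv.reader(io.StringIO(raw_text)):
        record = {}
        for (column, cast), value in zip(schema, row):
            try:
                record[column] = cast(value)
            except ValueError:
                record[column] = None
        yield record


records = list(read_with_schema(RAW_DATA, SCHEMA))
total = sum(r["amount"] for r in records if r["amount"] is not None)
print(records)
print(total)
```

The malformed row is still stored and still readable; the mismatch only surfaces when the schema is applied, which is the trade-off schema-on-read makes against schema-on-write.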

Data Orchestration & Workflow Management

| Technology | Description | Key Features | Best Suited For |
|---|---|---|---|
| Apache Airflow | Workflow automation platform | • DAG-based workflows • Python for task definition • Extensible architecture | Complex ETL pipelines with dependencies |
| Apache NiFi | Data flow automation | • Visual web-based UI • Data provenance • Extensible processor model | Data routing, transformation, system mediation |
| Dagster | Data orchestrator | • Data-aware orchestration • Type-checked data handoffs • Testing framework | Data pipelines with complex interdependencies |
| Luigi | Pipeline framework | • Dependency resolution • Failure recovery • Command line integration | Batch jobs with complex dependencies |
| Prefect | Workflow management | • Hybrid execution model • Positive engineering • Dynamic workflows | Modern data stacks with complex flows |
| AWS Step Functions | Serverless orchestration | • Visual workflow editor • AWS service integration • Parallelization | Coordinating AWS services in serverless apps |
| Argo Workflows | Kubernetes-native workflows | • Container-native • CI/CD integration • Complex dependencies | Cloud-native data pipelines on Kubernetes |
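
Several tools above describe workflows as DAGs and resolve task dependencies before running anything. As a rough illustration of that dependency resolution step, the sketch below orders a small pipeline with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). It is not the API of Airflow, Luigi, or any other listed tool, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical.
dependencies = {
    "extract": set(),
    "clean_orders": {"extract"},
    "clean_users": {"extract"},
    "join": {"clean_orders", "clean_users"},
    "load": {"join"},
}


def resolve_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


execution_order = resolve_order(dependencies)
print(execution_order)
```

The two cleaning tasks have no dependency on each other, so either may come first; an orchestrator can use that freedom to run them in parallel.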

Data Visualization & BI Tools

| Technology | Type | Strengths | Ideal For |
|---|---|---|---|
| Tableau | BI & visualization | • Intuitive interface • Rich visualizations • Strong data connectivity | Business users needing self-service analytics |
| Power BI | BI & visualization | • Microsoft ecosystem integration • Cost-effective • Regular updates | Organizations using Microsoft stack |
| Looker | Data platform | • LookML modeling layer • Git integration • Embedded analytics | Data-driven organizations needing governed self-service |
| Apache Superset | Open-source BI | • Interactive exploration • Wide visualization library • SQL IDE | Organizations wanting open-source modern BI |
| Grafana | Observability platform | • Time-series focused • Alerting capabilities • Plugin ecosystem | Real-time monitoring, operational dashboards |
| Kibana | Elasticsearch frontend | • Log & document exploration • Integration with Elastic Stack • Security features | Searching and visualizing log data |
| Redash | Query & visualization | • Multiple data source support • Query library • Shareable dashboards | SQL-based reporting and dashboards |

Machine Learning & AI for Big Data

| Technology | Category | Key Features | Use Cases |
|---|---|---|---|
| Spark MLlib | ML library | • Distributed algorithms • Pipeline API • Integration with Spark ecosystem | Large-scale ML integrated with data processing |
| TensorFlow on Spark | Distributed deep learning | • Distributed TensorFlow • CPU/GPU training • Model parallelism | Deep learning at scale on existing Spark clusters |
| H2O | Automated ML platform | • AutoML capabilities • R, Python, Scala APIs • GPU acceleration | Automated model building on large datasets |
| Ray | Distributed computing | • Task parallelism • Reinforcement learning • Distributed training | Scaling Python ML workflows and AI applications |
| MLflow | ML lifecycle platform | • Experiment tracking • Model registry • Model serving | End-to-end ML lifecycle management |
| Kubeflow | ML toolkit for Kubernetes | • End-to-end ML pipelines • Model training/serving • Notebook servers | ML workflows on Kubernetes |
| Databricks ML | Unified ML platform | • Feature store • AutoML • MLflow integration | End-to-end ML workflows with governance |

Big Data Integration & Ingestion Tools

| Technology | Type | Key Features | Best For |
|---|---|---|---|
| Apache Kafka Connect | Data integration for Kafka | • Source/sink connectors • Scalable architecture • Transformation capabilities | Streaming data integration with Kafka |
| Apache Sqoop | Database ingestion | • RDBMS to Hadoop transfers • Incremental imports • Parallel transfers | Batch loading data from relational databases |
| Apache Flume | Log collection | • Reliable data movement • Multiple sources/sinks • Extensible architecture | Collecting, aggregating, and moving log data |
| Talend | Data integration platform | • Visual interface • 1000+ connectors • Data quality features | Enterprise data integration with governance |
| Fivetran | Cloud ELT | • Automated schema management • Incremental updates • Monitoring & alerts | Turnkey data pipeline construction |
| Airbyte | Open-source ELT | • Growing connector library • Customizable connectors • Data synchronization | Building and maintaining ELT pipelines |
| Stitch | Cloud ETL service | • Simple interface • Quick setup • Comprehensive integrations | Quick setup for common data sources to destinations |
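
"Incremental imports" and "incremental updates" in the table above generally mean fetching only rows added since the previous run, tracked by a high-water mark on some column. The sketch below shows that pattern against an in-memory SQLite database using the standard library. It is not Sqoop or Fivetran code; the `orders` table, its columns, and the function `incremental_import` are assumptions made for illustration.

```python
import sqlite3


def incremental_import(conn, last_value):
    """Fetch only rows whose id is greater than the last imported id.

    Returns the new rows and the updated high-water mark.
    """
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    new_last_value = rows[-1][0] if rows else last_value
    return rows, new_last_value


# Hypothetical source table with two existing rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("disk",), ("cpu",)])

# First run imports everything and records the checkpoint.
first_batch, checkpoint = incremental_import(conn, last_value=0)

# A new row arrives; the second run picks up only that row.
conn.execute("INSERT INTO orders (item) VALUES (?)", ("ram",))
second_batch, checkpoint = incremental_import(conn, checkpoint)

print(first_batch, second_batch, checkpoint)
```

A monotonically increasing id only captures appended rows; detecting updates to existing rows typically needs a last-modified timestamp column or change data capture instead.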

Container & Cluster Management

| Technology | Description | Key Features | Best For |
|---|---|---|---|
| Kubernetes | Container orchestration | • Auto-scaling • Service discovery • Declarative configuration | Managing containerized big data applications |
| Docker | Containerization platform | • Isolation • Reproducibility • Portability | Packaging data applications with dependencies |
| Apache Mesos | Cluster manager | • Two-level scheduling • Resource isolation • Scalability | Large-scale cluster resource management |
| YARN (Hadoop 3) | Resource manager | • Docker support • Federation • GPU isolation | Managing resources on Hadoop clusters |
| Amazon EKS | Managed Kubernetes | • AWS integration • Automated updates • Scalability | Running Kubernetes on AWS |
| Google Kubernetes Engine | Managed Kubernetes | • Autopilot mode • Auto-scaling • Multi-cluster management | Operating production-grade Kubernetes on GCP |
| Azure Kubernetes Service | Managed Kubernetes | • Integration with Azure services • DevOps integration • Security features | Running Kubernetes on Azure |

Key Big Data File Formats

| Format | Type | Characteristics | Best For |
|---|---|---|---|
| Parquet | Columnar | • Column-oriented • Efficient compression • Schema preservation | Analytical workloads, complex queries |
| ORC | Columnar | • Optimized for Hive • ACID support • Predicate pushdown | Hive-based analytical workloads |
| Avro | Row-based | • Schema evolution • Compact binary format • Rich data structures | Data serialization, schema evolution |
| JSON | Semi-structured | • Human-readable • Flexible schema • Wide support | APIs, configuration, flexible data |
| Protobuf | Binary serialization | • Compact binary format • Schema definition • Cross-language support | Efficient serialization with fixed schema |
| Delta | Table format | • ACID transactions • Schema enforcement • Time travel | Reliable data lake architecture |
| Iceberg | Table format | • Schema evolution • Hidden partitioning • Snapshot isolation | Large-scale analytics tables |
| Hudi | Table format | • Upserts & deletes • Incremental processing • Time travel | Record-level operations on data lakes |
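
The difference between the row-based and columnar entries above can be seen with a toy in-memory example: storing values column by column lets an aggregate read a single column, and it places similar values next to each other, which helps compression. The sketch below is plain Python, not the Parquet or ORC on-disk layout; run-length encoding is used here only as one simple example of an encoding that benefits from columnar layout, and all names and sample data are illustrative.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user": "a", "country": "DE", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 5.5},
    {"user": "c", "country": "DE", "amount": 4.5},
    {"user": "d", "country": "US", "amount": 20.0},
    {"user": "e", "country": "US", "amount": 1.0},
]


def to_columns(records):
    """Pivot row records into a column-oriented layout: one list per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(rows)

# An aggregate over one field touches only that column's values.
total_amount = sum(columns["amount"])

# Repeated neighbouring values in a column compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount, encoded_country)
```

With the row layout, the same sum would have to walk every record and skip over the `user` and `country` fields, which is why row-based formats such as Avro suit record-at-a-time serialization while columnar formats suit analytical scans.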

Monitoring & Observability Tools

| Technology | Focus | Key Features | Best For |
|---|---|---|---|
| Prometheus | Metrics | • Time-series database • Powerful query language • Alert management | Monitoring Kubernetes and cloud-native applications |
| Grafana | Visualization | • Multi-source dashboards • Alerting • Annotation | Creating observability dashboards |
| Elastic APM | Application monitoring | • Distributed tracing • Performance metrics • Error tracking | End-to-end application monitoring |
| Datadog | Full-stack observability | • Infrastructure monitoring • APM • Log management | Complete visibility across distributed systems |
| New Relic | Full-stack observability | • Real-time metrics • Distributed tracing • Applied intelligence | Performance monitoring of applications and infrastructure |
| Splunk | Machine data platform | • Search & investigation • Alerts & dashboards • AI-powered analytics | Security, IT, and DevOps monitoring |
| Cloudera Manager | Hadoop management | • Cluster deployment • Configuration management • Service monitoring | Managing Hadoop ecosystem components |