Ultimate Hadoop & Spark Big Data Cheatsheet: Commands, Concepts & Best Practices

Introduction to Big Data Fundamentals

Big Data refers to datasets that are too large or complex for traditional data processing software to handle efficiently. The concept revolves around the “5 Vs”: Volume (size), Velocity (speed of generation), Variety (different formats), Veracity (uncertainty/quality), and Value (extracting insights). Hadoop and Spark are two of the most important frameworks in the Big Data ecosystem: Hadoop provides distributed storage and batch processing, while Spark offers fast, in-memory processing across diverse workloads. Understanding these technologies matters because they enable organizations to analyze massive datasets for valuable insights, make data-driven decisions, and apply machine learning at scale.

Hadoop Ecosystem Components

Component  | Primary Function         | Key Characteristics
HDFS       | Distributed file system  | Fault-tolerant, high throughput, optimized for large files
YARN       | Resource management      | Manages cluster resources and schedules jobs
MapReduce  | Processing framework     | Batch processing model with Map and Reduce phases
Hive       | SQL-like interface       | Data warehouse infrastructure for SQL queries on Hadoop
HBase      | NoSQL database           | Column-oriented datastore for random, real-time access
Pig        | Data flow scripting      | High-level language for expressing data analysis programs
ZooKeeper  | Coordination service     | Synchronization, configuration, naming registry
Oozie      | Workflow scheduler       | Manages Hadoop jobs with directed acyclic graphs (DAGs)
Sqoop      | Data transfer tool       | Imports/exports data between RDBMS and Hadoop
Flume      | Log collection           | Collects, aggregates, and moves streaming data
Ambari     | Management platform      | Provisions, manages, and monitors Hadoop clusters
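
In practice these components are combined. The PySpark sketch below is illustrative only: the YARN master, HDFS paths, and Hive table name are placeholders. It reads raw data from HDFS, queries a Hive table through Spark SQL, and writes the result back to HDFS.

from pyspark.sql import SparkSession

# Placeholder session: YARN manages resources, Hive supplies the metastore
spark = (SparkSession.builder
         .appName("ecosystem-demo")
         .master("yarn")            # placeholder cluster manager
         .enableHiveSupport()       # requires a configured Hive metastore
         .getOrCreate())

# HDFS: read raw text from the distributed file system (placeholder path)
logs = spark.read.text("hdfs:///data/raw/events.txt")

# Hive: query a warehouse table through Spark SQL (placeholder table)
employees = spark.sql("SELECT name, department FROM employees")

# HDFS again: write the result back as Parquet (placeholder output path)
employees.write.mode("overwrite").parquet("hdfs:///data/curated/employees")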

HDFS Commands & Operations

Basic File Operations

# List files
hdfs dfs -ls /path

# Create directory
hdfs dfs -mkdir /path/directory

# Copy from local to HDFS
hdfs dfs -put localfile /path/in/hdfs

# Copy from HDFS to local
hdfs dfs -get /path/in/hdfs localfile

# View file content
hdfs dfs -cat /path/to/file

# Delete file or directory
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory  # Recursive

# Copy within HDFS
hdfs dfs -cp /source/path /destination/path

# Move within HDFS
hdfs dfs -mv /source/path /destination/path

# Check file size
hdfs dfs -du -h /path/to/file

# Set replication factor
hdfs dfs -setrep -w 3 /path/to/file

Administrative Commands

# Check HDFS status
hdfs dfsadmin -report

# Enter safe mode
hdfs dfsadmin -safemode enter

# Leave safe mode
hdfs dfsadmin -safemode leave

# Check file system health
hdfs fsck /path

# Balance data across datanodes
hdfs balancer

# Refresh namenode configuration
hdfs dfsadmin -refreshNodes

# Set HDFS quota
hdfs dfsadmin -setQuota <quota> <directory>
hdfs dfsadmin -setSpaceQuota <quota> <directory>

MapReduce Concepts & Patterns

MapReduce Process Flow

  1. Input: Data split into chunks
  2. Map Phase: Transform data into key-value pairs
  3. Shuffle & Sort: Group data by keys
  4. Reduce Phase: Aggregate values by key
  5. Output: Final results written to HDFS
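
To make these phases concrete, here is a plain-Python sketch that mimics each step for a word count on a made-up input; it runs locally, involves no Hadoop, and only illustrates the data flow.

from collections import defaultdict

# 1. Input: toy "splits" (in Hadoop these would be HDFS blocks)
splits = ["big data big", "data spark"]

# 2. Map phase: emit (key, value) pairs from each record
mapped = [(word, 1) for line in splits for word in line.split()]

# 3. Shuffle & sort: group all values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 4. Reduce phase: aggregate the values for each key
reduced = {key: sum(values) for key, values in sorted(groups.items())}

# 5. Output: final results (a real job would write these to HDFS)
print(reduced)  # {'big': 2, 'data': 2, 'spark': 1}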

Common MapReduce Patterns

Filtering: Select records matching criteria • Counting: Count occurrences of items • Sorting: Order data by key • Joining: Combine datasets on common keys • Graph Processing: Process graph data structures • Inverted Index: Build search indexes

Basic Word Count Example (Java)

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every token in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job (args: input path, output path)
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Apache Spark Architecture

Core Components

Driver Program: Contains application’s main function, creates SparkContext • SparkContext: Coordinates and communicates with executors • Cluster Manager: Acquires resources (YARN, Mesos, Kubernetes, Standalone) • Executors: JVM processes running tasks on worker nodes • Tasks: Individual units of computation • RDDs: Resilient Distributed Datasets (fundamental data abstraction) • DAG Scheduler: Creates optimized execution plans
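
The sketch below shows where these pieces surface in a PySpark application; the master URL and executor memory value are placeholders.

from pyspark.sql import SparkSession

# Driver program: this script; building the session creates the SparkContext,
# which negotiates executors with the cluster manager
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("local[4]")                     # placeholder cluster manager (local, 4 cores)
         .config("spark.executor.memory", "2g")  # placeholder executor sizing
         .getOrCreate())
sc = spark.sparkContext

# RDD: the fundamental abstraction; its partitions become tasks on executors
rdd = sc.parallelize(range(1000), numSlices=8)

# An action makes the DAG scheduler build stages and run their tasks
print(rdd.map(lambda x: x * x).sum())

spark.stop()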

Key Spark Concepts

Transformations: Lazy operations that create new RDDs (map, filter, etc.) • Actions: Operations that trigger computation and return results (count, collect) • Lazy Evaluation: Computations are only executed when an action is called • Persistence: Caching data in memory or disk for reuse • Partitioning: How data is distributed across the cluster • Shuffling: Redistributing data across partitions • Broadcast Variables: Read-only variables cached on each machine • Accumulators: Variables that can only be added to
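
A short PySpark sketch tying several of these concepts together (made-up data, assuming a local session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concepts-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# Transformations are lazy: nothing has executed yet
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Persistence: keep the result in memory because it is reused below
evens.cache()

# Actions trigger execution
print(evens.count())    # first action computes and caches the RDD
print(evens.collect())  # second action reuses the cached data

# Broadcast variable: read-only lookup shipped once to each executor
lookup = sc.broadcast({0: "zero", 4: "four", 8: "eight"})
print(evens.map(lambda x: lookup.value.get(x, "other")).collect())

# Accumulator: tasks can only add to it; the driver reads the total
counter = sc.accumulator(0)
evens.foreach(lambda x: counter.add(1))
print(counter.value)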

Spark Core Operations (PySpark)

RDD Creation and Basic Operations

# Create RDD from collection
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Create RDD from file
rdd = sc.textFile("hdfs:///path/to/file.txt")

# Map transformation
mapped_rdd = rdd.map(lambda x: x * 2)

# Filter transformation
filtered_rdd = rdd.filter(lambda x: x > 2)

# FlatMap transformation
text_rdd = sc.textFile("hdfs:///path/to/file.txt")
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))

# Set operations
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
union_rdd = rdd1.union(rdd2)      # [1, 2, 3, 3, 4, 5]
intersection_rdd = rdd1.intersection(rdd2)  # [3]

RDD Actions

# Count elements
count = rdd.count()

# Collect results to driver
result = rdd.collect()

# Take first n elements
first_elements = rdd.take(5)

# Count by key (requires a key-value pair RDD)
key_counts = paired_rdd.countByKey()

# Reduce operation
sum_of_elements = rdd.reduce(lambda a, b: a + b)

# Save RDD to file
rdd.saveAsTextFile("hdfs:///path/to/output")

Key-Value Pair Operations

# Create pair RDD
pair_rdd = rdd.map(lambda x: (x, 1))

# ReduceByKey
word_counts = pair_rdd.reduceByKey(lambda a, b: a + b)

# GroupByKey
grouped = pair_rdd.groupByKey()

# Join operations (both RDDs must be key-value pair RDDs)
joined_rdd = rdd1.join(rdd2)
left_joined = rdd1.leftOuterJoin(rdd2)
right_joined = rdd1.rightOuterJoin(rdd2)

# Aggregate by key
result = pair_rdd.aggregateByKey(
    0,                          # Zero value
    lambda acc, val: acc + val, # Combine values within partitions
    lambda acc1, acc2: acc1 + acc2  # Combine values across partitions
)

Spark SQL & DataFrames (PySpark)

DataFrame Creation

# Create DataFrame from RDD
from pyspark.sql import Row
row_rdd = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))  # assumes an RDD of (name, age) records
df = spark.createDataFrame(row_rdd)

# Create DataFrame from data source
df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)
df = spark.read.json("hdfs:///path/to/file.json")
df = spark.read.parquet("hdfs:///path/to/file.parquet")

# Create DataFrame from Hive table
df = spark.sql("SELECT * FROM employees")

# Create temporary view
df.createOrReplaceTempView("people")
result = spark.sql("SELECT * FROM people WHERE age > 30")

DataFrame Operations

# Select columns
result = df.select("name", "age")

# Filter data
filtered = df.filter(df.age > 30)
filtered = df.where("age > 30")

# Add column
with_col = df.withColumn("doubled_age", df.age * 2)

# Rename column
renamed = df.withColumnRenamed("age", "years")

# Drop column
dropped = df.drop("age")

# Distinct values
unique = df.select("department").distinct()

# Group by and aggregate
grouped = df.groupBy("department").agg({"salary": "avg", "age": "max"})

# Sort data
sorted_df = df.orderBy("age")  # Ascending
sorted_df = df.orderBy(df.age.desc())  # Descending

# Join DataFrames
joined = df1.join(df2, df1.id == df2.id, "inner")

DataFrame Actions

# Show data
df.show()
df.show(10, False)  # 10 rows, don't truncate strings

# Get number of rows
count = df.count()

# Get DataFrame schema
df.printSchema()

# Describe statistics
df.describe().show()

# Collect to Python list
rows = df.collect()

# Convert to Pandas DataFrame
pandas_df = df.toPandas()

# Write to storage (each output path becomes a directory of part files)
df.write.csv("hdfs:///path/to/output.csv")
df.write.parquet("hdfs:///path/to/output.parquet")
df.write.saveAsTable("my_table")

Spark Streaming

Basic Streaming (PySpark)

# Import libraries
from pyspark.streaming import StreamingContext

# Create streaming context
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

# Create DStream from socket
lines = ssc.socketTextStream("localhost", 9999)

# Create DStream from files
lines = ssc.textFileStream("hdfs:///path/to/streaming/dir")

# Create DStream from Kafka (legacy receiver-based API; pyspark.streaming.kafka was
# removed in Spark 3.x -- prefer Structured Streaming's Kafka source for new code)
from pyspark.streaming.kafka import KafkaUtils
kafkaStream = KafkaUtils.createStream(ssc, "zk_host:2181", "consumer_group", {"topic": 1})

# Process streams
word_counts = lines.flatMap(lambda line: line.split(" ")) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda a, b: a + b)

# Output operations
word_counts.pprint()  # Print to console
word_counts.saveAsTextFiles("hdfs:///path/to/output")

# Start and stop streaming
ssc.start()
ssc.awaitTermination()
# or ssc.stop(stopSparkContext=False, stopGraceFully=True)

Structured Streaming (PySpark)

# Create streaming DataFrame from socket
lines = spark.readStream.format("socket") \
                        .option("host", "localhost") \
                        .option("port", 9999) \
                        .load()

# Process streaming data (explode and split come from pyspark.sql.functions)
from pyspark.sql.functions import explode, split
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()

# Output sink
query = wordCounts.writeStream \
                  .outputMode("complete") \
                  .format("console") \
                  .start()

# Kafka as source
df = spark.readStream \
         .format("kafka") \
         .option("kafka.bootstrap.servers", "host:port") \
         .option("subscribe", "topic") \
         .load()

# File sink with checkpointing
query = df.writeStream \
          .format("parquet") \
          .option("path", "hdfs:///path/to/output") \
          .option("checkpointLocation", "hdfs:///path/to/checkpoint") \
          .start()

query.awaitTermination()

Spark MLlib (Machine Learning)

ML Pipeline Operations

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Prepare data
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "categoryVec"], 
    outputCol="features"
)
lr = LogisticRegression(maxIter=10, regParam=0.01)

# Create and run pipeline
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(training_data)
predictions = model.transform(test_data)

# Evaluate model
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)

Common ML Algorithms

# Linear Regression
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(training_data)

# Random Forest
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(numTrees=100)
rf_model = rf.fit(training_data)

# K-Means Clustering
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=5, seed=1)
model = kmeans.fit(data)

Comparison: Hadoop vs. Spark

Feature          | Hadoop MapReduce                          | Apache Spark
Processing Model | Batch processing                          | Batch, interactive, streaming, ML
Speed            | Disk-based (slower)                       | In-memory (up to 100x faster)
Ease of Use      | Complex Java code                         | Simple APIs in Java, Scala, Python, R
Data Processing  | Read → Process → Write for each job       | Caches intermediate data in memory
Fault Tolerance  | Replication for HDFS, rerunning tasks     | RDD lineage and checkpointing
Scheduling       | YARN, internal scheduler                  | YARN, Mesos, Kubernetes, Standalone
Memory Usage     | Low memory requirements                   | Higher memory requirements
Use Cases        | Very large batch jobs, budget constraints | Interactive queries, iterative algorithms, streaming
Libraries        | Limited built-in libraries                | Rich ecosystem (SQL, MLlib, GraphX, Streaming)

Performance Tuning Best Practices

Hadoop Tuning

Memory Configuration:

  • Set mapreduce.map.java.opts and mapreduce.reduce.java.opts (formerly mapred.child.java.opts) to size mapper/reducer heaps
  • Tune yarn.nodemanager.resource.memory-mb for YARN memory allocation

I/O Optimization:

  • Use SequenceFile or columnar formats (Parquet, ORC) instead of plain text files
  • Configure appropriate block sizes (dfs.blocksize)
  • Enable compression (mapreduce.output.fileoutputformat.compress, formerly mapred.output.compress)

Job Configuration:

  • Set appropriate mapreduce.job.maps and mapreduce.job.reduces (formerly mapred.map.tasks and mapred.reduce.tasks)
  • Use combiner when possible to reduce data shuffling
  • JVM reuse (mapred.job.reuse.jvm.num.tasks) applies only to MRv1; on YARN, consider uber mode (mapreduce.job.ubertask.enable) for small jobs

Spark Tuning

Memory Management:

  • Size spark.executor.memory (and spark.executor.memoryOverhead) to match your nodes and workload
  • Tune spark.memory.fraction (default 0.6), the share of heap used for execution and storage combined
  • Use spark.memory.storageFraction (default 0.5) to protect part of that region for cached data (see the sketch below)
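
A minimal sketch of how these settings could be passed when building a session; the values are placeholders, not recommendations:

from pyspark.sql import SparkSession

# Placeholder values -- size them to your cluster and workload
spark = (SparkSession.builder
         .appName("memory-tuning-demo")
         .config("spark.executor.memory", "4g")            # heap per executor
         .config("spark.executor.memoryOverhead", "512m")  # off-heap overhead per executor
         .config("spark.memory.fraction", "0.6")           # heap share for execution + storage
         .config("spark.memory.storageFraction", "0.5")    # portion of that region kept for cached data
         .getOrCreate())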

Parallelism:

  • Set spark.default.parallelism to 2-3x CPU cores available
  • Rebalance skewed data with repartition() (full shuffle); use coalesce() to reduce partition counts without one (see the sketch below)
  • Use partitionBy() for key-based operations
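
For example (illustrative partition counts, assuming an existing SparkContext sc):

# Illustrative parallelism adjustments on an existing RDD
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Full shuffle into more partitions, e.g. to spread out skewed data
balanced = rdd.repartition(64)

# Narrow dependency: shrink the partition count without a full shuffle
compacted = balanced.coalesce(16)

# Hash-partition a pair RDD by key so repeated joins/aggregations avoid extra shuffles
pairs = rdd.map(lambda x: (x % 10, x)).partitionBy(32)
print(pairs.getNumPartitions())  # 32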

Data Serialization:

  • Use Kryo serialization with spark.serializer
  • Register application classes via spark.kryo.classesToRegister or SparkConf.registerKryoClasses (see the sketch below)
  • Consider spark.kryoserializer.buffer.max
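
A sketch of enabling Kryo through SparkConf; com.example.MyRecord is a hypothetical class name, and registration mainly matters for JVM-side (Scala/Java) classes:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.classesToRegister", "com.example.MyRecord")  # hypothetical class name
        .set("spark.kryoserializer.buffer.max", "128m"))              # illustrative buffer size

spark = SparkSession.builder.appName("kryo-demo").config(conf=conf).getOrCreate()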

Data Formats:

  • Use Parquet or ORC instead of CSV or JSON
  • Partition data appropriately
  • Consider bucketing frequently joined keys (see the sketch below)
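
A sketch of writing partitioned Parquet and a bucketed table; df, the column names, paths, and table name are placeholders:

# Illustrative columnar layout; paths and column names are placeholders
df.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .parquet("hdfs:///warehouse/events")

# Bucketing must be saved as a table (placeholder table and column names)
df.write.mode("overwrite") \
        .bucketBy(32, "customer_id") \
        .sortBy("customer_id") \
        .saveAsTable("events_bucketed")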

Shuffle Optimization:

  • Avoid groupByKey; prefer reduceByKey or aggregateByKey (see the sketch below)
  • Use broadcast joins for small tables with broadcast()
  • Monitor shuffle read/write and spill metrics per stage in the Spark UI
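
A sketch contrasting the two aggregation styles and a broadcast join; pairs, large_df, and small_df are placeholder datasets:

from pyspark.sql.functions import broadcast

# Prefer reduceByKey: values are combined within each partition before the shuffle
word_counts = pairs.reduceByKey(lambda a, b: a + b)
# Avoid: groupByKey ships every individual value across the network first
# word_counts = pairs.groupByKey().mapValues(sum)

# Broadcast join: copy the small DataFrame to every executor instead of shuffling the large one
result = large_df.join(broadcast(small_df), "id")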

Common Hadoop & Spark Pitfalls

Hadoop Pitfalls

Small Files Problem: HDFS not optimized for small files • NameNode Memory: Excessive files exhaust NameNode memory • Improper Partitioning: Skewed data causes job imbalance • Speculative Execution: Can cause duplicate processing • Resource Allocation: Improper configuration wastes resources

Spark Pitfalls

Out of Memory Errors: Excessive data in collect() or broadcast() • Inefficient Joins: Not using broadcast for small tables • Skewed Data: Uneven partitioning causes straggler tasks • Unnecessary Shuffling: Using groupByKey instead of reduceByKey • Poor Caching Strategy: Caching too much or wrong data • Debug Difficulties: Errors in transformations only appear at action time

Resources for Further Learning

Documentation & Official Resources

• Apache Hadoop Documentation • Apache Spark Documentation • Cloudera Documentation • Hortonworks Documentation

Books

• “Hadoop: The Definitive Guide” by Tom White • “Learning Spark” by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia • “Spark: The Definitive Guide” by Bill Chambers and Matei Zaharia • “Hadoop Operations” by Eric Sammer

Online Courses

• Cloudera Hadoop Training • Databricks Spark Training • Coursera “Big Data Analysis with Scala and Spark” • Udemy “Apache Spark with Scala – Hands On with Big Data”

Communities & Forums

• Stack Overflow – Hadoop and Spark tags • Apache Mailing Lists • Hadoop and Spark User Groups • GitHub repositories and examples
