Introduction to Big Data Fundamentals
Big Data refers to datasets that are too large or complex for traditional data processing software to handle efficiently. The concept is often framed around the “5 Vs”: Volume (size), Velocity (speed of generation), Variety (different formats), Veracity (uncertainty/quality), and Value (extracting insights). Hadoop and Spark are two of the most important frameworks in the Big Data ecosystem: Hadoop provides distributed storage and batch processing, while Spark offers fast, in-memory processing across diverse workloads (batch, SQL, streaming, and machine learning). Understanding these technologies matters because they enable organizations to analyze massive datasets for valuable insights, make data-driven decisions, and implement machine learning at scale.
Hadoop Ecosystem Components
Component | Primary Function | Key Characteristics |
---|---|---|
HDFS | Distributed file system | Fault-tolerant, high throughput, optimized for large files |
YARN | Resource management | Manages cluster resources and schedules jobs |
MapReduce | Processing framework | Batch processing model with Map and Reduce phases |
Hive | SQL-like interface | Data warehouse infrastructure for SQL queries on Hadoop |
HBase | NoSQL database | Column-oriented datastore for random, real-time access |
Pig | Data flow scripting | High-level language for expressing data analysis programs |
ZooKeeper | Coordination service | Synchronization, configuration, naming registry |
Oozie | Workflow scheduler | Manages Hadoop jobs with directed acyclic graphs (DAGs) |
Sqoop | Data transfer tool | Imports/exports data between RDBMS and Hadoop |
Flume | Log collection | Collects, aggregates, and moves streaming data |
Ambari | Management platform | Provisions, manages, and monitors Hadoop clusters |
HDFS Commands & Operations
Basic File Operations
# List files
hdfs dfs -ls /path
# Create directory
hdfs dfs -mkdir /path/directory
# Copy from local to HDFS
hdfs dfs -put localfile /path/in/hdfs
# Copy from HDFS to local
hdfs dfs -get /path/in/hdfs localfile
# View file content
hdfs dfs -cat /path/to/file
# Delete file or directory
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory # Recursive
# Copy within HDFS
hdfs dfs -cp /source/path /destination/path
# Move within HDFS
hdfs dfs -mv /source/path /destination/path
# Check file size
hdfs dfs -du -h /path/to/file
# Set replication factor
hdfs dfs -setrep -w 3 /path/to/file
Administrative Commands
# Check HDFS status
hdfs dfsadmin -report
# Enter safe mode
hdfs dfsadmin -safemode enter
# Leave safe mode
hdfs dfsadmin -safemode leave
# Check file system health
hdfs fsck /path
# Balance data across datanodes
hdfs balancer
# Refresh namenode configuration
hdfs dfsadmin -refreshNodes
# Set HDFS quota
hdfs dfsadmin -setQuota <quota> <directory>
hdfs dfsadmin -setSpaceQuota <quota> <directory>
MapReduce Concepts & Patterns
MapReduce Process Flow
- Input: Data split into chunks
- Map Phase: Transform data into key-value pairs
- Shuffle & Sort: Group data by keys
- Reduce Phase: Aggregate values by key
- Output: Final results written to HDFS
Common MapReduce Patterns
• Filtering: Select records matching criteria
• Counting: Count occurrences of items (walked through in the sketch below)
• Sorting: Order data by key
• Joining: Combine datasets on common keys
• Graph Processing: Process graph data structures
• Inverted Index: Build search indexes
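Before the Hadoop implementation below, a plain-Python simulation of the counting pattern makes the map → shuffle → reduce flow concrete. The tiny in-memory dataset and the variable names are illustrative only and are not part of any Hadoop API.
from collections import defaultdict
# Toy input: each string plays the role of one input line/split
lines = ["big data big", "data spark hadoop", "spark big"]
# Map phase: emit (word, 1) key-value pairs
mapped = [(word, 1) for line in lines for word in line.split()]
# Shuffle & Sort: group all values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)
# Reduce phase: aggregate the values for each key
counts = {key: sum(values) for key, values in sorted(grouped.items())}
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1, 'spark': 2}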
Basic Word Count Example (Java)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCount {
  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer (also usable as a combiner): sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  // A driver (main method) that configures and submits the Job is also required to run this on a cluster.
}
Apache Spark Architecture
Core Components
• Driver Program: Contains the application’s main function and creates the SparkContext/SparkSession
• SparkContext: Coordinates the application and communicates with executors
• Cluster Manager: Acquires resources (YARN, Mesos, Kubernetes, Standalone)
• Executors: JVM processes running tasks on worker nodes
• Tasks: Individual units of computation
• RDDs: Resilient Distributed Datasets (fundamental data abstraction)
• DAG Scheduler: Creates optimized execution plans
The sketch below shows how these pieces are wired together from PySpark.
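A minimal sketch, assuming a YARN cluster is available; the executor counts and memory sizes are placeholder values, not recommendations.
from pyspark.sql import SparkSession
# Driver program: creating the SparkSession/SparkContext here makes this process the driver
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("yarn")                            # cluster manager: yarn, local[*], spark://..., k8s://...
    .config("spark.executor.instances", "4")   # how many executor JVMs to request
    .config("spark.executor.cores", "2")       # tasks run on these executor cores
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
sc = spark.sparkContext  # the underlying SparkContext coordinates the executors
# A job: the DAG scheduler splits this into stages and tasks across executors
print(sc.parallelize(range(1000), numSlices=8).sum())  # 499500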
Key Spark Concepts
• Transformations: Lazy operations that create new RDDs (map, filter, etc.)
• Actions: Operations that trigger computation and return results (count, collect)
• Lazy Evaluation: Computations are only executed when an action is called
• Persistence: Caching data in memory or on disk for reuse
• Partitioning: How data is distributed across the cluster
• Shuffling: Redistributing data across partitions
• Broadcast Variables: Read-only variables cached on each machine
• Accumulators: Variables that can only be added to
The sketch below demonstrates lazy evaluation, caching, broadcast variables, and accumulators together.
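A minimal sketch, assuming an existing SparkContext named sc; the data and variable names are made up for illustration.
# Transformations are lazy: nothing executes yet
rdd = sc.parallelize(range(1, 101))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation
doubled = evens.map(lambda x: x * 2)       # transformation
# Persistence: keep the result in memory because it is reused below
doubled.cache()
# Actions trigger execution of the whole lineage
print(doubled.count())  # 50
print(doubled.sum())    # 5100
# Broadcast variable: read-only lookup shipped once to each executor
lookup = sc.broadcast({"even": "E"})
tagged = doubled.map(lambda x: (lookup.value["even"], x))
print(tagged.take(2))   # [('E', 4), ('E', 8)]
# Accumulator: tasks can only add to it; the driver reads the final value
acc = sc.accumulator(0)
doubled.foreach(lambda x: acc.add(1))
print(acc.value)        # 50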
Spark Core Operations (PySpark)
RDD Creation and Basic Operations
# Create RDD from collection
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Create RDD from file
rdd = sc.textFile("hdfs:///path/to/file.txt")
# Map transformation
mapped_rdd = rdd.map(lambda x: x * 2)
# Filter transformation
filtered_rdd = rdd.filter(lambda x: x > 2)
# FlatMap transformation
text_rdd = sc.textFile("hdfs:///path/to/file.txt")
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))
# Set operations
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
union_rdd = rdd1.union(rdd2) # [1, 2, 3, 3, 4, 5]
intersection_rdd = rdd1.intersection(rdd2) # [3]
RDD Actions
# Count elements
count = rdd.count()
# Collect results to driver
result = rdd.collect()
# Take first n elements
first_elements = rdd.take(5)
# Count by key
key_counts = paired_rdd.countByKey()
# Reduce operation
sum_of_elements = rdd.reduce(lambda a, b: a + b)
# Save RDD to file
rdd.saveAsTextFile("hdfs:///path/to/output")
Key-Value Pair Operations
# Create pair RDD
pair_rdd = rdd.map(lambda x: (x, 1))
# ReduceByKey
word_counts = pair_rdd.reduceByKey(lambda a, b: a + b)
# GroupByKey
grouped = pair_rdd.groupByKey()
# Join operations
joined_rdd = rdd1.join(rdd2)
left_joined = rdd1.leftOuterJoin(rdd2)
right_joined = rdd1.rightOuterJoin(rdd2)
# Aggregate by key
result = pair_rdd.aggregateByKey(
0, # Zero value
lambda acc, val: acc + val, # Combine values within partitions
lambda acc1, acc2: acc1 + acc2 # Combine values across partitions
)
Spark SQL & DataFrames (PySpark)
DataFrame Creation
# Create DataFrame from RDD
from pyspark.sql import Row
row_rdd = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df = spark.createDataFrame(row_rdd)
# Create DataFrame from data source
df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)
df = spark.read.json("hdfs:///path/to/file.json")
df = spark.read.parquet("hdfs:///path/to/file.parquet")
# Create DataFrame from Hive table
df = spark.sql("SELECT * FROM employees")
# Create temporary view
df.createOrReplaceTempView("people")
result = spark.sql("SELECT * FROM people WHERE age > 30")
DataFrame Operations
# Select columns
result = df.select("name", "age")
# Filter data
filtered = df.filter(df.age > 30)
filtered = df.where("age > 30")
# Add column
with_col = df.withColumn("doubled_age", df.age * 2)
# Rename column
renamed = df.withColumnRenamed("age", "years")
# Drop column
dropped = df.drop("age")
# Distinct values
unique = df.select("department").distinct()
# Group by and aggregate
grouped = df.groupBy("department").agg({"salary": "avg", "age": "max"})
# Sort data
sorted_df = df.orderBy("age") # Ascending
sorted_df = df.orderBy(df.age.desc()) # Descending
# Join DataFrames
joined = df1.join(df2, df1.id == df2.id, "inner")
DataFrame Actions
# Show data
df.show()
df.show(10, False) # 10 rows, don't truncate strings
# Get number of rows
count = df.count()
# Get DataFrame schema
df.printSchema()
# Describe statistics
df.describe().show()
# Collect to Python list
rows = df.collect()
# Convert to Pandas DataFrame
pandas_df = df.toPandas()
# Write to file
df.write.csv("hdfs:///path/to/output.csv")
df.write.parquet("hdfs:///path/to/output.parquet")
df.write.saveAsTable("my_table")
Spark Streaming
Basic Streaming (PySpark)
# Import libraries
from pyspark.streaming import StreamingContext
# Create streaming context
ssc = StreamingContext(sc, batchDuration=10) # 10-second batches
# Create DStream from socket
lines = ssc.socketTextStream("localhost", 9999)
# Create DStream from files
lines = ssc.textFileStream("hdfs:///path/to/streaming/dir")
# Create DStream from Kafka (legacy DStream Kafka API, removed in Spark 3.0;
# prefer the Structured Streaming Kafka source shown below)
from pyspark.streaming.kafka import KafkaUtils
kafkaStream = KafkaUtils.createStream(ssc, "zk_host:2181", "consumer_group", {"topic": 1})
# Process streams
word_counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
# Output operations
word_counts.pprint() # Print to console
word_counts.saveAsTextFiles("hdfs:///path/to/output")
# Start and stop streaming
ssc.start()
ssc.awaitTermination()
# or ssc.stop(stopSparkContext=False, stopGraceFully=True)
Structured Streaming (PySpark)
# Create streaming DataFrame from socket
lines = spark.readStream.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Process streaming data
from pyspark.sql.functions import explode, split
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()
# Output sink
query = wordCounts.writeStream \
.outputMode("complete") \
.format("console") \
.start()
# Kafka as source
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host:port") \
.option("subscribe", "topic") \
.load()
# File sink with checkpointing
query = df.writeStream \
.format("parquet") \
.option("path", "hdfs:///path/to/output") \
.option("checkpointLocation", "hdfs:///path/to/checkpoint") \
.start()
query.awaitTermination()
Spark MLlib (Machine Learning)
ML Pipeline Operations
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Prepare data
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
assembler = VectorAssembler(
inputCols=["feature1", "feature2", "categoryVec"],
outputCol="features"
)
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Create and run pipeline
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
# Evaluate model
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)
Common ML Algorithms
# Linear Regression
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(training_data)
# Random Forest
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(numTrees=100)
rf_model = rf.fit(training_data)
# K-Means Clustering
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=5, seed=1)
model = kmeans.fit(data)
Comparison: Hadoop vs. Spark
Feature | Hadoop MapReduce | Apache Spark |
---|---|---|
Processing Model | Batch processing | Batch, interactive, streaming, ML |
Speed | Disk-based (slower) | In-memory (up to 100x faster) |
Ease of Use | Complex Java code | Simple APIs in Java, Scala, Python, R |
Data Processing | Read → Process → Write for each job | Cache intermediate data in memory |
Fault Tolerance | Replication for HDFS, rerunning tasks | RDD lineage and checkpointing |
Scheduling | YARN, internal scheduler | YARN, Mesos, Kubernetes, Standalone |
Memory Usage | Low memory requirements | Higher memory requirements |
Use Cases | Very large batch jobs, budget constraints | Interactive queries, iterative algorithms, streaming |
Libraries | Limited built-in libraries | Rich ecosystem (SQL, MLlib, GraphX, Streaming) |
Performance Tuning Best Practices
Hadoop Tuning
• Memory Configuration:
- Set mapreduce.map.java.opts and mapreduce.reduce.java.opts (formerly mapred.child.java.opts) to size mapper/reducer JVM heaps
- Tune yarn.nodemanager.resource.memory-mb for YARN memory allocation
• I/O Optimization:
- Use Sequence/Parquet files instead of text files
- Configure appropriate block sizes (dfs.blocksize)
- Enable compression (mapreduce.output.fileoutputformat.compress for job output, mapreduce.map.output.compress for intermediate data)
• Job Configuration:
- Set an appropriate number of reducers (mapreduce.job.reduces); the number of map tasks is driven by the input splits
- Use a combiner when possible to reduce data shuffling
- Reuse JVMs with mapred.job.reuse.jvm.num.tasks (MRv1 only; on YARN, consider uber jobs via mapreduce.job.ubertask.enable for very small jobs)
Spark Tuning
• Memory Management:
- Configure spark.executor.memory appropriately
- Set spark.memory.fraction (default 0.6) for execution vs. storage
- Use spark.memory.storageFraction to control storage vs. execution
• Parallelism:
- Set spark.default.parallelism to 2-3x CPU cores available
- Repartition skewed data with repartition(); use coalesce() only to reduce the partition count without a full shuffle
- Use partitionBy() for key-based operations
• Data Serialization:
- Use Kryo serialization with spark.serializer
- Register custom classes with KryoSerializer
- Consider spark.kryoserializer.buffer.max
• Data Formats:
- Use Parquet or ORC instead of CSV or JSON
- Partition data appropriately
- Consider bucketing for join operations
• Shuffle Optimization:
- Avoid groupByKey, prefer reduceByKey or aggregateByKey
- Use broadcast joins for small tables with broadcast()
- Monitor shuffle spill (memory and disk) in the Spark UI stage metrics (a configuration sketch follows below)
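A minimal sketch of how several of these settings can be applied when building a SparkSession; the specific values, paths, and column names are placeholder assumptions, not recommendations.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config("spark.executor.memory", "8g")            # executor heap size
    .config("spark.memory.fraction", "0.6")           # execution + storage share of the heap
    .config("spark.default.parallelism", "200")       # roughly 2-3x total cores
    .config("spark.sql.shuffle.partitions", "200")    # DataFrame shuffle partitions
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "256m")
    .getOrCreate()
)
# Prefer columnar formats and partition output on a query-friendly key (hypothetical paths/columns)
df = spark.read.parquet("hdfs:///data/events")
small = spark.read.parquet("hdfs:///data/dim_country")
# Broadcast join: ship the small table to every executor instead of shuffling the large one
joined = df.join(broadcast(small), "country_code")
joined.write.partitionBy("event_date").parquet("hdfs:///data/events_enriched")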
Common Hadoop & Spark Pitfalls
Hadoop Pitfalls
• Small Files Problem: HDFS is not optimized for many small files (see the compaction sketch below)
• NameNode Memory: Excessive numbers of files and blocks exhaust NameNode memory
• Improper Partitioning: Skewed data causes job imbalance
• Speculative Execution: Can cause duplicate processing
• Resource Allocation: Improper configuration wastes cluster resources
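A common mitigation for the small files problem is periodic compaction. A minimal PySpark sketch, assuming an existing SparkSession spark, with hypothetical paths and an arbitrary target file count:
# Read a directory full of small files and rewrite them as fewer, larger files
df = spark.read.parquet("hdfs:///data/raw_small_files/")
(df.repartition(16)            # target number of output files
   .write.mode("overwrite")
   .parquet("hdfs:///data/compacted/"))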
Spark Pitfalls
• Out of Memory Errors: Pulling too much data to the driver with collect() or broadcasting large objects
• Inefficient Joins: Not using broadcast joins for small tables
• Skewed Data: Uneven partitioning causes straggler tasks
• Unnecessary Shuffling: Using groupByKey instead of reduceByKey (compared in the sketch below)
• Poor Caching Strategy: Caching too much data, or the wrong data
• Debug Difficulties: Errors in transformations only surface when an action runs
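To make the shuffling pitfall concrete: both pipelines below produce the same counts, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every value across the network. Assumes an existing SparkContext sc; the data is made up.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)])
# Pitfall: groupByKey shuffles every (key, value) pair, then sums on the reduce side
bad = pairs.groupByKey().mapValues(lambda vals: sum(vals))
# Better: reduceByKey combines values map-side before shuffling
good = pairs.reduceByKey(lambda a, b: a + b)
print(sorted(bad.collect()))   # [('a', 3), ('b', 2)]
print(sorted(good.collect()))  # [('a', 3), ('b', 2)]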
Resources for Further Learning
Documentation & Official Resources
• Apache Hadoop Documentation
• Apache Spark Documentation
• Cloudera Documentation
• Hortonworks Documentation
Books
• “Hadoop: The Definitive Guide” by Tom White
• “Learning Spark” by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
• “Spark: The Definitive Guide” by Bill Chambers and Matei Zaharia
• “Hadoop Operations” by Eric Sammer
Online Courses
• Cloudera Hadoop Training
• Databricks Spark Training
• Coursera “Big Data Analysis with Scala and Spark”
• Udemy “Apache Spark with Scala – Hands On with Big Data”
Communities & Forums
• Stack Overflow – Hadoop and Spark tags
• Apache Mailing Lists
• Hadoop and Spark User Groups
• GitHub repositories and examples