Introduction to Big Data Fundamentals
Big Data refers to datasets that are too large or complex for traditional data processing software to handle efficiently. The concept is often framed around the “5 Vs”: Volume (size), Velocity (speed of generation), Variety (different formats), Veracity (uncertainty/quality), and Value (extracting insights). Hadoop and Spark are two of the most important frameworks in the Big Data ecosystem: Hadoop provides distributed storage and batch processing, while Spark offers fast, in-memory processing across diverse workloads (batch, SQL, streaming, and machine learning). Understanding these technologies matters because they enable organizations to analyze massive datasets for valuable insights, make data-driven decisions, and implement machine learning at scale.
Hadoop Ecosystem Components
Component | Primary Function | Key Characteristics |
---|---|---|
HDFS | Distributed file system | Fault-tolerant, high throughput, optimized for large files |
YARN | Resource management | Manages cluster resources and schedules jobs |
MapReduce | Processing framework | Batch processing model with Map and Reduce phases |
Hive | SQL-like interface | Data warehouse infrastructure for SQL queries on Hadoop |
HBase | NoSQL database | Column-oriented datastore for random, real-time access |
Pig | Data flow scripting | High-level language for expressing data analysis programs |
ZooKeeper | Coordination service | Synchronization, configuration, naming registry |
Oozie | Workflow scheduler | Manages Hadoop jobs with directed acyclic graphs (DAGs) |
Sqoop | Data transfer tool | Imports/exports data between RDBMS and Hadoop |
Flume | Log collection | Collects, aggregates, and moves streaming data |
Ambari | Management platform | Provisions, manages, and monitors Hadoop clusters |
HDFS Commands & Operations
Basic File Operations
# List files
hdfs dfs -ls /path
# Create directory
hdfs dfs -mkdir /path/directory
# Copy from local to HDFS
hdfs dfs -put localfile /path/in/hdfs
# Copy from HDFS to local
hdfs dfs -get /path/in/hdfs localfile
# View file content
hdfs dfs -cat /path/to/file
# Delete file or directory
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory # Recursive
# Copy within HDFS
hdfs dfs -cp /source/path /destination/path
# Move within HDFS
hdfs dfs -mv /source/path /destination/path
# Check file size
hdfs dfs -du -h /path/to/file
# Set replication factor
hdfs dfs -setrep -w 3 /path/to/file
Administrative Commands
# Check HDFS status
hdfs dfsadmin -report
# Enter safe mode
hdfs dfsadmin -safemode enter
# Leave safe mode
hdfs dfsadmin -safemode leave
# Check file system health
hdfs fsck /path
# Balance data across datanodes
hdfs balancer
# Refresh namenode configuration
hdfs dfsadmin -refreshNodes
# Set HDFS quota
hdfs dfsadmin -setQuota <quota> <directory>
hdfs dfsadmin -setSpaceQuota <quota> <directory>
MapReduce Concepts & Patterns
MapReduce Process Flow
- Input: Data split into chunks
- Map Phase: Transform data into key-value pairs
- Shuffle & Sort: Group data by keys
- Reduce Phase: Aggregate values by key
- Output: Final results written to HDFS
Common MapReduce Patterns
• Filtering: Select records matching criteria
• Counting: Count occurrences of items (walked through in the sketch below)
• Sorting: Order data by key
• Joining: Combine datasets on common keys
• Graph Processing: Process graph data structures
• Inverted Index: Build search indexes
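Before the Hadoop implementation below, a plain-Python simulation of the counting pattern makes the map → shuffle → reduce flow concrete. The tiny in-memory dataset and the variable names are illustrative only and are not part of any Hadoop API.
from collections import defaultdict
# Toy input: each string plays the role of one input line/split
lines = ["big data big", "data spark hadoop", "spark big"]
# Map phase: emit (word, 1) key-value pairs
mapped = [(word, 1) for line in lines for word in line.split()]
# Shuffle & Sort: group all values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)
# Reduce phase: aggregate the values for each key
counts = {key: sum(values) for key, values in sorted(grouped.items())}
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1, 'spark': 2}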
Basic Word Count Example (Java)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCount {
  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer (also usable as a combiner): sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  // A driver (main method) that configures and submits the Job is also required to run this on a cluster.
}
Apache Spark Architecture
Core Components
• Driver Program: Contains the application’s main function and creates the SparkContext/SparkSession
• SparkContext: Coordinates the application and communicates with executors
• Cluster Manager: Acquires resources (YARN, Mesos, Kubernetes, Standalone)
• Executors: JVM processes running tasks on worker nodes
• Tasks: Individual units of computation
• RDDs: Resilient Distributed Datasets (fundamental data abstraction)
• DAG Scheduler: Creates optimized execution plans
The sketch below shows how these pieces are wired together from PySpark.
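A minimal sketch, assuming a YARN cluster is available; the executor counts and memory sizes are placeholder values, not recommendations.
from pyspark.sql import SparkSession
# Driver program: creating the SparkSession/SparkContext here makes this process the driver
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("yarn")                            # cluster manager: yarn, local[*], spark://..., k8s://...
    .config("spark.executor.instances", "4")   # how many executor JVMs to request
    .config("spark.executor.cores", "2")       # tasks run on these executor cores
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
sc = spark.sparkContext  # the underlying SparkContext coordinates the executors
# A job: the DAG scheduler splits this into stages and tasks across executors
print(sc.parallelize(range(1000), numSlices=8).sum())  # 499500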
Key Spark Concepts
• Transformations: Lazy operations that create new RDDs (map, filter, etc.)
• Actions: Operations that trigger computation and return results (count, collect)
• Lazy Evaluation: Computations are only executed when an action is called
• Persistence: Caching data in memory or on disk for reuse
• Partitioning: How data is distributed across the cluster
• Shuffling: Redistributing data across partitions
• Broadcast Variables: Read-only variables cached on each machine
• Accumulators: Variables that can only be added to
The sketch below demonstrates lazy evaluation, caching, broadcast variables, and accumulators together.
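A minimal sketch, assuming an existing SparkContext named sc; the data and variable names are made up for illustration.
# Transformations are lazy: nothing executes yet
rdd = sc.parallelize(range(1, 101))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation
doubled = evens.map(lambda x: x * 2)       # transformation
# Persistence: keep the result in memory because it is reused below
doubled.cache()
# Actions trigger execution of the whole lineage
print(doubled.count())  # 50
print(doubled.sum())    # 5100
# Broadcast variable: read-only lookup shipped once to each executor
lookup = sc.broadcast({"even": "E"})
tagged = doubled.map(lambda x: (lookup.value["even"], x))
print(tagged.take(2))   # [('E', 4), ('E', 8)]
# Accumulator: tasks can only add to it; the driver reads the final value
acc = sc.accumulator(0)
doubled.foreach(lambda x: acc.add(1))
print(acc.value)        # 50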
Spark Core Operations (PySpark)
RDD Creation and Basic Operations
# Create RDD from collection
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Create RDD from file
rdd = sc.textFile("hdfs:///path/to/file.txt")
# Map transformation
mapped_rdd = rdd.map(lambda x: x * 2)
# Filter transformation
filtered_rdd = rdd.filter(lambda x: x > 2)
# FlatMap transformation
text_rdd = sc.textFile("hdfs:///path/to/file.txt")
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))
# Set operations
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([3, 4, 5])
union_rdd = rdd1.union(rdd2) # [1, 2, 3, 3, 4, 5]
intersection_rdd = rdd1.intersection(rdd2) # [3]
RDD Actions
# Count elements
count = rdd.count()
# Collect results to driver
result = rdd.collect()
# Take first n elements
first_elements = rdd.take(5)
# Count by key
key_counts = paired_rdd.countByKey()
# Reduce operation
sum_of_elements = rdd.reduce(lambda a, b: a + b)
# Save RDD to file
rdd.saveAsTextFile("hdfs:///path/to/output")
Key-Value Pair Operations
# Create pair RDD
pair_rdd = rdd.map(lambda x: (x, 1))
# ReduceByKey
word_counts = pair_rdd.reduceByKey(lambda a, b: a + b)
# GroupByKey
grouped = pair_rdd.groupByKey()
# Join operations
joined_rdd = rdd1.join(rdd2)
left_joined = rdd1.leftOuterJoin(rdd2)
right_joined = rdd1.rightOuterJoin(rdd2)
# Aggregate by key
result = pair_rdd.aggregateByKey(
0, # Zero value
lambda acc, val: acc + val, # Combine values within partitions
lambda acc1, acc2: acc1 + acc2 # Combine values across partitions
)
Spark SQL & DataFrames (PySpark)
DataFrame Creation
# Create DataFrame from RDD
from pyspark.sql import Row
row_rdd = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df = spark.createDataFrame(row_rdd)
# Create DataFrame from data source
df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)
df = spark.read.json("hdfs:///path/to/file.json")
df = spark.read.parquet("hdfs:///path/to/file.parquet")
# Create DataFrame from Hive table
df = spark.sql("SELECT * FROM employees")
# Create temporary view
df.createOrReplaceTempView("people")
result = spark.sql("SELECT * FROM people WHERE age > 30")
DataFrame Operations
# Select columns
result = df.select("name", "age")
# Filter data
filtered = df.filter(df.age > 30)
filtered = df.where("age > 30")
# Add column
with_col = df.withColumn("doubled_age", df.age * 2)
# Rename column
renamed = df.withColumnRenamed("age", "years")
# Drop column
dropped = df.drop("age")
# Distinct values
unique = df.select("department").distinct()
# Group by and aggregate
grouped = df.groupBy("department").agg({"salary": "avg", "age": "max"})
# Sort data
sorted_df = df.orderBy("age") # Ascending
sorted_df = df.orderBy(df.age.desc()) # Descending
# Join DataFrames
joined = df1.join(df2, df1.id == df2.id, "inner")
DataFrame Actions
# Show data
df.show()
df.show(10, False) # 10 rows, don't truncate strings
# Get number of rows
count = df.count()
# Get DataFrame schema
df.printSchema()
# Describe statistics
df.describe().show()
# Collect to Python list
rows = df.collect()
# Convert to Pandas DataFrame
pandas_df = df.toPandas()
# Write to file
df.write.csv("hdfs:///path/to/output.csv")
df.write.parquet("hdfs:///path/to/output.parquet")
df.write.saveAsTable("my_table")
Spark Streaming
Basic Streaming (PySpark)
# Import libraries
from pyspark.streaming import StreamingContext
# Create streaming context
ssc = StreamingContext(sc, batchDuration=10) # 10-second batches
# Create DStream from socket
lines = ssc.socketTextStream("localhost", 9999)
# Create DStream from files
lines = ssc.textFileStream("hdfs:///path/to/streaming/dir")
# Create DStream from Kafka (legacy DStream Kafka API, removed in Spark 3.0;
# prefer the Structured Streaming Kafka source shown below)
from pyspark.streaming.kafka import KafkaUtils
kafkaStream = KafkaUtils.createStream(ssc, "zk_host:2181", "consumer_group", {"topic": 1})
# Process streams
word_counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
# Output operations
word_counts.pprint() # Print to console
word_counts.saveAsTextFiles("hdfs:///path/to/output")
# Start and stop streaming
ssc.start()
ssc.awaitTermination()
# or ssc.stop(stopSparkContext=False, stopGraceFully=True)
Structured Streaming (PySpark)
# Create streaming DataFrame from socket
lines = spark.readStream.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
# Process streaming data
from pyspark.sql.functions import explode, split
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()
# Output sink
query = wordCounts.writeStream \
.outputMode("complete") \
.format("console") \
.start()
# Kafka as source
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host:port") \
.option("subscribe", "topic") \
.load()
# File sink with checkpointing
query = df.writeStream \
.format("parquet") \
.option("path", "hdfs:///path/to/output") \
.option("checkpointLocation", "hdfs:///path/to/checkpoint") \
.start()
query.awaitTermination()
Spark MLlib (Machine Learning)
ML Pipeline Operations
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Prepare data
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
assembler = VectorAssembler(
inputCols=["feature1", "feature2", "categoryVec"],
outputCol="features"
)
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Create and run pipeline
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
# Evaluate model
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)
Common ML Algorithms
# Linear Regression
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(training_data)
# Random Forest
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(numTrees=100)
rf_model = rf.fit(training_data)
# K-Means Clustering
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=5, seed=1)
model = kmeans.fit(data)
Comparison: Hadoop vs. Spark
Feature | Hadoop MapReduce | Apache Spark |
---|---|---|
Processing Model | Batch processing | Batch, interactive, streaming, ML |
Speed | Disk-based (slower) | In-memory (up to 100x faster) |
Ease of Use | Complex Java code | Simple APIs in Java, Scala, Python, R |
Data Processing | Read → Process → Write for each job | Cache intermediate data in memory |
Fault Tolerance | Replication for HDFS, rerunning tasks | RDD lineage and checkpointing |
Scheduling | YARN, internal scheduler | YARN, Mesos, Kubernetes, Standalone |
Memory Usage | Low memory requirements | Higher memory requirements |
Use Cases | Very large batch jobs, budget constraints | Interactive queries, iterative algorithms, streaming |
Libraries | Limited built-in libraries | Rich ecosystem (SQL, MLlib, GraphX, Streaming) |
Performance Tuning Best Practices
Hadoop Tuning
• Memory Configuration:
- Set mapreduce.map.java.opts and mapreduce.reduce.java.opts (formerly mapred.child.java.opts) to size mapper/reducer JVM heaps
- Tune yarn.nodemanager.resource.memory-mb for YARN memory allocation
• I/O Optimization:
- Use Sequence/Parquet files instead of text files
- Configure appropriate block sizes (dfs.blocksize)
- Enable compression (mapreduce.output.fileoutputformat.compress for job output, mapreduce.map.output.compress for intermediate data)
• Job Configuration:
- Set an appropriate number of reducers (mapreduce.job.reduces); the number of map tasks is driven by the input splits
- Use a combiner when possible to reduce data shuffling
- Reuse JVMs with mapred.job.reuse.jvm.num.tasks (MRv1 only; on YARN, consider uber jobs via mapreduce.job.ubertask.enable for very small jobs)
Spark Tuning
• Memory Management:
- Configure spark.executor.memory appropriately
- Set spark.memory.fraction (default 0.6) for execution vs. storage
- Use spark.memory.storageFraction to control storage vs. execution
• Parallelism:
- Set spark.default.parallelism to 2-3x CPU cores available
- Repartition skewed data with repartition(); use coalesce() only to reduce the partition count without a full shuffle
- Use partitionBy() for key-based operations
• Data Serialization:
- Use Kryo serialization with spark.serializer
- Register custom classes with KryoSerializer
- Consider spark.kryoserializer.buffer.max
• Data Formats:
- Use Parquet or ORC instead of CSV or JSON
- Partition data appropriately
- Consider bucketing for join operations
• Shuffle Optimization:
- Avoid groupByKey, prefer reduceByKey or aggregateByKey
- Use broadcast joins for small tables with broadcast()
- Monitor shuffle spill (memory and disk) in the Spark UI stage metrics (a configuration sketch follows below)
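A minimal sketch of how several of these settings can be applied when building a SparkSession; the specific values, paths, and column names are placeholder assumptions, not recommendations.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config("spark.executor.memory", "8g")            # executor heap size
    .config("spark.memory.fraction", "0.6")           # execution + storage share of the heap
    .config("spark.default.parallelism", "200")       # roughly 2-3x total cores
    .config("spark.sql.shuffle.partitions", "200")    # DataFrame shuffle partitions
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "256m")
    .getOrCreate()
)
# Prefer columnar formats and partition output on a query-friendly key (hypothetical paths/columns)
df = spark.read.parquet("hdfs:///data/events")
small = spark.read.parquet("hdfs:///data/dim_country")
# Broadcast join: ship the small table to every executor instead of shuffling the large one
joined = df.join(broadcast(small), "country_code")
joined.write.partitionBy("event_date").parquet("hdfs:///data/events_enriched")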
Common Hadoop & Spark Pitfalls
Hadoop Pitfalls
• Small Files Problem: HDFS is not optimized for many small files (see the compaction sketch below)
• NameNode Memory: Excessive numbers of files and blocks exhaust NameNode memory
• Improper Partitioning: Skewed data causes job imbalance
• Speculative Execution: Can cause duplicate processing
• Resource Allocation: Improper configuration wastes cluster resources
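A common mitigation for the small files problem is periodic compaction. A minimal PySpark sketch, assuming an existing SparkSession spark, with hypothetical paths and an arbitrary target file count:
# Read a directory full of small files and rewrite them as fewer, larger files
df = spark.read.parquet("hdfs:///data/raw_small_files/")
(df.repartition(16)            # target number of output files
   .write.mode("overwrite")
   .parquet("hdfs:///data/compacted/"))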
Spark Pitfalls
• Out of Memory Errors: Pulling too much data to the driver with collect() or broadcasting large objects
• Inefficient Joins: Not using broadcast joins for small tables
• Skewed Data: Uneven partitioning causes straggler tasks
• Unnecessary Shuffling: Using groupByKey instead of reduceByKey (compared in the sketch below)
• Poor Caching Strategy: Caching too much data, or the wrong data
• Debug Difficulties: Errors in transformations only surface when an action runs
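To make the shuffling pitfall concrete: both pipelines below produce the same counts, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every value across the network. Assumes an existing SparkContext sc; the data is made up.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)])
# Pitfall: groupByKey shuffles every (key, value) pair, then sums on the reduce side
bad = pairs.groupByKey().mapValues(lambda vals: sum(vals))
# Better: reduceByKey combines values map-side before shuffling
good = pairs.reduceByKey(lambda a, b: a + b)
print(sorted(bad.collect()))   # [('a', 3), ('b', 2)]
print(sorted(good.collect()))  # [('a', 3), ('b', 2)]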
Resources for Further Learning
Documentation & Official Resources
• Apache Hadoop Documentation
• Apache Spark Documentation
• Cloudera Documentation
• Hortonworks Documentation
Books
• “Hadoop: The Definitive Guide” by Tom White
• “Learning Spark” by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
• “Spark: The Definitive Guide” by Bill Chambers and Matei Zaharia
• “Hadoop Operations” by Eric Sammer
Online Courses
• Cloudera Hadoop Training
• Databricks Spark Training
• Coursera “Big Data Analysis with Scala and Spark”
• Udemy “Apache Spark with Scala – Hands On with Big Data”
Communities & Forums
• Stack Overflow – Hadoop and Spark tags
• Apache Mailing Lists
• Hadoop and Spark User Groups
• GitHub repositories and examples