The Ultimate Apache Cassandra Cheatsheet: Master NoSQL Database Management

Introduction to Apache Cassandra

Apache Cassandra is a free, open-source, distributed NoSQL database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and superior performance for write-heavy workloads. Originally developed at Facebook to power their Inbox Search feature, Cassandra was designed to handle massive datasets across distributed systems with exceptional fault tolerance and tunable consistency.

Core Cassandra Concepts

Data Model Architecture

Concept	Description
Keyspace	Container for tables, similar to a schema in relational databases
Table	Collection of rows and columns, similar to tables in relational databases
Partition Key	First part of primary key that determines data distribution across nodes
Clustering Key	Optional second part of primary key that determines sort order within a partition
Column	Name-value pair with a defined data type
Row	Collection of columns identified by a primary key

Cassandra’s CAP Characteristics

Consistency: Tunable consistency levels (ONE, QUORUM, ALL, etc.)
Availability: No single point of failure, continuous availability
Partition Tolerance: Designed to operate across distributed nodes

Cassandra vs. Traditional RDBMS

Feature	Cassandra	Traditional RDBMS
Data Model	Column-family, schema-flexible	Row-based, rigid schema
Scaling	Horizontal (add nodes)	Vertical (bigger servers)
Transactions	Limited (lightweight transactions)	ACID compliant
Joins	Not supported natively	Fully supported
Architecture	Masterless, peer-to-peer	Master-slave
Best For	Write-heavy workloads, time-series data	Complex queries, transactional data
Consistency	Tunable (eventual to strong)	Strong consistency

Cassandra Query Language (CQL)

CQL Data Types

Category	Types
Numeric	int, bigint, smallint, tinyint, float, double, decimal, varint
Text	text, varchar, ascii
Time/Date	timestamp, date, time, duration
Identifiers	uuid, timeuuid
Collections	list, set, map, tuple
Others	boolean, blob, inet, counter

Basic CQL Commands

Database Operations

-- Create a keyspace
CREATE KEYSPACE my_keyspace 
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Use a keyspace
USE my_keyspace;

-- Drop a keyspace
DROP KEYSPACE my_keyspace;

Table Operations

-- Create a table
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    email text,
    created_at timestamp
);

-- Alter table (add column)
ALTER TABLE users ADD age int;

-- Drop table
DROP TABLE users;

-- Truncate table (remove all data)
TRUNCATE users;

Data Manipulation

-- Insert data
INSERT INTO users (user_id, first_name, last_name, email, created_at)
VALUES (uuid(), 'John', 'Doe', 'john@example.com', toTimestamp(now()));

-- Update data
UPDATE users 
SET email = 'newemail@example.com' 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Delete data
DELETE FROM users 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Select data
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

Secondary Indexes

-- Create an index
CREATE INDEX ON users (email);

-- Drop an index
DROP INDEX users_email_idx;
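
As a rough sketch of how an index is typically used, the statements below name an index explicitly and then filter on an indexed column. The index name users_last_name_idx is a hypothetical choice, and the query assumes the users table and the email index from the statements above still exist.

```cql
-- Name the index explicitly so DROP INDEX does not rely on a generated name
-- (users_last_name_idx is a hypothetical name)
CREATE INDEX IF NOT EXISTS users_last_name_idx ON users (last_name);

-- Filter on an indexed column without specifying the partition key
SELECT user_id, first_name, last_name
FROM users
WHERE email = 'john@example.com';
```

Index lookups on a non-partition-key column may touch many nodes, so a dedicated lookup table is often the better fit for high-volume queries.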

Collections

-- Table with collections
CREATE TABLE user_preferences (
    user_id uuid PRIMARY KEY,
    favorite_colors set<text>,
    address_history list<text>,
    phone_numbers map<text, text>
);

-- Insert into collections
INSERT INTO user_preferences (user_id, favorite_colors, address_history, phone_numbers)
VALUES (
    uuid(), 
    {'blue', 'green', 'red'},
    ['123 Main St', '456 Oak Ave'],
    {'home': '555-1234', 'work': '555-5678'}
);

-- Update collections
UPDATE user_preferences 
SET favorite_colors = favorite_colors + {'yellow'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Remove from collections
UPDATE user_preferences 
SET phone_numbers = phone_numbers - {'work'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
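
A few more collection operations, sketched against the user_preferences table defined above. The literal values are illustrative only.

```cql
-- Append to a list and set a single map entry in one statement
UPDATE user_preferences
SET address_history = address_history + ['789 Pine Rd'],
    phone_numbers['mobile'] = '555-9999'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Remove an element from a set
UPDATE user_preferences
SET favorite_colors = favorite_colors - {'red'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Delete a single map entry by key
DELETE phone_numbers['home'] FROM user_preferences
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```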

Cassandra Data Modeling

Key Principles

Model around queries, not entities
Denormalize for performance
Partition based on access patterns
Avoid hotspots by distributing workload
Minimize partitions read per query
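
To illustrate the first two principles, here is a minimal sketch that starts from a query ("find a user by email") and builds a table for it. The table name users_by_email is hypothetical, and its columns duplicate data that would also live in a users table.

```cql
-- Lookup table designed for one query; email is the partition key,
-- so the read below touches exactly one partition
CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    user_id uuid,
    first_name text,
    last_name text
);

SELECT user_id, first_name, last_name
FROM users_by_email
WHERE email = 'john@example.com';
```

The application is responsible for writing to both tables whenever a user is created or changes email.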

Primary Key Design Patterns

Pattern	Structure	Use Case
Simple Primary Key	`PRIMARY KEY (id)`	Single record lookup
Compound Key	`PRIMARY KEY ((partition_key), clustering_column)`	Sorted data within partition
Composite Partition Key	`PRIMARY KEY ((key1, key2), clustering_column)`	Distributing data evenly
Time Series	`PRIMARY KEY ((entity_id), timestamp)`	Time-ordered events per entity
Bucketing	`PRIMARY KEY ((entity_id, bucket), timestamp)`	Managing wide partitions

Common Data Modeling Techniques

One-to-Many Relationship

CREATE TABLE posts (
    user_id uuid,
    post_id timeuuid,
    content text,
    created_at timestamp,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

Many-to-Many Relationship

-- User to groups
CREATE TABLE user_groups (
    user_id uuid,
    group_id uuid,
    joined_at timestamp,
    PRIMARY KEY (user_id, group_id)
);

-- Group to users (duplication for query efficiency)
CREATE TABLE group_users (
    group_id uuid,
    user_id uuid,
    joined_at timestamp,
    PRIMARY KEY (group_id, user_id)
);
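
Because the membership is stored twice, both tables need to be written together. One possible approach, sketched here with made-up UUID and timestamp values, is a logged batch, which ensures that either both inserts are eventually applied or neither is.

```cql
-- Logged batch: keeps the two denormalized tables in step
BEGIN BATCH
    INSERT INTO user_groups (user_id, group_id, joined_at)
    VALUES (123e4567-e89b-12d3-a456-426614174000,
            9f1c2d3e-4b5a-4c6d-8e7f-0a1b2c3d4e5f,
            '2024-01-15 10:00:00+0000');
    INSERT INTO group_users (group_id, user_id, joined_at)
    VALUES (9f1c2d3e-4b5a-4c6d-8e7f-0a1b2c3d4e5f,
            123e4567-e89b-12d3-a456-426614174000,
            '2024-01-15 10:00:00+0000');
APPLY BATCH;
```

Logged batches add coordination overhead, so they suit small multi-table writes like this one rather than bulk loading.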

Time Series Data

CREATE TABLE temperature_by_sensor (
    sensor_id uuid,
    day date,
    timestamp timestamp,
    temperature float,
    PRIMARY KEY ((sensor_id, day), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
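
The queries below sketch how this table is meant to be read: every query names the full partition key (sensor_id and day) and may then restrict the clustering column. The sensor ID and dates are placeholder values.

```cql
-- Latest 10 readings for one sensor on one day
-- (newest first, following the table's clustering order)
SELECT timestamp, temperature
FROM temperature_by_sensor
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND day = '2024-01-15'
LIMIT 10;

-- Readings within a time window on that day
SELECT timestamp, temperature
FROM temperature_by_sensor
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND day = '2024-01-15'
  AND timestamp >= '2024-01-15 08:00:00+0000'
  AND timestamp <  '2024-01-15 12:00:00+0000';
```

A query that spans several days has to read one partition per day, which the application can issue as separate queries.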

Consistency Levels

Read Consistency Levels

Level	Description
ONE	Return data from nearest replica
QUORUM	Return data when majority of replicas respond
LOCAL_QUORUM	Quorum of replicas in same datacenter
EACH_QUORUM	Quorum of replicas in each datacenter
ALL	Return data when all replicas respond
LOCAL_ONE	Return data from nearest replica in local datacenter

Write Consistency Levels

Level	Description
ANY	Write succeeds once any node accepts it, even if only as a hint stored by the coordinator (hinted handoff)
ONE	Write confirmed by at least one replica
QUORUM	Write confirmed by majority of replicas
LOCAL_QUORUM	Quorum of replicas in same datacenter
EACH_QUORUM	Quorum of replicas in each datacenter
ALL	Write confirmed by all replicas
LOCAL_ONE	Write confirmed by at least one replica in local datacenter

Setting Consistency Levels

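-- Set consistency for the current cqlsh session (applies to reads and writes)
CONSISTENCY QUORUM;

-- Per-query consistency is not part of CQL 3 (USING CONSISTENCY was removed);
-- set it on the statement through the client driver, or change the cqlsh
-- session level before running the query
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;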

Cassandra Architecture

Key Components

Component	Description
Node	Single Cassandra instance
Cluster	Collection of nodes that store your data
Data Center	Group of related nodes (often a physical location)
Rack	Collection of servers (often on same switch)
Ring	The ring structure that represents data distribution
Gossip Protocol	How nodes exchange state information
Snitch	Determines network topology
Partitioner	Determines how data is distributed

Replication Strategies

Strategy	Description	Usage
SimpleStrategy	Places replicas on consecutive nodes around the ring	Testing, single datacenter
NetworkTopologyStrategy	Precise control over replica placement by datacenter	Production, multiple datacenters
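
A minimal sketch of a multi-datacenter keyspace follows. The datacenter names dc1 and dc2 are assumptions; they must match the names reported by your snitch (visible in nodetool status).

```cql
-- Three replicas in dc1, two in dc2 (dc1 and dc2 are placeholder names)
CREATE KEYSPACE my_keyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};

-- Replication can be changed later
ALTER KEYSPACE my_keyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};
```

After raising a replication factor, run a repair so the new replicas receive the existing data.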

Write Path

Write to commit log (durability)
Write to memtable (in-memory)
Periodically flush memtable to SSTable (immutable on disk)
Eventually compact SSTables

Read Path

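Check row cache (if enabled)
Check memtable
For each candidate SSTable, consult the Bloom filter, then the partition key cache (if enabled), then the partition index
Merge results from memtable and SSTables, keeping the most recent value by timestamp
Perform read repair if needed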

Performance Optimization

Performance Tuning Parameters

Parameter	Description	Recommendation
`concurrent_reads`	Number of concurrent reads	16 × number of drives
`concurrent_writes`	Number of concurrent writes	8 × number of CPU cores
`memtable_flush_writers`	Writers for flushing memtables	Number of disks
`compaction_throughput_mb_per_sec`	Throttle for compaction	Start at 16-32, adjust based on load
`read_request_timeout_in_ms`	Read timeout	Default 5000ms
`write_request_timeout_in_ms`	Write timeout	Default 2000ms

Compaction Strategies

Strategy	Description	Best For
SizeTieredCompactionStrategy (STCS)	Default, groups similarly sized SSTables	Write-heavy workloads
LeveledCompactionStrategy (LCS)	Organize SSTables in levels	Read-heavy workloads
TimeWindowCompactionStrategy (TWCS)	Optimized for time series data	Time series, TTL data
DateTieredCompactionStrategy (DTCS)	Deprecated, replaced by TWCS	Legacy systems
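
Compaction is configured per table. The statements below are a sketch using tables from earlier sections; the window and SSTable sizes are illustrative starting points, not tuned recommendations.

```cql
-- Time series table: one compaction window per day
ALTER TABLE temperature_by_sensor
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};

-- Read-heavy table: leveled compaction
ALTER TABLE users
WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};
```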

Caching Options

Cache Type	Description	Use Case
Row Cache	Caches entire rows	Frequently accessed, rarely changing rows
Key Cache	Caches the on-disk location of partition keys within SSTables	Enabled by default, saves an index lookup on reads
Counter Cache	Caches counters	High-volume counters
Chunk Cache	Caches chunks of data	Improves read performance for wide rows
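
Key and row caching are set per table through the caching property. This is a sketch against the users table; the value of 100 rows per partition is an arbitrary example.

```cql
-- Cache all partition key locations and the first 100 rows of each partition
ALTER TABLE users
WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};

-- Turn both caches off for a table
ALTER TABLE users
WITH caching = {'keys': 'NONE', 'rows_per_partition': 'NONE'};
```

The row cache only takes effect when row_cache_size_in_mb in cassandra.yaml is greater than 0, which is not the default.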

Backup and Recovery

Backup Strategies

Snapshots

# Create snapshot of all keyspaces
nodetool snapshot

# Create snapshot of specific keyspace
nodetool snapshot my_keyspace

# Create snapshot with a name
nodetool snapshot -t backup_name my_keyspace

Incremental Backups

Enable in cassandra.yaml:

incremental_backups: true

Restore Process

# Stop Cassandra
service cassandra stop

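# Clear live SSTable files only (keeps the snapshots/ and backups/ subdirectories).
# The table directory name ends in a generated table ID, matched here with a glob;
# this assumes exactly one directory matches.
find /var/lib/cassandra/data/my_keyspace/my_table-*/ -maxdepth 1 -type f -delete

# Restore from snapshot (snapshots are stored under the table directory)
cp /var/lib/cassandra/data/my_keyspace/my_table-*/snapshots/snapshot_name/* /var/lib/cassandra/data/my_keyspace/my_table-*/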

# Restart Cassandra
service cassandra start

# Run repair
nodetool repair my_keyspace my_table

Monitoring and Maintenance

Essential nodetool Commands

# Check cluster status
nodetool status

# Check node status
nodetool info

# Get statistics
nodetool tablestats my_keyspace.my_table

# Run repair
nodetool repair

# Run cleanup
nodetool cleanup

# Flush memtables to disk
nodetool flush

# Compact SSTables
nodetool compact

Monitoring Metrics

Metric Category	Key Metrics to Monitor
Latency	Read/write latency, request coordinator latency
Throughput	Read/write requests per second
Compaction	Pending compactions, compaction history
Storage	Disk usage, SSTable count
Cache	Cache hit rates, cache size
GC	Garbage collection pauses, frequency
Thread Pools	Pending/blocked tasks

Recommended Monitoring Tools

Prometheus with Cassandra exporter
Grafana dashboards
DataStax OpsCenter
Instaclustr Console
JMX tools (JConsole, jmxterm)

Common Challenges and Solutions

Challenge: Tombstones

Solution:

Set appropriate TTL values
Use USING TIMESTAMP for overwrites
Schedule regular tombstone GC with nodetool garbagecollect
Configure gc_grace_seconds based on repair frequency
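
The TTL and gc_grace_seconds settings above can be sketched in CQL as follows. The values (24 hours, 30 days, 10 days) are examples only; gc_grace_seconds should stay longer than the interval between repairs.

```cql
-- Expire a row 24 hours after it is written
INSERT INTO users (user_id, first_name, email)
VALUES (uuid(), 'Temp', 'temp@example.com')
USING TTL 86400;

-- Check the remaining TTL (in seconds) of a column
SELECT TTL(email) FROM users
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Table-level default TTL (30 days) and tombstone grace period (10 days)
ALTER TABLE temperature_by_sensor
WITH default_time_to_live = 2592000
AND gc_grace_seconds = 864000;
```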

Challenge: Wide Partitions

Solution:

Implement bucketing in primary key design
Split logical entities across multiple tables
Use time-based bucketing for time series
Monitor partition sizes with nodetool tablehistograms

Challenge: Hot Partitions

Solution:

Review partition key design
Add more granularity to partition key
Consider application-level sharding
Cache hot data in application

Challenge: Read Before Write

Solution:

Use lightweight transactions (IF EXISTS, IF NOT EXISTS, or IF <condition>)
Consider timestamp-based conflict resolution
Design for idempotent operations
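
A brief sketch of conditional writes against the users table, with placeholder values. Each statement returns an [applied] column that reports whether the condition held.

```cql
-- Insert only if no row with this key exists
INSERT INTO users (user_id, first_name, last_name, email)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'John', 'Doe', 'john@example.com')
IF NOT EXISTS;

-- Update only if the current value matches (compare-and-set)
UPDATE users
SET email = 'newemail@example.com'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
IF email = 'john@example.com';

-- Delete only if the row exists
DELETE FROM users
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
IF EXISTS;
```

Lightweight transactions run a Paxos round and cost several extra round trips, so they are best reserved for writes that truly need the check.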

Best Practices

Schema Design

Design tables based on query patterns
Keep related data in same partition
Limit partition size (aim for <100MB)
Choose appropriate compaction strategy
Use TTL for temporary data

Operational

Run repairs regularly
Monitor tombstone counts
Plan capacity in advance
Test with realistic data volumes
Use vnodes for easier scaling
Consider dedicated seed nodes

Application Integration

Use prepared statements
Implement retry policies
Use token-aware load balancing
Batch with caution (keep unlogged batches to a single partition; use logged batches only when several tables must stay in sync)
Consider asynchronous operations for throughput

Resources for Further Learning

Official Resources

Books

“Cassandra: The Definitive Guide” by Jeff Carpenter and Eben Hewitt
“Mastering Apache Cassandra” by Nishant Neeraj
“Learning Apache Cassandra” by Sandeep Yarabarla

Online Courses

DataStax Academy courses
Udemy: “Apache Cassandra for Beginners”
Pluralsight: “Getting Started with Apache Cassandra”

Community Resources

Stack Overflow Cassandra tag
Cassandra mailing lists
#cassandra IRC channel
CQLSH cheat sheets

Remember: Cassandra excels at handling massive scale with high availability, but requires thinking differently about data modeling than traditional relational databases. Design for your queries, not your entities!
