Introduction to Apache Accumulo
Apache Accumulo is a highly scalable, distributed key-value store built on Apache Hadoop, ZooKeeper, and Thrift. Modeled after Google’s BigTable design, Accumulo extends the BigTable architecture with cell-level security and server-side programming mechanisms. It’s designed to handle massive amounts of structured data across clusters of commodity hardware with high availability and fault tolerance.
Key Characteristics:
- Cell-level security through fine-grained access controls
- Server-side programming through Iterators
- High performance for massive datasets (petabytes of data and billions of entries)
- Designed for parallel processing with MapReduce integration
- Native support for multi-tenancy
- Strong consistency model
Core Concepts and Architecture
Data Model
| Concept | Description |
|---|---|
| Key-Value Pair | The basic unit of storage in Accumulo |
| Row ID | The primary key for a record |
| Column Family | Used to organize related columns together |
| Column Qualifier | Identifies specific data within a column family |
| Column Visibility | Controls access to data using security expressions |
| Timestamp | Version control for values in a cell |
| Value | The actual data stored in a cell |
Full Key Format: row ID + column family + column qualifier + column visibility + timestamp
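To make the format concrete, here is a full key assembled with the Java client API, using the constructor that takes row, family, qualifier, visibility, and timestamp (the field values are made up):
import org.apache.accumulo.core.data.Key;
import org.apache.hadoop.io.Text;

Key key = new Key(new Text("user123"),   // row ID
        new Text("contact"),             // column family
        new Text("email"),               // column qualifier
        new Text("PUBLIC"),              // column visibility expression
        1700000000000L);                 // timestamp in milliseconds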
Accumulo Components
| Component | Function |
|---|---|
| Tablet Server | Handles read/write operations for a subset of tablets |
| Master (renamed Manager in Accumulo 2.1) | Coordinates tablet servers and performs administrative operations |
| Garbage Collector | Removes deleted files from HDFS |
| Monitor | Web interface for monitoring the system |
| Tracer | Tracks timing of operations across the distributed system |
| ZooKeeper | Coordinates the distributed components and stores metadata |
| Tablet | A contiguous partition of a table |
| Minor Compaction | Process that flushes in-memory data to a new file in HDFS |
| Major Compaction | Process that combines multiple files in HDFS |
Installation and Setup
Prerequisites
- Java 8 or newer
- Hadoop Distributed File System (HDFS)
- Apache ZooKeeper ensemble
- Sufficient disk space and memory
Basic Installation Steps
Download and extract:
$ wget https://downloads.apache.org/accumulo/[VERSION]/accumulo-[VERSION]-bin.tar.gz
$ tar -xzf accumulo-[VERSION]-bin.tar.gz
$ mv accumulo-[VERSION] /path/to/accumulo
Configure environment variables:
$ export ACCUMULO_HOME=/path/to/accumulo
$ export PATH=$PATH:$ACCUMULO_HOME/bin
Configure core properties in accumulo-site.xml (Accumulo 2.x replaces this file with accumulo.properties, using key=value entries):
<property>
  <name>instance.volumes</name>
  <value>hdfs://namenode:8020/accumulo</value>
</property>
<property>
  <name>instance.zookeeper.host</name>
  <value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
</property>
Initialize Accumulo:
$ accumulo init --instance-name myinstance --password mypassword
Start Accumulo (accumulo-cluster ships with 2.x; 1.x uses start-all.sh):
$ accumulo-cluster start
Basic Operations
Command Line Interface
Login to shell:
$ accumulo shell -u myuser -p mypassword
Basic commands:
| Command | Description | Example |
|---|---|---|
| help | Show available commands | help |
| tables | List tables | tables |
| createtable | Create a new table | createtable mytable |
| deletetable | Delete a table | deletetable mytable |
| scan | Scan records in a table | scan -t mytable |
| insert | Insert a record | insert row cf cq value |
| delete | Delete a record | delete row cf cq |
| flush | Flush a table’s in-memory data to disk | flush -t mytable |
| compact | Compact a table | compact -t mytable |
| createuser | Create a new user | createuser myuser |
| userpermissions | Show user permissions | userpermissions -u myuser |
| exit | Exit the shell | exit |
Working with Tables
Create a table:
shell> createtable mytable
Add data:
shell> insert row1 columnFamily columnQualifier value1
shell> insert row2 columnFamily columnQualifier value2
Scan data:
shell> scan -t mytable
Delete data:
shell> delete row1 columnFamily columnQualifier
Delete table:
shell> deletetable mytable
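The same lifecycle is available programmatically through TableOperations; a minimal sketch, assuming an AccumuloClient named client (see the Client API section below):
// assumes: AccumuloClient client = Accumulo.newClient().from("client.properties").build();
client.tableOperations().create("mytable");          // throws TableExistsException if it exists
boolean exists = client.tableOperations().exists("mytable");
client.tableOperations().delete("mytable");          // throws TableNotFoundException if absent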
Table Configurations
Set table properties:
shell> config -t mytable -s table.cache.block.enable=true
View table properties:
shell> config -t mytable
(Add -f <prefix> to filter the listing, e.g. config -t mytable -f table.cache.)
Table splits:
shell> addsplits -t mytable -sf splitfile.txt
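The equivalent pre-splitting and property change from Java, again assuming an AccumuloClient named client (the split points are illustrative):
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;

SortedSet<Text> splits = new TreeSet<>(Arrays.asList(new Text("g"), new Text("n"), new Text("t")));
client.tableOperations().addSplits("mytable", splits);
client.tableOperations().setProperty("mytable", "table.cache.block.enable", "true");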
Advanced Features
Security
Table Permissions
| Permission | Description |
|---|---|
| READ | Can read table data |
| WRITE | Can write to the table |
| BULK_IMPORT | Can bulk import files |
| ALTER_TABLE | Can alter table properties |
| GRANT | Can grant permissions to others |
| DROP_TABLE | Can delete the table |
Grant permissions:
shell> grant Table.READ -t mytable -u user1
shell> grant Table.WRITE -t mytable -u user1
Revoke permissions:
shell> revoke Table.WRITE -t mytable -u user1
Cell-Level Security
Insert with visibility labels:
shell> insert row1 cf cq value -l "(A&B)|C"
(Visibility expressions that mix & and | must be parenthesized.)
Scan with authorizations:
shell> scan -t mytable -s A,B
Set user authorizations:
shell> setauths -u user1 -s A,B,C
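A short sketch of the same flow in Java, assuming an AccumuloClient named client and a table named mytable:
import java.util.Map;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// write a cell protected by a visibility expression
try (BatchWriter writer = client.createBatchWriter("mytable")) {
    Mutation m = new Mutation("row1");
    m.put(new Text("cf"), new Text("cq"),
            new ColumnVisibility("(A&B)|C"),
            new Value("value".getBytes()));
    writer.addMutation(m);
}
// grant authorizations to a user, then scan with a subset of them
client.securityOperations().changeUserAuthorizations("user1", new Authorizations("A", "B", "C"));
try (Scanner s = client.createScanner("mytable", new Authorizations("A", "B"))) {
    for (Map.Entry<Key, Value> e : s) {
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}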
Iterators
Iterators are server-side programming mechanisms that allow data transformation, filtering, and aggregation operations during scans, compactions, or both.
Types of Iterators:
| Type | When Applied | Use Case |
|---|---|---|
| Scan-time | During scan operations | Query-time filtering/transformations |
| Minor compaction | During minor compactions | Data reduction during compaction |
| Major compaction | During major compactions | Permanent data transformations |
Iterator Priority Levels:
Iterators attached to a table run in order of ascending priority: lower numbers run first, and each iterator consumes the output of the one before it. Accumulo’s default iterators, such as the VersioningIterator, are configured at priority 20. User iterators are commonly given priorities above 20 so they see data after version filtering, or below 20 when they must observe every version.
Attach iterator to a table:
shell> config -t mytable -s table.iterator.scan.myscanner=10,org.example.MyIterator
shell> config -t mytable -s table.iterator.scan.myscanner.opt.myOption=myValue
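The same attachment via the Java API; a sketch assuming an AccumuloClient named client (org.example.MyIterator is the same placeholder as above):
import java.util.EnumSet;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

IteratorSetting setting = new IteratorSetting(10, "myscanner", "org.example.MyIterator");
setting.addOption("myOption", "myValue");
client.tableOperations().attachIterator("mytable", setting, EnumSet.of(IteratorScope.scan));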
Common Built-in Iterators:
| Iterator | Purpose |
|---|---|
| VersioningIterator | Keeps only the N most recent versions of each key (enabled by default with N=1) |
| RegExFilter | Filters based on regular expressions |
| AgeOffFilter | Removes entries older than a threshold |
| WholeRowIterator | Processes entire rows at once |
| SummingCombiner | Aggregates numeric values |
Bulk Import
Bulk import loads pre-sorted RFiles directly into a table, bypassing the write-ahead log and the in-memory maps. RFiles are typically produced by a MapReduce job using AccumuloFileOutputFormat, or with the RFile client API (see the sketch below).
Import the data from the shell (Accumulo 1.x form; 2.x drops the failure directory):
shell> table mytable
shell> importdirectory /path/to/rfiles /path/to/failures true
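A minimal sketch of producing a sorted RFile with the Accumulo 2.x public RFile API (the HDFS output path is illustrative):
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.client.rfile.RFileWriter;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;

try (RFileWriter writer = RFile.newWriter()
        .to("hdfs://namenode:8020/bulk/data.rf")
        .withFileSystem(FileSystem.get(new Configuration()))
        .build()) {
    // keys must be appended in sorted order
    writer.append(new Key(new Text("row1"), new Text("cf"), new Text("cq")), new Value("value1".getBytes()));
    writer.append(new Key(new Text("row2"), new Text("cf"), new Text("cq")), new Value("value2".getBytes()));
}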
Accumulo Client API (Java)
Basic connection:
// Accumulo 2.x public API (ClientContext/ClientInfo are internal classes; use the builder instead)
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;

AccumuloClient client = Accumulo.newClient()
        .from("/path/to/client.properties")
        .build();

// Or configure the connection directly:
AccumuloClient client = Accumulo.newClient()
        .to("myinstance", "zoo1:2181,zoo2:2181")
        .as("user", "password")
        .build();

// client.properties equivalent of the direct configuration:
// instance.name=myinstance
// instance.zookeepers=zoo1:2181,zoo2:2181
// auth.type=password
// auth.principal=user
// auth.token=password
Write data:
try (BatchWriter writer = client.createBatchWriter("mytable")) {
    Mutation mutation = new Mutation("row1");
    mutation.put("cf", "cq", "value");
    writer.addMutation(mutation);
    writer.flush(); // optional; close() also flushes any buffered mutations
}
Read data:
try (Scanner scanner = client.createScanner("mytable", Authorizations.EMPTY)) {
    scanner.setRange(Range.exact("row1"));
    scanner.fetchColumn(new Text("cf"), new Text("cq"));
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
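For reading many ranges in parallel, a BatchScanner sketch (the thread count is illustrative):
import java.util.Arrays;
import org.apache.accumulo.core.client.BatchScanner;

try (BatchScanner scanner = client.createBatchScanner("mytable", Authorizations.EMPTY, 4)) {
    scanner.setRanges(Arrays.asList(Range.exact("row1"), Range.exact("row2")));
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}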
Performance Tuning
Memory Management
| Property | Description | Recommendation |
|---|---|---|
| tserver.memory.maps.max | Maximum memory for in-memory maps | 25-30% of available RAM |
| tserver.cache.data.size | Size of data block cache | 15-20% of available RAM |
| tserver.cache.index.size | Size of index cache | 5-10% of available RAM |
| tserver.sort.buffer.size | Buffer used when sorting write-ahead logs during recovery | 10% of in-memory map size |
Compaction Settings
Configure compaction thresholds:
shell> config -t mytable -s table.compaction.major.ratio=3.0
shell> config -t mytable -s table.compaction.minor.idle=5m
Write Performance
| Strategy | Description |
|---|---|
| Increase batch writer threads | Set the client property batch.writer.threads.max (or BatchWriterConfig.setMaxWriteThreads()) higher for more parallelism (see the sketch after this table) |
| Increase batch writer memory | Set the client property batch.writer.memory.max (or BatchWriterConfig.setMaxMemory()) higher for larger batches |
| Balance split points | Ensure even distribution with proper presplitting |
| Tune compaction settings | Adjust ratios to avoid compaction storms |
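These client-side knobs can also be set in code; a sketch using BatchWriterConfig (the numbers are illustrative, not recommendations):
import java.util.concurrent.TimeUnit;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;

BatchWriterConfig config = new BatchWriterConfig()
        .setMaxMemory(64 * 1024 * 1024)       // buffer up to 64 MB of mutations
        .setMaxWriteThreads(8)                // threads sending mutations to tablet servers
        .setMaxLatency(2, TimeUnit.SECONDS);  // flush buffered mutations at least this often
try (BatchWriter writer = client.createBatchWriter("mytable", config)) {
    // add mutations as usual
}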
Read Performance
| Strategy | Description |
|---|---|
| Increase scanner threads | Set tserver.readahead.concurrent.max higher |
| Tune cache sizes | Balance tserver.cache.data.size and tserver.cache.index.size |
| Use locality groups | Group related column families for better scan performance (see the sketch after this table) |
| Use bloom filters | Enable with table.bloom.enabled=true for tables with random access patterns |
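Locality groups can also be defined from Java; a sketch with hypothetical group and column family names:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.io.Text;

Map<String, Set<Text>> groups = new HashMap<>();
groups.put("metadata", new HashSet<>(Arrays.asList(new Text("meta"))));
groups.put("content", new HashSet<>(Arrays.asList(new Text("body"), new Text("attachments"))));
client.tableOperations().setLocalityGroups("mytable", groups);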
Common Challenges and Solutions
| Challenge | Solution |
|---|---|
| Slow table scans | Check for appropriate split points, increase scanner threads, use proper scan authorizations |
| Out of memory errors | Reduce memory maps size, check for memory leaks in iterators, tune JVM settings |
| Write performance issues | Increase batch writer threads and memory, check for bottlenecks in preprocessing |
| ZooKeeper connection issues | Ensure proper ZooKeeper quorum, check network connectivity, validate configuration |
| CompactionExecutor warnings | Increase compaction threads, tune compaction ratios, balance tablet distribution |
| Tablet server failures | Check for balancing issues, monitor resource usage, investigate logs for errors |
| HDFS latency issues | Validate HDFS health, check DataNode distribution, examine network performance |
Best Practices
Design Guidelines
- Row key design: Create row keys that distribute load evenly
- Column family usage: Use a small number of column families
- Iterators: Implement efficient iterators that minimize memory usage
- Locality groups: Group related column families for faster scans
- Splits: Pre-split tables based on expected data distribution
- Write patterns: Avoid “hot-spotting” with well-distributed row keys (see the sketch after this list)
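As an illustration of the row key guidance, one common pattern prefixes time-ordered keys with a bounded shard derived from a stable hash; the shard count and key layout here are hypothetical:
// hypothetical sharding scheme: 16 shards keep sequential writes spread across tablets
String docId = "doc42";
String timestamp = "20240115T120000";
int shard = Math.floorMod(docId.hashCode(), 16);
String rowId = String.format("%02d_%s_%s", shard, timestamp, docId);  // e.g. "07_20240115T120000_doc42"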
Operational Tips
- Monitoring: Set up alerts for important metrics like tablet server health
- Regular maintenance: Schedule compactions during off-peak hours
- Backups: Regularly export important tables
- Upgrades: Test upgrades in a staging environment before production
- Resource planning: Monitor and plan for growth in data size
- Security updates: Regularly review and update security policies
- Client connections: Use connection pooling and proper retry logic
Resources for Further Learning
Official Documentation
- Apache Accumulo user manual and API docs: https://accumulo.apache.org
Books
- “Accumulo: Application Development, Table Design, and Best Practices” by Aaron Cordova, Michael Wall, and Billie Rinaldi
- “Apache Accumulo for Developers” by Guðmundur Jónsson
Community Resources
- Mailing lists: user@accumulo.apache.org and dev@accumulo.apache.org
- Source code and issue tracker: https://github.com/apache/accumulo
Related Technologies
- Apache Hadoop
- Apache ZooKeeper
- Apache Thrift
- Apache HBase (alternative to Accumulo)
