Introduction to Apache Accumulo
Apache Accumulo is a highly scalable, distributed key-value store built on Apache Hadoop, ZooKeeper, and Thrift. Modeled after Google’s BigTable design, Accumulo extends the BigTable architecture with cell-level security and server-side programming mechanisms. It’s designed to handle massive amounts of structured data across clusters of commodity hardware with high availability and fault tolerance.
Key Characteristics:
- Cell-level security through fine-grained access controls
- Server-side programming through Iterators
- High performance for massive datasets (petabytes of data and billions of entries)
- Designed for parallel processing with MapReduce integration
- Native support for multi-tenancy
- Strong consistency model
Core Concepts and Architecture
Data Model
| Concept | Description |
|---|---|
| Key-Value Pair | The basic unit of storage in Accumulo |
| Row ID | The primary key for a record |
| Column Family | Used to organize related columns together |
| Column Qualifier | Identifies specific data within a column family |
| Column Visibility | Controls access to data using security expressions |
| Timestamp | Version control for values in a cell |
| Value | The actual data stored in a cell |
Full Key Format: row ID + column family + column qualifier + column visibility + timestamp
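To make the format concrete, here is a full key assembled with the Java client API, using the constructor that takes row, family, qualifier, visibility, and timestamp (the field values are made up):
import org.apache.accumulo.core.data.Key;
import org.apache.hadoop.io.Text;

Key key = new Key(new Text("user123"),   // row ID
        new Text("contact"),             // column family
        new Text("email"),               // column qualifier
        new Text("PUBLIC"),              // column visibility expression
        1700000000000L);                 // timestamp in milliseconds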
Accumulo Components
| Component | Function |
|---|---|
| Tablet Server | Handles read/write operations for a subset of tablets |
| Master (renamed Manager in Accumulo 2.1) | Coordinates tablet servers and performs administrative operations |
| Garbage Collector | Removes deleted files from HDFS |
| Monitor | Web interface for monitoring the system |
| Tracer | Tracks timing of operations across the distributed system |
| ZooKeeper | Coordinates the distributed components and stores metadata |
| Tablet | A contiguous partition of a table |
| Minor Compaction | Process that flushes in-memory data to a new file in HDFS |
| Major Compaction | Process that combines multiple files in HDFS |
Installation and Setup
Prerequisites
- Java 8 or newer
- Hadoop Distributed File System (HDFS)
- Apache ZooKeeper ensemble
- Sufficient disk space and memory
Basic Installation Steps
Download and extract:
$ wget https://downloads.apache.org/accumulo/[VERSION]/accumulo-[VERSION]-bin.tar.gz
$ tar -xzf accumulo-[VERSION]-bin.tar.gz
$ mv accumulo-[VERSION] /path/to/accumulo
Configure environment variables:
$ export ACCUMULO_HOME=/path/to/accumulo
$ export PATH=$PATH:$ACCUMULO_HOME/bin
Configure core properties in accumulo-site.xml (Accumulo 2.x replaces this file with accumulo.properties, using key=value entries):
<property>
  <name>instance.volumes</name>
  <value>hdfs://namenode:8020/accumulo</value>
</property>
<property>
  <name>instance.zookeeper.host</name>
  <value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
</property>
Initialize Accumulo:
$ accumulo init --instance-name myinstance --password mypassword
Start Accumulo (accumulo-cluster ships with 2.x; 1.x uses start-all.sh):
$ accumulo-cluster start
Basic Operations
Command Line Interface
Login to shell:
$ accumulo shell -u myuser -p mypassword
Basic commands:
| Command | Description | Example |
|---|---|---|
| help | Show available commands | help |
| tables | List tables | tables |
| createtable | Create a new table | createtable mytable |
| deletetable | Delete a table | deletetable mytable |
| scan | Scan records in a table | scan -t mytable |
| insert | Insert a record | insert row cf cq value |
| delete | Delete a record | delete row cf cq |
| flush | Flush a table’s in-memory data to disk | flush -t mytable |
| compact | Compact a table | compact -t mytable |
| createuser | Create a new user | createuser myuser |
| userpermissions | Show user permissions | userpermissions -u myuser |
| exit | Exit the shell | exit |
Working with Tables
Create a table:
shell> createtable mytable
Add data:
shell> insert row1 columnFamily columnQualifier value1
shell> insert row2 columnFamily columnQualifier value2
Scan data:
shell> scan -t mytable
Delete data:
shell> delete row1 columnFamily columnQualifier
Delete table:
shell> deletetable mytable
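The same lifecycle is available programmatically through TableOperations; a minimal sketch, assuming an AccumuloClient named client (see the Client API section below):
// assumes: AccumuloClient client = Accumulo.newClient().from("client.properties").build();
client.tableOperations().create("mytable");          // throws TableExistsException if it exists
boolean exists = client.tableOperations().exists("mytable");
client.tableOperations().delete("mytable");          // throws TableNotFoundException if absent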
Table Configurations
Set table properties:
shell> config -t mytable -s table.cache.block.enable=true
View table properties:
shell> config -t mytable
(Add -f <prefix> to filter the listing, e.g. config -t mytable -f table.cache.)
Table splits:
shell> addsplits -t mytable -sf splitfile.txt
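The equivalent pre-splitting and property change from Java, again assuming an AccumuloClient named client (the split points are illustrative):
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;

SortedSet<Text> splits = new TreeSet<>(Arrays.asList(new Text("g"), new Text("n"), new Text("t")));
client.tableOperations().addSplits("mytable", splits);
client.tableOperations().setProperty("mytable", "table.cache.block.enable", "true");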
Advanced Features
Security
Table Permissions
| Permission | Description |
|---|---|
| READ | Can read table data |
| WRITE | Can write to the table |
| BULK_IMPORT | Can bulk import files |
| ALTER_TABLE | Can alter table properties |
| GRANT | Can grant permissions to others |
| DROP_TABLE | Can delete the table |
Grant permissions:
shell> grant Table.READ -t mytable -u user1
shell> grant Table.WRITE -t mytable -u user1
Revoke permissions:
shell> revoke Table.WRITE -t mytable -u user1
Cell-Level Security
Insert with visibility labels:
shell> insert row1 cf cq value -l "(A&B)|C"
(Visibility expressions that mix & and | must be parenthesized.)
Scan with authorizations:
shell> scan -t mytable -s A,B
Set user authorizations:
shell> setauths -u user1 -s A,B,C
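A short sketch of the same flow in Java, assuming an AccumuloClient named client and a table named mytable:
import java.util.Map;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// write a cell protected by a visibility expression
try (BatchWriter writer = client.createBatchWriter("mytable")) {
    Mutation m = new Mutation("row1");
    m.put(new Text("cf"), new Text("cq"),
            new ColumnVisibility("(A&B)|C"),
            new Value("value".getBytes()));
    writer.addMutation(m);
}
// grant authorizations to a user, then scan with a subset of them
client.securityOperations().changeUserAuthorizations("user1", new Authorizations("A", "B", "C"));
try (Scanner s = client.createScanner("mytable", new Authorizations("A", "B"))) {
    for (Map.Entry<Key, Value> e : s) {
        System.out.println(e.getKey() + " -> " + e.getValue());
    }
}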
Iterators
Iterators are server-side programming mechanisms that allow data transformation, filtering, and aggregation operations during scans, compactions, or both.
Types of Iterators:
| Type | When Applied | Use Case |
|---|---|---|
| Scan-time | During scan operations | Query-time filtering/transformations |
| Minor compaction | During minor compactions | Data reduction during compaction |
| Major compaction | During major compactions | Permanent data transformations |
Iterator Priority Levels:
Iterators attached to a table run in order of ascending priority: lower numbers run first, and each iterator consumes the output of the one before it. Accumulo’s default iterators, such as the VersioningIterator, are configured at priority 20. User iterators are commonly given priorities above 20 so they see data after version filtering, or below 20 when they must observe every version.
Attach iterator to a table:
shell> config -t mytable -s table.iterator.scan.myscanner=10,org.example.MyIterator
shell> config -t mytable -s table.iterator.scan.myscanner.opt.myOption=myValue
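The same attachment via the Java API; a sketch assuming an AccumuloClient named client (org.example.MyIterator is the same placeholder as above):
import java.util.EnumSet;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

IteratorSetting setting = new IteratorSetting(10, "myscanner", "org.example.MyIterator");
setting.addOption("myOption", "myValue");
client.tableOperations().attachIterator("mytable", setting, EnumSet.of(IteratorScope.scan));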
Common Built-in Iterators:
| Iterator | Purpose |
|---|---|
| VersioningIterator | Keeps only the N most recent versions of each key (enabled by default with N=1) |
| RegExFilter | Filters based on regular expressions |
| AgeOffFilter | Removes entries older than a threshold |
| WholeRowIterator | Processes entire rows at once |
| SummingCombiner | Aggregates numeric values |
Bulk Import
Bulk import loads pre-sorted RFiles directly into a table, bypassing the write-ahead log and the in-memory maps. RFiles are typically produced by a MapReduce job using AccumuloFileOutputFormat, or with the RFile client API (see the sketch below).
Import the data from the shell (Accumulo 1.x form; 2.x drops the failure directory):
shell> table mytable
shell> importdirectory /path/to/rfiles /path/to/failures true
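A minimal sketch of producing a sorted RFile with the Accumulo 2.x public RFile API (the HDFS output path is illustrative):
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.client.rfile.RFileWriter;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;

try (RFileWriter writer = RFile.newWriter()
        .to("hdfs://namenode:8020/bulk/data.rf")
        .withFileSystem(FileSystem.get(new Configuration()))
        .build()) {
    // keys must be appended in sorted order
    writer.append(new Key(new Text("row1"), new Text("cf"), new Text("cq")), new Value("value1".getBytes()));
    writer.append(new Key(new Text("row2"), new Text("cf"), new Text("cq")), new Value("value2".getBytes()));
}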
Accumulo Client API (Java)
Basic connection:
// Accumulo 2.x public API (ClientContext/ClientInfo are internal classes; use the builder instead)
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;

AccumuloClient client = Accumulo.newClient()
        .from("/path/to/client.properties")
        .build();

// Or configure the connection directly:
AccumuloClient client = Accumulo.newClient()
        .to("myinstance", "zoo1:2181,zoo2:2181")
        .as("user", "password")
        .build();

// client.properties equivalent of the direct configuration:
// instance.name=myinstance
// instance.zookeepers=zoo1:2181,zoo2:2181
// auth.type=password
// auth.principal=user
// auth.token=password
Write data:
try (BatchWriter writer = client.createBatchWriter("mytable")) {
    Mutation mutation = new Mutation("row1");
    mutation.put("cf", "cq", "value");
    writer.addMutation(mutation);
    writer.flush(); // optional; close() also flushes any buffered mutations
}
Read data:
try (Scanner scanner = client.createScanner("mytable", Authorizations.EMPTY)) {
    scanner.setRange(Range.exact("row1"));
    scanner.fetchColumn(new Text("cf"), new Text("cq"));
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
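For reading many ranges in parallel, a BatchScanner sketch (the thread count is illustrative):
import java.util.Arrays;
import org.apache.accumulo.core.client.BatchScanner;

try (BatchScanner scanner = client.createBatchScanner("mytable", Authorizations.EMPTY, 4)) {
    scanner.setRanges(Arrays.asList(Range.exact("row1"), Range.exact("row2")));
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}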
Performance Tuning
Memory Management
| Property | Description | Recommendation |
|---|---|---|
| tserver.memory.maps.max | Maximum memory for in-memory maps | 25-30% of available RAM |
| tserver.cache.data.size | Size of data block cache | 15-20% of available RAM |
| tserver.cache.index.size | Size of index cache | 5-10% of available RAM |
| tserver.sort.buffer.size | Buffer used when sorting write-ahead logs during recovery | 10% of in-memory map size |
Compaction Settings
Configure compaction thresholds:
shell> config -t mytable -s table.compaction.major.ratio=3.0
shell> config -t mytable -s table.compaction.minor.idle=5m
Write Performance
| Strategy | Description |
|---|---|
| Increase batch writer threads | Set the client property batch.writer.threads.max (or BatchWriterConfig.setMaxWriteThreads()) higher for more parallelism (see the sketch after this table) |
| Increase batch writer memory | Set the client property batch.writer.memory.max (or BatchWriterConfig.setMaxMemory()) higher for larger batches |
| Balance split points | Ensure even distribution with proper presplitting |
| Tune compaction settings | Adjust ratios to avoid compaction storms |
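These client-side knobs can also be set in code; a sketch using BatchWriterConfig (the numbers are illustrative, not recommendations):
import java.util.concurrent.TimeUnit;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;

BatchWriterConfig config = new BatchWriterConfig()
        .setMaxMemory(64 * 1024 * 1024)       // buffer up to 64 MB of mutations
        .setMaxWriteThreads(8)                // threads sending mutations to tablet servers
        .setMaxLatency(2, TimeUnit.SECONDS);  // flush buffered mutations at least this often
try (BatchWriter writer = client.createBatchWriter("mytable", config)) {
    // add mutations as usual
}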
Read Performance
| Strategy | Description |
|---|---|
| Increase scanner threads | Set tserver.readahead.concurrent.max higher |
| Tune cache sizes | Balance tserver.cache.data.size and tserver.cache.index.size |
| Use locality groups | Group related column families for better scan performance (see the sketch after this table) |
| Use bloom filters | Enable with table.bloom.enabled=true for tables with random access patterns |
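Locality groups can also be defined from Java; a sketch with hypothetical group and column family names:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.io.Text;

Map<String, Set<Text>> groups = new HashMap<>();
groups.put("metadata", new HashSet<>(Arrays.asList(new Text("meta"))));
groups.put("content", new HashSet<>(Arrays.asList(new Text("body"), new Text("attachments"))));
client.tableOperations().setLocalityGroups("mytable", groups);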
Common Challenges and Solutions
| Challenge | Solution |
|---|---|
| Slow table scans | Check for appropriate split points, increase scanner threads, use proper scan authorizations |
| Out of memory errors | Reduce memory maps size, check for memory leaks in iterators, tune JVM settings |
| Write performance issues | Increase batch writer threads and memory, check for bottlenecks in preprocessing |
| ZooKeeper connection issues | Ensure proper ZooKeeper quorum, check network connectivity, validate configuration |
| CompactionExecutor warnings | Increase compaction threads, tune compaction ratios, balance tablet distribution |
| Tablet server failures | Check for balancing issues, monitor resource usage, investigate logs for errors |
| HDFS latency issues | Validate HDFS health, check DataNode distribution, examine network performance |
Best Practices
Design Guidelines
- Row key design: Create row keys that distribute load evenly
- Column family usage: Use a small number of column families
- Iterators: Implement efficient iterators that minimize memory usage
- Locality groups: Group related column families for faster scans
- Splits: Pre-split tables based on expected data distribution
- Write patterns: Avoid “hot-spotting” with well-distributed row keys (see the sketch after this list)
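As an illustration of the row key guidance, one common pattern prefixes time-ordered keys with a bounded shard derived from a stable hash; the shard count and key layout here are hypothetical:
// hypothetical sharding scheme: 16 shards keep sequential writes spread across tablets
String docId = "doc42";
String timestamp = "20240115T120000";
int shard = Math.floorMod(docId.hashCode(), 16);
String rowId = String.format("%02d_%s_%s", shard, timestamp, docId);  // e.g. "07_20240115T120000_doc42"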
Operational Tips
- Monitoring: Set up alerts for important metrics like tablet server health
- Regular maintenance: Schedule compactions during off-peak hours
- Backups: Regularly export important tables
- Upgrades: Test upgrades in a staging environment before production
- Resource planning: Monitor and plan for growth in data size
- Security updates: Regularly review and update security policies
- Client connections: Use connection pooling and proper retry logic
Resources for Further Learning
Official Documentation
- Apache Accumulo user manual and API docs: https://accumulo.apache.org
Books
- “Accumulo: Application Development, Table Design, and Best Practices” by Aaron Cordova, Michael Wall, and Billie Rinaldi
- “Apache Accumulo for Developers” by Guðmundur Jónsson
Community Resources
- Mailing lists: user@accumulo.apache.org and dev@accumulo.apache.org
- Source code and issue tracker: https://github.com/apache/accumulo
Related Technologies
- Apache Hadoop
- Apache ZooKeeper
- Apache Thrift
- Apache HBase (alternative to Accumulo)
