1. Introduction 1
The Dawn of Big Data 1
The Problem with Relational Database Systems 5
Nonrelational Database Systems, Not-Only SQL or NoSQL? 8
Dimensions 10
Scalability 12
Database (De-)Normalization 13
Building Blocks 16
Backdrop 16
Tables, Rows, Columns, and Cells 17
Auto-Sharding 22
Storage API 23
Implementation 24
Summary 27
HBase: The Hadoop Database 28
History 28
Nomenclature 29
Summary 30
2. Installation 31
Quick-Start Guide 31
Requirements 34
Hardware 34
Software 40
Filesystems for HBase 52
Local 54
HDFS 54
S3 54
Other Filesystems 55
Installation Choices 55
Apache Binary Release 55
Building from Source 58
Run Modes 58
Standalone Mode 59
Distributed Mode 59
Configuration 63
hbase-site.xml and hbase-default.xml 64
hbase-env.sh 65
regionserver 65
log4j.properties 65
Example Configuration 65
Client Configuration 67
Deployment 68
Script-Based 68
Apache Whirr 69
Puppet and Chef 70
Operating a Cluster 71
Running and Confirming Your Installation 71
Web-based UI Introduction 71
Shell Introduction 73
Stopping the Cluster 73
3. Client API: The Basics 75
General Notes 75
CRUD Operations 76
Put Method 76
Get Method 95
Delete Method 105
Batch Operations 114
Row Locks 118
Scans 122
Introduction 122
The ResultScanner Class 124
Caching Versus Batching 127
Miscellaneous Features 133
The HTable Utility Methods 133
The Bytes Class 134
4. Client API: Advanced Features 137
Filters 137
Introduction to Filters 137
Comparison Filters 140
Dedicated Filters 147
Decorating Filters 155
FilterList 159
Custom Filters 160
Filters Summary 167
Counters 168
Introduction to Counters 168
Single Counters 171
Multiple Counters 172
Coprocessors 175
Introduction to Coprocessors 175
The Coprocessor Class 176
Coprocessor Loading 179
The RegionObserver Class 182
The MasterObserver Class 190
Endpoints 193
HTablePool 199
Connection Handling 203
5. Client API: Administrative Features 207
Schema Definition 207
Tables 207
Table Properties 210
Column Families 212
HBaseAdmin 218
Basic Operations 219
Table Operations 220
Schema Operations 228
Cluster Operations 230
Cluster Status Information 233
6. Available Clients 241
Introduction to REST, Thrift, and Avro 241
Interactive Clients 244
Native Java 244
REST 244
Thrift 251
Avro 255
Other Clients 256
Batch Clients 257
MapReduce 257
Hive 258
Pig 263
Cascading 267
Shell 268
Basics 269
Commands 271
Scripting 274
Web-based UI 277
Master UI 277
Region Server UI 283
Shared Pages 283
7. MapReduce Integration 289
Framework 289
MapReduce Introduction 289
Classes 290
Supporting Classes 293
MapReduce Locality 293
Table Splits 294
MapReduce over HBase 295
Preparation 295
Data Sink 301
Data Source 306
Data Source and Sink 308
Custom Processing 311
8. Architecture 315
Seek Versus Transfer 315
B+ Trees 315
Log-Structured Merge-Trees 316
Storage 319
Overview 319
Write Path 320
Files 321
HFile Format 329
KeyValue Format 333
Write-Ahead Log 333
Overview 334
HLog Class 335
HLogKey Class 336
WALEdit Class 336
LogSyncer Class 337
LogRoller Class 338
Replay 338
Durability 341
Read Path 342
Region Lookups 345
The Region Life Cycle 348
ZooKeeper 348
Replication 351
Life of a Log Edit 352
Internals 353
9. Advanced Usage 357
Key Design 357
Concepts 357
Tall-Narrow Versus Flat-Wide Tables 359
Partial Key Scans 360
Pagination 362
Time Series Data 363
Time-Ordered Relations 367
Advanced Schemas 369
Secondary Indexes 370
Search Integration 374
Transactions 377
Bloom Filters 377
Versioning 381
Implicit Versioning 381
Custom Versioning 384
10. Cluster Monitoring 387
Introduction 387
The Metrics Framework 388
Contexts, Records, and Metrics 389
Master Metrics 394
Region Server Metrics 394
RPC Metrics 396
JVM Metrics 397
Info Metrics 399
Ganglia 400
Installation 401
Usage 405
JMX 408
JConsole 410
JMX Remote API 413
Nagios 417
11. Performance Tuning 419
Garbage Collection Tuning 419
Memstore-Local Allocation Buffer 422
Compression 424
Available Codecs 424
Verifying Installation 426
Enabling Compression 427
Optimizing Splits and Compactions 429
Managed Splitting 429
Region Hotspotting 430
Presplitting Regions 430
Load Balancing 432
Merging Regions 433
Client API: Best Practices 434
Configuration 436
Load Tests 439
Performance Evaluation 439
YCSB 440
12. Cluster Administration 445
Operational Tasks 445
Node Decommissioning 445
Rolling Restarts 447
Adding Servers 447
Data Tasks 452
Import and Export Tools 452
CopyTable Tool 457
Bulk Import 459
Replication 462
Additional Tasks 464
Coexisting Clusters 464
Required Ports 466
Changing Logging Levels 466
Troubleshooting 467
HBase Fsck 467
Analyzing the Logs 469
Common Issues 471
A. HBase Configuration Properties 475
B. Road Map 489
C. Upgrade from Previous Releases 491
D. Distributions 493
E. Hush SQL Schema 495
F. HBase Versus Bigtable 497
Index 501