1. Introduction 1
An Overview of Hadoop and MapReduce 3
Hive in the Hadoop Ecosystem 6
Pig 8
HBase 8
Cascading, Crunch, and Others 9
Java Versus Hive: The Word Count Algorithm 11
What's Next 13
2. Getting Started 15
Installing a Preconfigured Virtual Machine 15
Detailed Installation 16
Installing Java 16
Installing Hadoop 18
Local Mode, Pseudodistributed Mode, and Distributed Mode 19
Testing Hadoop 20
Installing Hive 21
What Is Inside Hive? 22
Starting Hive 23
Configuring Your Hadoop Environment 24
Local Mode Configuration 24
Distributed and Pseudodistributed Mode Configuration 26
Metastore Using JDBC 28
The Hive Command 29
Command Options 29
The Command-Line Interface 30
CLI Options 31
Variables and Properties 31
Hive "One Shot" Commands 34
Executing Hive Queries from Files 35
The .hiverc File 36
More on Using the Hive CLI 36
Command History 37
Shell Execution 37
Hadoop dfs Commands from Inside Hive 38
Comments in Hive Scripts 38
Query Column Headers 38
3. Data Types and File Formats 41
Primitive Data Types 41
Collection Data Types 43
Text File Encoding of Data Values 45
Schema on Read 48
4. HiveQL: Data Definition 49
Databases in Hive 49
Alter Database 52
Creating Tables 53
Managed Tables 56
External Tables 56
Partitioned, Managed Tables 58
External Partitioned Tables 61
Customizing Table Storage Formats 63
Dropping Tables 66
Alter Table 66
Renaming a Table 66
Adding, Modifying, and Dropping a Table Partition 66
Changing Columns 67
Adding Columns 68
Deleting or Replacing Columns 68
Alter Table Properties 68
Alter Storage Properties 68
Miscellaneous Alter Table Statements 69
5. HiveQL: Data Manipulation 71
Loading Data into Managed Tables 71
Inserting Data into Tables from Queries 73
Dynamic Partition Inserts 74
Creating Tables and Loading Them in One Query 75
Exporting Data 76
6. HiveQL: Queries 79
SELECT ... FROM Clauses 79
Specify Columns with Regular Expressions 81
Computing with Column Values 81
Arithmetic Operators 82
Using Functions 83
LIMIT Clause 91
Column Aliases 91
Nested SELECT Statements 91
CASE ... WHEN ... THEN Statements 91
When Hive Can Avoid MapReduce 92
WHERE Clauses 92
Predicate Operators 93
Gotchas with Floating-Point Comparisons 94
LIKE and RLIKE 96
GROUP BY Clauses 97
HAVING Clauses 97
JOIN Statements 98
Inner JOIN 98
Join Optimizations 100
LEFT OUTER JOIN 101
OUTER JOIN Gotcha 101
RIGHT OUTER JOIN 103
FULL OUTER JOIN 104
LEFT SEMI-JOIN 104
Cartesian Product JOINs 105
Map-side Joins 105
ORDER BY and SORT BY 107
DISTRIBUTE BY with SORT BY 107
CLUSTER BY 108
Casting 109
Casting BINARY Values 109
Queries that Sample Data 110
Block Sampling 111
Input Pruning for Bucket Tables 111
UNION ALL 112
7. HiveQL: Views 113
Views to Reduce Query Complexity 113
Views that Restrict Data Based on Conditions 114
Views and Map Type for Dynamic Tables 114
View Odds and Ends 115
8. HiveQL: Indexes 117
Creating an Index 117
Bitmap Indexes 118
Rebuilding the Index 118
Showing an Index 119
Dropping an Index 119
Implementing a Custom Index Handler 119
9. Schema Design 121
Table-by-Day 121
Over Partitioning 122
Unique Keys and Normalization 123
Making Multiple Passes over the Same Data 124
The Case for Partitioning Every Table 124
Bucketing Table Data Storage 125
Adding Columns to a Table 127
Using Columnar Tables 128
Repeated Data 128
Many Columns 128
(Almost) Always Use Compression! 128
10. Tuning 131
Using EXPLAIN 131
EXPLAIN EXTENDED 134
Limit Tuning 134
Optimized Joins 135
Local Mode 135
Parallel Execution 136
Strict Mode 137
Tuning the Number of Mappers and Reducers 138
JVM Reuse 139
Indexes 140
Dynamic Partition Tuning 140
Speculative Execution 141
Single MapReduce Multi-GROUP BY 142
Virtual Columns 142
11. Other File Formats and Compression 145
Determining Installed Codecs 145
Choosing a Compression Codec 146
Enabling Intermediate Compression 147
Final Output Compression 148
Sequence Files 148
Compression in Action 149
Archive Partition 152
Compression: Wrapping Up 154
12. Developing 155
Changing Log4J Properties 155
Connecting a Java Debugger to Hive 156
Building Hive from Source 156
Running Hive Test Cases 156
Execution Hooks 158
Setting Up Hive and Eclipse 158
Hive in a Maven Project 158
Unit Testing in Hive with hive_test 159
The New Plugin Developer Kit 161
13. Functions 163
Discovering and Describing Functions 163
Calling Functions 164
Standard Functions 164
Aggregate Functions 164
Table Generating Functions 165
A UDF for Finding a Zodiac Sign from a Day 166
UDF Versus GenericUDF 169
Permanent Functions 171
User-Defined Aggregate Functions 172
Creating a COLLECT UDAF to Emulate GROUP_CONCAT 172
User-Defined Table Generating Functions 177
UDTFs that Produce Multiple Rows 177
UDTFs that Produce a Single Row with Multiple Columns 179
UDTFs that Simulate Complex Types 179
Accessing the Distributed Cache from a UDF 182
Annotations for Use with Functions 184
Deterministic 184
Stateful 184
DistinctLike 185
Macros 185
14. Streaming 187
Identity Transformation 188
Changing Types 188
Projecting Transformation 188
Manipulative Transformations 189
Using the Distributed Cache 189
Producing Multiple Rows from a Single Row 190
Calculating Aggregates with Streaming 191
CLUSTER BY, DISTRIBUTE BY, SORT BY 192
GenericMR Tools for Streaming to Java 194
Calculating Cogroups 196
15. Customizing Hive File and Record Formats 199
File Versus Record Formats 199
Demystifying CREATE TABLE Statements 199
File Formats 201
Sequence File 201
RCFile 202
Example of a Custom Input Format: DualInputFormat 203
Record Formats: SerDes 205
CSV and TSV SerDes 206
ObjectInspector 206
Think Big Hive Reflection ObjectInspector 206
XML UDF 207
XPath-Related Functions 207
JSON SerDe 208
Avro Hive SerDe 209
Defining Avro Schema Using Table Properties 209
Defining a Schema from a URI 210
Evolving Schema 211
Binary Output 211
16. Hive Thrift Service 213
Starting the Thrift Server 213
Setting Up Groovy to Connect to HiveService 214
Connecting to HiveServer 214
Getting Cluster Status 215
Result Set Schema 215
Fetching Results 215
Retrieving Query Plan 216
Metastore Methods 216
Example Table Checker 216
Administrating HiveServer 217
Productionizing HiveService 217
Cleanup 218
Hive ThriftMetastore 219
ThriftMetastore Configuration 219
Client Configuration 219
17. Storage Handlers and NoSQL 221
Storage Handler Background 221
HiveStorageHandler 222
HBase 222
Cassandra 224
Static Column Mapping 224
Transposed Column Mapping for Dynamic Columns 224
Cassandra SerDe Properties 224
DynamoDB 225
18. Security 227
Integration with Hadoop Security 228
Authentication with Hive 228
Authorization in Hive 229
Users, Groups, and Roles 230
Privileges to Grant and Revoke 231
Partition-Level Privileges 233
Automatic Grants 233
19. Locking 235
Locking Support in Hive with Zookeeper 235
Explicit, Exclusive Locks 238
20. Hive Integration with Oozie 239
Oozie Actions 239
Hive Thrift Service Action 240
A Two-Query Workflow 240
Oozie Web Console 242
Variables in Workflows 242
Capturing Output 243
Capturing Output to Variables 243
21. Hive and Amazon Web Services (AWS) 245
Why Elastic MapReduce? 245
Instances 245
Before You Start 246
Managing Your EMR Hive Cluster 246
Thrift Server on EMR Hive 247
Instance Groups on EMR 247
Configuring Your EMR Cluster 248
Deploying hive-site.xml 248
Deploying a .hiverc Script 249
Setting Up a Memory-Intensive Configuration 249
Persistence and the Metastore on EMR 250
HDFS and S3 on EMR Cluster 251
Putting Resources, Configs, and Bootstrap Scripts on S3 252
Logs on S3 252
Spot Instances 252
Security Groups 253
EMR Versus EC2 and Apache Hive 254
Wrapping Up 254
22. HCatalog 255
Introduction 255
MapReduce 256
Reading Data 256
Writing Data 258
Command Line 261
Security Model 261
Architecture 262
23. Case Studies 265
m6d.com (Media6Degrees) 265
Data Science at M6D Using Hive and R 265
M6D UDF Pseudorank 270
M6D Managing Hive Data Across Multiple MapReduce Clusters 274
Outbrain 278
In-Site Referrer Identification 278
Counting Uniques 280
Sessionization 282
NASA's Jet Propulsion Laboratory 287
The Regional Climate Model Evaluation System 287
Our Experience: Why Hive? 290
Some Challenges and How We Overcame Them 291
Photobucket 292
Big Data at Photobucket 292
What Hardware Do We Use for Hive? 293
What's in Hive? 293
Who Does It Support? 293
SimpleReach 294
Experiences and Needs from the Customer Trenches 296
A Karmasphere Perspective 296
Introduction 296
Use Case Examples from the Customer Trenches 297
Glossary 305
Appendix: References 309
Index 313