Programming Hive (English Edition) PDF Download

  • Authors: Edward Capriolo, Dean Wampler, Jason Rutherglen
  • Publisher: Nanjing: Southeast University Press
  • Year of publication: 2013
  • ISBN: 9787564141974
  • Pages: 332
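Book description: This example-driven guide shows you how to set up and configure Hive in your environment. It also provides an overview of Hadoop and MapReduce and demonstrates how Hive works within the Hadoop ecosystem. You will also find real-world case studies showing how companies that use Hive have solved unique problems at petabyte scale.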

1. Introduction 1

An Overview of Hadoop and MapReduce 3

Hive in the Hadoop Ecosystem 6

Pig 8

HBase 8

Cascading, Crunch, and Others 9

Java Versus Hive: The Word Count Algorithm 11

What's Next 13

2. Getting Started 15

Installing a Preconfigured Virtual Machine 15

Detailed Installation 16

Installing Java 16

Installing Hadoop 18

Local Mode, Pseudodistributed Mode, and Distributed Mode 19

Testing Hadoop 20

Installing Hive 21

What Is Inside Hive? 22

Starting Hive 23

Configuring Your Hadoop Environment 24

Local Mode Configuration 24

Distributed and Pseudodistributed Mode Configuration 26

Metastore Using JDBC 28

The Hive Command 29

Command Options 29

The Command-Line Interface 30

CLI Options 31

Variables and Properties 31

Hive "One Shot" Commands 34

Executing Hive Queries from Files 35

The .hiverc File 36

More on Using the Hive CLI 36

Command History 37

Shell Execution 37

Hadoop dfs Commands from Inside Hive 38

Comments in Hive Scripts 38

Query Column Headers 38

3. Data Types and File Formats 41

Primitive Data Types 41

Collection Data Types 43

Text File Encoding of Data Values 45

Schema on Read 48

4. HiveQL: Data Definition 49

Databases in Hive 49

Alter Database 52

Creating Tables 53

Managed Tables 56

External Tables 56

Partitioned, Managed Tables 58

External Partitioned Tables 61

Customizing Table Storage Formats 63

Dropping Tables 66

Alter Table 66

Renaming a Table 66

Adding, Modifying, and Dropping a Table Partition 66

Changing Columns 67

Adding Columns 68

Deleting or Replacing Columns 68

Alter Table Properties 68

Alter Storage Properties 68

Miscellaneous Alter Table Statements 69

5. HiveQL: Data Manipulation 71

Loading Data into Managed Tables 71

Inserting Data into Tables from Queries 73

Dynamic Partition Inserts 74

Creating Tables and Loading Them in One Query 75

Exporting Data 76

6. HiveQL: Queries 79

SELECT...FROM Clauses 79

Specify Columns with Regular Expressions 81

Computing with Column Values 81

Arithmetic Operators 82

Using Functions 83

LIMIT Clause 91

Column Aliases 91

Nested SELECT Statements 91

CASE...WHEN...THEN Statements 91

When Hive Can Avoid MapReduce 92

WHERE Clauses 92

Predicate Operators 93

Gotchas with Floating-Point Comparisons 94

LIKE and RLIKE 96

GROUP BY Clauses 97

HAVING Clauses 97

JOIN Statements 98

Inner JOIN 98

Join Optimizations 100

LEFT OUTER JOIN 101

OUTER JOIN Gotcha 101

RIGHT OUTER JOIN 103

FULL OUTER JOIN 104

LEFT SEMI-JOIN 104

Cartesian Product JOINs 105

Map-side Joins 105

ORDER BY and SORT BY 107

DISTRIBUTE BY with SORT BY 107

CLUSTER BY 108

Casting 109

Casting BINARY Values 109

Queries that Sample Data 110

Block Sampling 111

Input Pruning for Bucket Tables 111

UNION ALL 112

7. HiveQL: Views 113

Views to Reduce Query Complexity 113

Views that Restrict Data Based on Conditions 114

Views and Map Type for Dynamic Tables 114

View Odds and Ends 115

8. HiveQL: Indexes 117

Creating an Index 117

Bitmap Indexes 118

Rebuilding the Index 118

Showing an Index 119

Dropping an Index 119

Implementing a Custom Index Handler 119

9. Schema Design 121

Table-by-Day 121

Over Partitioning 122

Unique Keys and Normalization 123

Making Multiple Passes over the Same Data 124

The Case for Partitioning Every Table 124

Bucketing Table Data Storage 125

Adding Columns to a Table 127

Using Columnar Tables 128

Repeated Data 128

Many Columns 128

(Almost) Always Use Compression! 128

10. Tuning 131

Using EXPLAIN 131

EXPLAIN EXTENDED 134

Limit Tuning 134

Optimized Joins 135

Local Mode 135

Parallel Execution 136

Strict Mode 137

Tuning the Number of Mappers and Reducers 138

JVM Reuse 139

Indexes 140

Dynamic Partition Tuning 140

Speculative Execution 141

Single MapReduce Multi-GROUP BY 142

Virtual Columns 142

11. Other File Formats and Compression 145

Determining Installed Codecs 145

Choosing a Compression Codec 146

Enabling Intermediate Compression 147

Final Output Compression 148

Sequence Files 148

Compression in Action 149

Archive Partition 152

Compression: Wrapping Up 154

12. Developing 155

Changing Log4J Properties 155

Connecting a Java Debugger to Hive 156

Building Hive from Source 156

Running Hive Test Cases 156

Execution Hooks 158

Setting Up Hive and Eclipse 158

Hive in a Maven Project 158

Unit Testing in Hive with hive_test 159

The New Plugin Developer Kit 161

13. Functions 163

Discovering and Describing Functions 163

Calling Functions 164

Standard Functions 164

Aggregate Functions 164

Table Generating Functions 165

A UDF for Finding a Zodiac Sign from a Day 166

UDF Versus GenericUDF 169

Permanent Functions 171

User-Defined Aggregate Functions 172

Creating a COLLECT UDAF to Emulate GROUP_CONCAT 172

User-Defined Table Generating Functions 177

UDTFs that Produce Multiple Rows 177

UDTFs that Produce a Single Row with Multiple Columns 179

UDTFs that Simulate Complex Types 179

Accessing the Distributed Cache from a UDF 182

Annotations for Use with Functions 184

Deterministic 184

Stateful 184

DistinctLike 185

Macros 185

14. Streaming 187

Identity Transformation 188

Changing Types 188

Projecting Transformation 188

Manipulative Transformations 189

Using the Distributed Cache 189

Producing Multiple Rows from a Single Row 190

Calculating Aggregates with Streaming 191

CLUSTER BY, DISTRIBUTE BY, SORT BY 192

GenericMR Tools for Streaming to Java 194

Calculating Cogroups 196

15. Customizing Hive File and Record Formats 199

File Versus Record Formats 199

Demystifying CREATE TABLE Statements 199

File Formats 201

Sequence File 201

RCFile 202

Example of a Custom Input Format: DualInputFormat 203

Record Formats: SerDes 205

CSV and TSV SerDes 206

ObjectInspector 206

Think Big Hive Reflection ObjectInspector 206

XML UDF 207

XPath-Related Functions 207

JSON SerDe 208

Avro Hive SerDe 209

Defining Avro Schema Using Table Properties 209

Defining a Schema from a URI 210

Evolving Schema 211

Binary Output 211

16. Hive Thrift Service 213

Starting the Thrift Server 213

Setting Up Groovy to Connect to HiveService 214

Connecting to HiveServer 214

Getting Cluster Status 215

Result Set Schema 215

Fetching Results 215

Retrieving Query Plan 216

Metastore Methods 216

Example Table Checker 216

Administrating HiveServer 217

Productionizing HiveService 217

Cleanup 218

Hive ThriftMetastore 219

ThriftMetastore Configuration 219

Client Configuration 219

17. Storage Handlers and NoSQL 221

Storage Handler Background 221

HiveStorageHandler 222

HBase 222

Cassandra 224

Static Column Mapping 224

Transposed Column Mapping for Dynamic Columns 224

Cassandra SerDe Properties 224

DynamoDB 225

18. Security 227

Integration with Hadoop Security 228

Authentication with Hive 228

Authorization in Hive 229

Users, Groups, and Roles 230

Privileges to Grant and Revoke 231

Partition-Level Privileges 233

Automatic Grants 233

19. Locking 235

Locking Support in Hive with Zookeeper 235

Explicit, Exclusive Locks 238

20. Hive Integration with Oozie 239

Oozie Actions 239

Hive Thrift Service Action 240

A Two-Query Workflow 240

Oozie Web Console 242

Variables in Workflows 242

Capturing Output 243

Capturing Output to Variables 243

21. Hive and Amazon Web Services (AWS) 245

Why Elastic MapReduce? 245

Instances 245

Before You Start 246

Managing Your EMR Hive Cluster 246

Thrift Server on EMR Hive 247

Instance Groups on EMR 247

Configuring Your EMR Cluster 248

Deploying hive-site.xml 248

Deploying a .hiverc Script 249

Setting Up a Memory-Intensive Configuration 249

Persistence and the Metastore on EMR 250

HDFS and S3 on EMR Cluster 251

Putting Resources, Configs, and Bootstrap Scripts on S3 252

Logs on S3 252

Spot Instances 252

Security Groups 253

EMR Versus EC2 and Apache Hive 254

Wrapping Up 254

22. HCatalog 255

Introduction 255

MapReduce 256

Reading Data 256

Writing Data 258

Command Line 261

Security Model 261

Architecture 262

23. Case Studies 265

m6d.com (Media6Degrees) 265

Data Science at M6D Using Hive and R 265

M6D UDF Pseudorank 270

M6D Managing Hive Data Across Multiple MapReduce Clusters 274

Outbrain 278

In-Site Referrer Identification 278

Counting Uniques 280

Sessionization 282

NASA's Jet Propulsion Laboratory 287

The Regional Climate Model Evaluation System 287

Our Experience: Why Hive? 290

Some Challenges and How We Overcame Them 291

Photobucket 292

Big Data at Photobucket 292

What Hardware Do We Use for Hive? 293

What's in Hive? 293

Who Does It Support? 293

SimpleReach 294

Experiences and Needs from the Customer Trenches 296

A Karmasphere Perspective 296

Introduction 296

Use Case Examples from the Customer Trenches 297

Glossary 305

Appendix: References 309

Index 313