Hadoop: The Definitive Guide, English Edition, 4th Edition (PDF e-book download)
- Author: Tom White (USA)
- Publisher: Nanjing: Southeast University Press
- Year of publication: 2015
- ISBN: 9787564159177
- Pages: 730
Part Ⅰ.Hadoop Fundamentals 3
1.Meet Hadoop 3
Data! 3
Data Storage and Analysis 5
Querying All Your Data 6
Beyond Batch 7
Comparison with Other Systems 8
Relational Database Management Systems 8
Grid Computing 10
Volunteer Computing 11
A Brief History of Apache Hadoop 12
What's in This Book? 15
2.MapReduce 19
A Weather Dataset 19
Data Format 19
Analyzing the Data with Unix Tools 21
Analyzing the Data with Hadoop 22
Map and Reduce 22
Java MapReduce 24
Scaling Out 30
Data Flow 30
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40
3.The Hadoop Distributed Filesystem 43
The Design of HDFS 43
HDFS Concepts 45
Blocks 45
Namenodes and Datanodes 46
Block Caching 47
HDFS Federation 48
HDFS High Availability 48
The Command-Line Interface 50
Basic Filesystem Operations 51
Hadoop Filesystems 53
Interfaces 54
The Java Interface 56
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 58
Writing Data 61
Directories 63
Querying the Filesystem 63
Deleting Data 68
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 74
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 77
4.YARN 79
Anatomy of a YARN Application Run 80
Resource Requests 81
Application Lifespan 82
Building YARN Applications 82
YARN Compared to MapReduce 1 83
Scheduling in YARN 85
Scheduler Options 86
Capacity Scheduler Configuration 88
Fair Scheduler Configuration 90
Delay Scheduling 94
Dominant Resource Fairness 95
Further Reading 96
5.Hadoop I/O 97
Data Integrity 97
Data Integrity in HDFS 98
LocalFileSystem 99
ChecksumFileSystem 99
Compression 100
Codecs 101
Compression and Input Splits 105
Using Compression in MapReduce 107
Serialization 109
The Writable Interface 110
Writable Classes 113
Implementing a Custom Writable 121
Serialization Frameworks 126
File-Based Data Structures 127
SequenceFile 127
MapFile 135
Other File Formats and Column-Oriented Formats 136
Part Ⅱ.MapReduce 141
6.Developing a MapReduce Application 141
The Configuration API 141
Combining Resources 143
Variable Expansion 143
Setting Up the Development Environment 144
Managing Configuration 146
GenericOptionsParser, Tool, and ToolRunner 148
Writing a Unit Test with MRUnit 152
Mapper 153
Reducer 156
Running Locally on Test Data 156
Running a Job in a Local Job Runner 157
Testing the Driver 158
Running on a Cluster 160
Packaging a Job 160
Launching a Job 162
The MapReduce Web UI 165
Retrieving the Results 167
Debugging a Job 168
Hadoop Logs 172
Remote Debugging 174
Tuning a Job 175
Profiling Tasks 175
MapReduce Workflows 177
Decomposing a Problem into MapReduce Jobs 177
JobControl 178
Apache Oozie 179
7.How MapReduce Works 185
Anatomy of a MapReduce Job Run 185
Job Submission 186
Job Initialization 187
Task Assignment 188
Task Execution 189
Progress and Status Updates 190
Job Completion 192
Failures 193
Task Failure 193
Application Master Failure 194
Node Manager Failure 195
Resource Manager Failure 196
Shuffle and Sort 197
The Map Side 197
The Reduce Side 198
Configuration Tuning 201
Task Execution 203
The Task Execution Environment 203
Speculative Execution 204
Output Committers 206
8.MapReduce Types and Formats 209
MapReduce Types 209
The Default MapReduce Job 214
Input Formats 220
Input Splits and Records 220
Text Input 232
Binary Input 236
Multiple Inputs 237
Database Input (and Output) 238
Output Formats 238
Text Output 239
Binary Output 239
Multiple Outputs 240
Lazy Output 245
Database Output 245
9.MapReduce Features 247
Counters 247
Built-in Counters 247
User-Defined Java Counters 251
User-Defined Streaming Counters 255
Sorting 255
Preparation 256
Partial Sort 257
Total Sort 259
Secondary Sort 262
Joins 268
Map-Side Joins 269
Reduce-Side Joins 270
Side Data Distribution 273
Using the Job Configuration 273
Distributed Cache 274
MapReduce Library Classes 279
Part Ⅲ.Hadoop Operations 283
10.Setting Up a Hadoop Cluster 283
Cluster Specification 284
Cluster Sizing 285
Network Topology 286
Cluster Setup and Installation 288
Installing Java 288
Creating Unix User Accounts 288
Installing Hadoop 289
Configuring SSH 289
Configuring Hadoop 290
Formatting the HDFS Filesystem 290
Starting and Stopping the Daemons 290
Creating User Directories 292
Hadoop Configuration 292
Configuration Management 293
Environment Settings 294
Important Hadoop Daemon Properties 296
Hadoop Daemon Addresses and Ports 304
Other Hadoop Properties 307
Security 309
Kerberos and Hadoop 309
Delegation Tokens 312
Other Security Enhancements 313
Benchmarking a Hadoop Cluster 314
Hadoop Benchmarks 314
User Jobs 316
11.Administering Hadoop 317
HDFS 317
Persistent Data Structures 317
Safe Mode 322
Audit Logging 324
Tools 325
Monitoring 330
Logging 330
Metrics and JMX 331
Maintenance 332
Routine Administration Procedures 332
Commissioning and Decommissioning Nodes 334
Upgrades 337
Part Ⅳ.Related Projects 345
12.Avro 345
Avro Data Types and Schemas 346
In-Memory Serialization and Deserialization 349
The Specific API 351
Avro Datafiles 352
Interoperability 354
Python API 354
Avro Tools 355
Schema Resolution 355
Sort Order 358
Avro MapReduce 359
Sorting Using Avro MapReduce 363
Avro in Other Languages 365
13.Parquet 367
Data Model 368
Nested Encoding 370
Parquet File Format 370
Parquet Configuration 372
Writing and Reading Parquet Files 373
Avro, Protocol Buffers, and Thrift 375
Parquet MapReduce 377
14.Flume 381
Installing Flume 381
An Example 382
Transactions and Reliability 384
Batching 385
The HDFS Sink 385
Partitioning and Interceptors 387
File Formats 387
Fan Out 388
Delivery Guarantees 389
Replicating and Multiplexing Selectors 390
Distribution: Agent Tiers 390
Delivery Guarantees 393
Sink Groups 395
Integrating Flume with Applications 398
Component Catalog 399
Further Reading 400
15.Sqoop 401
Getting Sqoop 401
Sqoop Connectors 403
A Sample Import 404
Text and Binary File Formats 406
Generated Code 407
Additional Serialization Systems 408
Imports: A Deeper Look 408
Controlling the Import 410
Imports and Consistency 411
Incremental Imports 411
Direct-Mode Imports 411
Working with Imported Data 412
Imported Data and Hive 413
Importing Large Objects 415
Performing an Export 417
Exports: A Deeper Look 419
Exports and Transactionality 420
Exports and SequenceFiles 421
Further Reading 422
16.Pig 423
Installing and Running Pig 424
Execution Types 424
Running Pig Programs 426
Grunt 426
Pig Latin Editors 427
An Example 427
Generating Examples 429
Comparison with Databases 430
Pig Latin 432
Structure 432
Statements 433
Expressions 438
Types 439
Schemas 441
Functions 445
Macros 447
User-Defined Functions 448
A Filter UDF 448
An Eval UDF 452
A Load UDF 453
Data Processing Operators 457
Loading and Storing Data 457
Filtering Data 457
Grouping and Joining Data 459
Sorting Data 465
Combining and Splitting Data 466
Pig in Practice 467
Parallelism 467
Anonymous Relations 467
Parameter Substitution 468
Further Reading 469
17.Hive 471
Installing Hive 472
The Hive Shell 473
An Example 474
Running Hive 475
Configuring Hive 475
Hive Services 478
The Metastore 480
Comparison with Traditional Databases 482
Schema on Read Versus Schema on Write 482
Updates, Transactions, and Indexes 483
SQL-on-Hadoop Alternatives 484
HiveQL 485
Data Types 486
Operators and Functions 488
Tables 489
Managed Tables and External Tables 490
Partitions and Buckets 491
Storage Formats 496
Importing Data 500
Altering Tables 502
Dropping Tables 502
Querying Data 503
Sorting and Aggregating 503
MapReduce Scripts 503
Joins 505
Subqueries 508
Views 509
User-Defined Functions 510
Writing a UDF 511
Writing a UDAF 513
Further Reading 518
18.Crunch 519
An Example 520
The Core Crunch API 523
Primitive Operations 523
Types 528
Sources and Targets 531
Functions 533
Materialization 535
Pipeline Execution 538
Running a Pipeline 538
Stopping a Pipeline 539
Inspecting a Crunch Plan 540
Iterative Algorithms 543
Checkpointing a Pipeline 545
Crunch Libraries 545
Further Reading 548
19.Spark 549
Installing Spark 550
An Example 550
Spark Applications, Jobs, Stages, and Tasks 552
A Scala Standalone Application 552
A Java Example 554
A Python Example 555
Resilient Distributed Datasets 556
Creation 556
Transformations and Actions 557
Persistence 560
Serialization 562
Shared Variables 564
Broadcast Variables 564
Accumulators 564
Anatomy of a Spark Job Run 565
Job Submission 565
DAG Construction 566
Task Scheduling 569
Task Execution 570
Executors and Cluster Managers 570
Spark on YARN 571
Further Reading 574
20.HBase 575
HBasics 575
Backdrop 576
Concepts 576
Whirlwind Tour of the Data Model 576
Implementation 578
Installation 581
Test Drive 582
Clients 584
Java 584
MapReduce 587
REST and Thrift 589
Building an Online Query Application 589
Schema Design 590
Loading Data 591
Online Queries 594
HBase Versus RDBMS 597
Successful Service 598
HBase 599
Praxis 600
HDFS 600
UI 601
Metrics 601
Counters 601
Further Reading 601
21.ZooKeeper 603
Installing and Running ZooKeeper 604
An Example 606
Group Membership in ZooKeeper 606
Creating the Group 607
Joining a Group 609
Listing Members in a Group 610
Deleting a Group 612
The ZooKeeper Service 613
Data Model 614
Operations 616
Implementation 620
Consistency 622
Sessions 624
States 625
Building Applications with ZooKeeper 627
A Configuration Service 627
The Resilient ZooKeeper Application 630
A Lock Service 634
More Distributed Data Structures and Protocols 636
ZooKeeper in Production 637
Resilience and Performance 637
Configuration 639
Further Reading 640
Part Ⅴ.Case Studies 643
22.Composable Data at Cerner 643
From CPUs to Semantic Integration 643
Enter Apache Crunch 644
Building a Complete Picture 644
Integrating Healthcare Data 647
Composability over Frameworks 650
Moving Forward 651
23.Biological Data Science: Saving Lives with Software 653
The Structure of DNA 655
The Genetic Code: Turning DNA Letters into Proteins 656
Thinking of DNA as Source Code 657
The Human Genome Project and Reference Genomes 659
Sequencing and Aligning DNA 660
ADAM, A Scalable Genome Analysis Platform 661
Literate programming with the Avro interface description language (IDL) 662
Column-oriented access with Parquet 663
A simple example: k-mer counting using Spark and ADAM 665
From Personalized Ads to Personalized Medicine 667
Join In 668
24.Cascading 669
Fields, Tuples, and Pipes 670
Operations 673
Taps, Schemes, and Flows 675
Cascading in Practice 676
Flexibility 679
Hadoop and Cascading at ShareThis 680
Summary 684
A.Installing Apache Hadoop 685
B.Cloudera's Distribution Including Apache Hadoop 691
C.Preparing the NCDC Weather Data 693
D.The Old and New Java MapReduce APIs 697
Index 701