Hadoop: The Definitive Guide (English edition, 4th Edition)

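  • Author: Tom White (USA)
  • Publisher: Nanjing: Southeast University Press
  • Year of publication: 2015
  • ISBN: 9787564159177
  • Pages: 730
Book description: With the fourth edition of this comprehensive guide, you will learn how to build and maintain reliable, scalable distributed systems with Apache Hadoop. The book is ideal for programmers who want to analyze datasets of any size and for administrators who want to set up and run Hadoop clusters. In this new edition covering Hadoop 2, author Tom White adds new chapters on YARN and on several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You will learn about the latest changes in Hadoop and explore case studies of Hadoop in healthcare systems and genomics data processing.
Contents of "Hadoop: The Definitive Guide (English edition, 4th Edition)"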

Part Ⅰ.Hadoop Fundamentals 3

1.Meet Hadoop 3

Data! 3

Data Storage and Analysis 5

Querying All Your Data 6

Beyond Batch 7

Comparison with Other Systems 8

Relational Database Management Systems 8

Grid Computing 10

Volunteer Computing 11

A Brief History of Apache Hadoop 12

What's in This Book? 15

2.MapReduce 19

A Weather Dataset 19

Data Format 19

Analyzing the Data with Unix Tools 21

Analyzing the Data with Hadoop 22

Map and Reduce 22

Java MapReduce 24

Scaling Out 30

Data Flow 30

Combiner Functions 34

Running a Distributed MapReduce Job 37

Hadoop Streaming 37

Ruby 37

Python 40

3.The Hadoop Distributed Filesystem 43

The Design of HDFS 43

HDFS Concepts 45

Blocks 45

Namenodes and Datanodes 46

Block Caching 47

HDFS Federation 48

HDFS High Availability 48

The Command-Line Interface 50

Basic Filesystem Operations 51

Hadoop Filesystems 53

Interfaces 54

The Java Interface 56

Reading Data from a Hadoop URL 57

Reading Data Using the FileSystem API 58

Writing Data 61

Directories 63

Querying the Filesystem 63

Deleting Data 68

Data Flow 69

Anatomy of a File Read 69

Anatomy of a File Write 72

Coherency Model 74

Parallel Copying with distcp 76

Keeping an HDFS Cluster Balanced 77

4.YARN 79

Anatomy of a YARN Application Run 80

Resource Requests 81

Application Lifespan 82

Building YARN Applications 82

YARN Compared to MapReduce 1 83

Scheduling in YARN 85

Scheduler Options 86

Capacity Scheduler Configuration 88

Fair Scheduler Configuration 90

Delay Scheduling 94

Dominant Resource Fairness 95

Further Reading 96

5.Hadoop I/O 97

Data Integrity 97

Data Integrity in HDFS 98

LocalFileSystem 99

ChecksumFileSystem 99

Compression 100

Codecs 101

Compression and Input Splits 105

Using Compression in MapReduce 107

Serialization 109

The Writable Interface 110

Writable Classes 113

Implementing a Custom Writable 121

Serialization Frameworks 126

File-Based Data Structures 127

SequenceFile 127

MapFile 135

Other File Formats and Column-Oriented Formats 136

Part Ⅱ.MapReduce 141

6.Developing a MapReduce Application 141

The Configuration API 141

Combining Resources 143

Variable Expansion 143

Setting Up the Development Environment 144

Managing Configuration 146

GenericOptionsParser, Tool, and ToolRunner 148

Writing a Unit Test with MRUnit 152

Mapper 153

Reducer 156

Running Locally on Test Data 156

Running a Job in a Local Job Runner 157

Testing the Driver 158

Running on a Cluster 160

Packaging a Job 160

Launching a Job 162

The MapReduce Web UI 165

Retrieving the Results 167

Debugging a Job 168

Hadoop Logs 172

Remote Debugging 174

Tuning a Job 175

Profiling Tasks 175

MapReduce Workflows 177

Decomposing a Problem into MapReduce Jobs 177

JobControl 178

Apache Oozie 179

7.How MapReduce Works 185

Anatomy of a MapReduce Job Run 185

Job Submission 186

Job Initialization 187

Task Assignment 188

Task Execution 189

Progress and Status Updates 190

Job Completion 192

Failures 193

Task Failure 193

Application Master Failure 194

Node Manager Failure 195

Resource Manager Failure 196

Shuffle and Sort 197

The Map Side 197

The Reduce Side 198

Configuration Tuning 201

Task Execution 203

The Task Execution Environment 203

Speculative Execution 204

Output Committers 206

8.MapReduce Types and Formats 209

MapReduce Types 209

The Default MapReduce Job 214

Input Formats 220

Input Splits and Records 220

Text Input 232

Binary Input 236

Multiple Inputs 237

Database Input (and Output) 238

Output Formats 238

Text Output 239

Binary Output 239

Multiple Outputs 240

Lazy Output 245

Database Output 245

9.MapReduce Features 247

Counters 247

Built-in Counters 247

User-Defined Java Counters 251

User-Defined Streaming Counters 255

Sorting 255

Preparation 256

Partial Sort 257

Total Sort 259

Secondary Sort 262

Joins 268

Map-Side Joins 269

Reduce-Side Joins 270

Side Data Distribution 273

Using the Job Configuration 273

Distributed Cache 274

MapReduce Library Classes 279

Part Ⅲ.Hadoop Operations 283

10.Setting Up a Hadoop Cluster 283

Cluster Specification 284

Cluster Sizing 285

Network Topology 286

Cluster Setup and Installation 288

Installing Java 288

Creating Unix User Accounts 288

Installing Hadoop 289

Configuring SSH 289

Configuring Hadoop 290

Formatting the HDFS Filesystem 290

Starting and Stopping the Daemons 290

Creating User Directories 292

Hadoop Configuration 292

Configuration Management 293

Environment Settings 294

Important Hadoop Daemon Properties 296

Hadoop Daemon Addresses and Ports 304

Other Hadoop Properties 307

Security 309

Kerberos and Hadoop 309

Delegation Tokens 312

Other Security Enhancements 313

Benchmarking a Hadoop Cluster 314

Hadoop Benchmarks 314

User Jobs 316

11.Administering Hadoop 317

HDFS 317

Persistent Data Structures 317

Safe Mode 322

Audit Logging 324

Tools 325

Monitoring 330

Logging 330

Metrics and JMX 331

Maintenance 332

Routine Administration Procedures 332

Commissioning and Decommissioning Nodes 334

Upgrades 337

Part Ⅳ.Related Projects 345

12.Avro 345

Avro Data Types and Schemas 346

In-Memory Serialization and Deserialization 349

The Specific API 351

Avro Datafiles 352

Interoperability 354

Python API 354

Avro Tools 355

Schema Resolution 355

Sort Order 358

Avro MapReduce 359

Sorting Using Avro MapReduce 363

Avro in Other Languages 365

13.Parquet 367

Data Model 368

Nested Encoding 370

Parquet File Format 370

Parquet Configuration 372

Writing and Reading Parquet Files 373

Avro, Protocol Buffers, and Thrift 375

Parquet MapReduce 377

14.Flume 381

Installing Flume 381

An Example 382

Transactions and Reliability 384

Batching 385

The HDFS Sink 385

Partitioning and Interceptors 387

File Formats 387

Fan Out 388

Delivery Guarantees 389

Replicating and Multiplexing Selectors 390

Distribution: Agent Tiers 390

Delivery Guarantees 393

Sink Groups 395

Integrating Flume with Applications 398

Component Catalog 399

Further Reading 400

15.Sqoop 401

Getting Sqoop 401

Sqoop Connectors 403

A Sample Import 404

Text and Binary File Formats 406

Generated Code 407

Additional Serialization Systems 408

Imports: A Deeper Look 408

Controlling the Import 410

Imports and Consistency 411

Incremental Imports 411

Direct-Mode Imports 411

Working with Imported Data 412

Imported Data and Hive 413

Importing Large Objects 415

Performing an Export 417

Exports: A Deeper Look 419

Exports and Transactionality 420

Exports and SequenceFiles 421

Further Reading 422

16.Pig 423

Installing and Running Pig 424

Execution Types 424

Running Pig Programs 426

Grunt 426

Pig Latin Editors 427

An Example 427

Generating Examples 429

Comparison with Databases 430

Pig Latin 432

Structure 432

Statements 433

Expressions 438

Types 439

Schemas 441

Functions 445

Macros 447

User-Defined Functions 448

A Filter UDF 448

An Eval UDF 452

A Load UDF 453

Data Processing Operators 457

Loading and Storing Data 457

Filtering Data 457

Grouping and Joining Data 459

Sorting Data 465

Combining and Splitting Data 466

Pig in Practice 467

Parallelism 467

Anonymous Relations 467

Parameter Substitution 468

Further Reading 469

17.Hive 471

Installing Hive 472

The Hive Shell 473

An Example 474

Running Hive 475

Configuring Hive 475

Hive Services 478

The Metastore 480

Comparison with Traditional Databases 482

Schema on Read Versus Schema on Write 482

Updates, Transactions, and Indexes 483

SQL-on-Hadoop Alternatives 484

HiveQL 485

Data Types 486

Operators and Functions 488

Tables 489

Managed Tables and External Tables 490

Partitions and Buckets 491

Storage Formats 496

Importing Data 500

Altering Tables 502

Dropping Tables 502

Querying Data 503

Sorting and Aggregating 503

MapReduce Scripts 503

Joins 505

Subqueries 508

Views 509

User-Defined Functions 510

Writing a UDF 511

Writing a UDAF 513

Further Reading 518

18.Crunch 519

An Example 520

The Core Crunch API 523

Primitive Operations 523

Types 528

Sources and Targets 531

Functions 533

Materialization 535

Pipeline Execution 538

Running a Pipeline 538

Stopping a Pipeline 539

Inspecting a Crunch Plan 540

Iterative Algorithms 543

Checkpointing a Pipeline 545

Crunch Libraries 545

Further Reading 548

19.Spark 549

Installing Spark 550

An Example 550

Spark Applications, Jobs, Stages, and Tasks 552

A Scala Standalone Application 552

A Java Example 554

A Python Example 555

Resilient Distributed Datasets 556

Creation 556

Transformations and Actions 557

Persistence 560

Serialization 562

Shared Variables 564

Broadcast Variables 564

Accumulators 564

Anatomy of a Spark Job Run 565

Job Submission 565

DAG Construction 566

Task Scheduling 569

Task Execution 570

Executors and Cluster Managers 570

Spark on YARN 571

Further Reading 574

20.HBase 575

HBasics 575

Backdrop 576

Concepts 576

Whirlwind Tour of the Data Model 576

Implementation 578

Installation 581

Test Drive 582

Clients 584

Java 584

MapReduce 587

REST and Thrift 589

Building an Online Query Application 589

Schema Design 590

Loading Data 591

Online Queries 594

HBase Versus RDBMS 597

Successful Service 598

HBase 599

Praxis 600

HDFS 600

UI 601

Metrics 601

Counters 601

Further Reading 601

21.ZooKeeper 603

Installing and Running ZooKeeper 604

An Example 606

Group Membership in ZooKeeper 606

Creating the Group 607

Joining a Group 609

Listing Members in a Group 610

Deleting a Group 612

The ZooKeeper Service 613

Data Model 614

Operations 616

Implementation 620

Consistency 622

Sessions 624

States 625

Building Applications with ZooKeeper 627

A Configuration Service 627

The Resilient ZooKeeper Application 630

A Lock Service 634

More Distributed Data Structures and Protocols 636

ZooKeeper in Production 637

Resilience and Performance 637

Configuration 639

Further Reading 640

Part Ⅴ.Case Studies 643

22.Composable Data at Cerner 643

From CPUs to Semantic Integration 643

Enter Apache Crunch 644

Building a Complete Picture 644

Integrating Healthcare Data 647

Composability over Frameworks 650

Moving Forward 651

23.Biological Data Science: Saving Lives with Software 653

The Structure of DNA 655

The Genetic Code: Turning DNA Letters into Proteins 656

Thinking of DNA as Source Code 657

The Human Genome Project and Reference Genomes 659

Sequencing and Aligning DNA 660

ADAM, A Scalable Genome Analysis Platform 661

Literate programming with the Avro interface description language (IDL) 662

Column-oriented access with Parquet 663

A simple example: k-mer counting using Spark and ADAM 665

From Personalized Ads to Personalized Medicine 667

Join In 668

24.Cascading 669

Fields, Tuples, and Pipes 670

Operations 673

Taps, Schemes, and Flows 675

Cascading in Practice 676

Flexibility 679

Hadoop and Cascading at ShareThis 680

Summary 684

A.Installing Apache Hadoop 685

B.Cloudera's Distribution Including Apache Hadoop 691

C.Preparing the NCDC Weather Data 693

D.The Old and New Java MapReduce APIs 697

Index 701
