High Performance Spark (English Reprint Edition) PDF Download

  • Purchase points: 12
  • Author: Holden Karau, Rachel Warren
  • Publisher: Southeast University Press, Nanjing
  • Publication year: 2018
  • ISBN: 9787564175184
  • Pages: 344
Book description: Apache Spark is delightfully easy to learn and use. But if you have not yet seen the performance gains you expected, or still don't feel confident enough to run Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren show performance-tuning techniques that make Spark queries run faster and handle larger data while using fewer resources. The book describes techniques for reducing data-infrastructure costs and development time, and is suitable for software engineers, data engineers, developers, and system administrators. You will not only gain a comprehensive understanding of Spark, you will also learn how to make it run at its best.

1. Introduction to High Performance Spark 1

What Is Spark and Why Performance Matters 1

What You Can Expect to Get from This Book 2

Spark Versions 3

Why Scala? 3

To Be a Spark Expert You Have to Learn a Little Scala Anyway 3

The Spark Scala API Is Easier to Use Than the Java API 4

Scala Is More Performant Than Python 4

Why Not Scala? 4

Learning Scala 5

Conclusion 6

2. How Spark Works 7

How Spark Fits into the Big Data Ecosystem 8

Spark Components 8

Spark Model of Parallel Computing: RDDs 10

Lazy Evaluation 11

In-Memory Persistence and Memory Management 13

Immutability and the RDD Interface 14

Types of RDDs 16

Functions on RDDs: Transformations Versus Actions 17

Wide Versus Narrow Dependencies 17

Spark Job Scheduling 19

Resource Allocation Across Applications 20

The Spark Application 20

The Anatomy of a Spark Job 22

The DAG 22

Jobs 23

Stages 23

Tasks 24

Conclusion 26

3. DataFrames, Datasets, and Spark SQL 27

Getting Started with the SparkSession (or HiveContext or SQLContext) 28

Spark SQL Dependencies 30

Managing Spark Dependencies 31

Avoiding Hive JARs 32

Basics of Schemas 33

DataFrame API 36

Transformations 36

Multi-DataFrame Transformations 48

Plain Old SQL Queries and Interacting with Hive Data 49

Data Representation in DataFrames and Datasets 49

Tungsten 50

Data Loading and Saving Functions 51

DataFrameWriter and DataFrameReader 51

Formats 52

Save Modes 61

Partitions (Discovery and Writing) 62

Datasets 62

Interoperability with RDDs, DataFrames, and Local Collections 63

Compile-Time Strong Typing 64

Easier Functional (RDD “like”) Transformations 65

Relational Transformations 65

Multi-Dataset Relational Transformations 65

Grouped Operations on Datasets 66

Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 67

Query Optimizer 69

Logical and Physical Plans 69

Code Generation 70

Large Query Plans and Iterative Algorithms 70

Debugging Spark SQL Queries 71

JDBC/ODBC Server 71

Conclusion 72

4. Joins (SQL and Core) 75

Core Spark Joins 75

Choosing a Join Type 77

Choosing an Execution Plan 78

Spark SQL Joins 81

DataFrame Joins 82

Dataset Joins 85

Conclusion 86

5. Effective Transformations 87

Narrow Versus Wide Transformations 88

Implications for Performance 90

Implications for Fault Tolerance 91

The Special Case of coalesce 92

What Type of RDD Does Your Transformation Return? 92

Minimizing Object Creation 94

Reusing Existing Objects 94

Using Smaller Data Structures 97

Iterator-to-Iterator Transformations with mapPartitions 100

What Is an Iterator-to-Iterator Transformation? 101

Space and Time Advantages 102

An Example 103

Set Operations 106

Reducing Setup Overhead 107

Shared Variables 108

Broadcast Variables 108

Accumulators 109

Reusing RDDs 114

Cases for Reuse 114

Deciding if Recompute Is Inexpensive Enough 117

Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 118

Alluxio (nee Tachyon) 122

LRU Caching 123

Noisy Cluster Considerations 124

Interaction with Accumulators 125

Conclusion 126

6. Working with Key/Value Data 127

The Goldilocks Example 129

Goldilocks Version 0: Iterative Solution 130

How to Use PairRDDFunctions and OrderedRDDFunctions 132

Actions on Key/Value Pairs 133

What’s So Dangerous About the groupByKey Function 134

Goldilocks Version 1: groupByKey Solution 134

Choosing an Aggregation Operation 138

Dictionary of Aggregation Operations with Performance Considerations 138

Multiple RDD Operations 141

Co-Grouping 141

Partitioners and Key/Value Data 142

Using the Spark Partitioner Object 144

Hash Partitioning 144

Range Partitioning 144

Custom Partitioning 145

Preserving Partitioning Information Across Transformations 146

Leveraging Co-Located and Co-Partitioned RDDs 146

Dictionary of Mapping and Partitioning Functions PairRDDFunctions 148

Dictionary of OrderedRDDOperations 149

Sorting by Two Keys with SortByKey 151

Secondary Sort and repartitionAndSortWithinPartitions 151

Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 152

How Not to Sort by Two Orderings 155

Goldilocks Version 2: Secondary Sort 156

A Different Approach to Goldilocks 159

Goldilocks Version 3: Sort on Cell Values 164

Straggler Detection and Unbalanced Data 165

Back to Goldilocks (Again) 167

Goldilocks Version 4: Reduce to Distinct on Each Partition 167

Conclusion 173

7. Going Beyond Scala 175

Beyond Scala within the JVM 176

Beyond Scala, and Beyond the JVM 180

How PySpark Works 181

How SparkR Works 189

Spark.jl (Julia Spark) 191

How Eclair JS Works 192

Spark on the Common Language Runtime (CLR)—C# and Friends 193

Calling Other Languages from Spark 193

Using Pipe and Friends 193

JNI 195

Java Native Access (JNA) 198

Underneath Everything Is FORTRAN 199

Getting to the GPU 200

The Future 201

Conclusion 201

8. Testing and Validation 203

Unit Testing 203

General Spark Unit Testing 204

Mocking RDDs 208

Getting Test Data 210

Generating Large Datasets 210

Sampling 211

Property Checking with ScalaCheck 213

Computing RDD Difference 213

Integration Testing 216

Choosing Your Integration Testing Environment 216

Verifying Performance 217

Spark Counters for Verifying Performance 217

Projects for Verifying Performance 218

Job Validation 219

Conclusion 220

9. Spark MLlib and ML 221

Choosing Between Spark MLlib and Spark ML 221

Working with MLlib 222

Getting Started with MLlib (Organization and Imports) 222

MLlib Feature Encoding and Data Preparation 223

Feature Scaling and Selection 228

MLlib Model Training 228

Predicting 229

Serving and Persistence 230

Model Evaluation 232

Working with Spark ML 233

Spark ML Organization and Imports 233

Pipeline Stages 234

Explain Params 235

Data Encoding 236

Data Cleaning 239

Spark ML Models 239

Putting It All Together in a Pipeline 240

Training a Pipeline 241

Accessing Individual Stages 241

Data Persistence and Spark ML 242

Extending Spark ML Pipelines with Your Own Algorithms 244

Model and Pipeline Persistence and Serving with Spark ML 252

General Serving Considerations 252

Conclusion 253

10. Spark Components and Packages 255

Stream Processing with Spark 257

Sources and Sinks 257

Batch Intervals 259

Data Checkpoint Intervals 260

Considerations for DStreams 261

Considerations for Structured Streaming 262

High Availability Mode (or Handling Driver Failure or Checkpointing) 270

GraphX 271

Using Community Packages and Libraries 271

Creating a Spark Package 273

Conclusion 274

A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist 275

Index 325