1. Introduction to High Performance Spark 1
What Is Spark and Why Performance Matters 1
What You Can Expect to Get from This Book 2
Spark Versions 3
Why Scala? 3
To Be a Spark Expert You Have to Learn a Little Scala Anyway 3
The Spark Scala API Is Easier to Use Than the Java API 4
Scala Is More Performant Than Python 4
Why Not Scala? 4
Learning Scala 5
Conclusion 6
2. How Spark Works 7
How Spark Fits into the Big Data Ecosystem 8
Spark Components 8
Spark Model of Parallel Computing: RDDs 10
Lazy Evaluation 11
In-Memory Persistence and Memory Management 13
Immutability and the RDD Interface 14
Types of RDDs 16
Functions on RDDs: Transformations Versus Actions 17
Wide Versus Narrow Dependencies 17
Spark Job Scheduling 19
Resource Allocation Across Applications 20
The Spark Application 20
The Anatomy of a Spark Job 22
The DAG 22
Jobs 23
Stages 23
Tasks 24
Conclusion 26
3. DataFrames, Datasets, and Spark SQL 27
Getting Started with the SparkSession (or HiveContext or SQLContext) 28
Spark SQL Dependencies 30
Managing Spark Dependencies 31
Avoiding Hive JARs 32
Basics of Schemas 33
DataFrame API 36
Transformations 36
Multi-DataFrame Transformations 48
Plain Old SQL Queries and Interacting with Hive Data 49
Data Representation in DataFrames and Datasets 49
Tungsten 50
Data Loading and Saving Functions 51
DataFrameWriter and DataFrameReader 51
Formats 52
Save Modes 61
Partitions (Discovery and Writing) 62
Datasets 62
Interoperability with RDDs, DataFrames, and Local Collections 63
Compile-Time Strong Typing 64
Easier Functional (RDD “like”) Transformations 65
Relational Transformations 65
Multi-Dataset Relational Transformations 65
Grouped Operations on Datasets 66
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 67
Query Optimizer 69
Logical and Physical Plans 69
Code Generation 70
Large Query Plans and Iterative Algorithms 70
Debugging Spark SQL Queries 71
JDBC/ODBC Server 71
Conclusion 72
4. Joins (SQL and Core) 75
Core Spark Joins 75
Choosing a Join Type 77
Choosing an Execution Plan 78
Spark SQL Joins 81
DataFrame Joins 82
Dataset Joins 85
Conclusion 86
5. Effective Transformations 87
Narrow Versus Wide Transformations 88
Implications for Performance 90
Implications for Fault Tolerance 91
The Special Case of coalesce 92
What Type of RDD Does Your Transformation Return? 92
Minimizing Object Creation 94
Reusing Existing Objects 94
Using Smaller Data Structures 97
Iterator-to-Iterator Transformations with mapPartitions 100
What Is an Iterator-to-Iterator Transformation? 101
Space and Time Advantages 102
An Example 103
Set Operations 106
Reducing Setup Overhead 107
Shared Variables 108
Broadcast Variables 108
Accumulators 109
Reusing RDDs 114
Cases for Reuse 114
Deciding if Recompute Is Inexpensive Enough 117
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 118
Alluxio (nee Tachyon) 122
LRU Caching 123
Noisy Cluster Considerations 124
Interaction with Accumulators 125
Conclusion 126
6. Working with Key/Value Data 127
The Goldilocks Example 129
Goldilocks Version 0: Iterative Solution 130
How to Use PairRDDFunctions and OrderedRDDFunctions 132
Actions on Key/Value Pairs 133
What’s So Dangerous About the groupByKey Function 134
Goldilocks Version 1: groupByKey Solution 134
Choosing an Aggregation Operation 138
Dictionary of Aggregation Operations with Performance Considerations 138
Multiple RDD Operations 141
Co-Grouping 141
Partitioners and Key/Value Data 142
Using the Spark Partitioner Object 144
Hash Partitioning 144
Range Partitioning 144
Custom Partitioning 145
Preserving Partitioning Information Across Transformations 146
Leveraging Co-Located and Co-Partitioned RDDs 146
Dictionary of Mapping and Partitioning Functions PairRDDFunctions 148
Dictionary of OrderedRDDOperations 149
Sorting by Two Keys with SortByKey 151
Secondary Sort and repartitionAndSortWithinPartitions 151
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 152
How Not to Sort by Two Orderings 155
Goldilocks Version 2: Secondary Sort 156
A Different Approach to Goldilocks 159
Goldilocks Version 3: Sort on Cell Values 164
Straggler Detection and Unbalanced Data 165
Back to Goldilocks (Again) 167
Goldilocks Version 4: Reduce to Distinct on Each Partition 167
Conclusion 173
7. Going Beyond Scala 175
Beyond Scala within the JVM 176
Beyond Scala, and Beyond the JVM 180
How PySpark Works 181
How SparkR Works 189
Spark.jl (Julia Spark) 191
How Eclair JS Works 192
Spark on the Common Language Runtime (CLR)—C# and Friends 193
Calling Other Languages from Spark 193
Using Pipe and Friends 193
JNI 195
Java Native Access(JNA) 198
Underneath Everything Is FORTRAN 199
Getting to the GPU 200
The Future 201
Conclusion 201
8. Testing and Validation 203
Unit Testing 203
General Spark Unit Testing 204
Mocking RDDs 208
Getting Test Data 210
Generating Large Datasets 210
Sampling 211
Property Checking with ScalaCheck 213
Computing RDD Difference 213
Integration Testing 216
Choosing Your Integration Testing Environment 216
Verifying Performance 217
Spark Counters for Verifying Performance 217
Projects for Verifying Performance 218
Job Validation 219
Conclusion 220
9. Spark MLlib and ML 221
Choosing Between Spark MLlib and Spark ML 221
Working with MLlib 222
Getting Started with MLlib (Organization and Imports) 222
MLlib Feature Encoding and Data Preparation 223
Feature Scaling and Selection 228
MLlib Model Training 228
Predicting 229
Serving and Persistence 230
Model Evaluation 232
Working with Spark ML 233
Spark ML Organization and Imports 233
Pipeline Stages 234
Explain Params 235
Data Encoding 236
Data Cleaning 239
Spark ML Models 239
Putting It All Together in a Pipeline 240
Training a Pipeline 241
Accessing Individual Stages 241
Data Persistence and Spark ML 242
Extending Spark ML Pipelines with Your Own Algorithms 244
Model and Pipeline Persistence and Serving with Spark ML 252
General Serving Considerations 252
Conclusion 253
10. Spark Components and Packages 255
Stream Processing with Spark 257
Sources and Sinks 257
Batch Intervals 259
Data Checkpoint Intervals 260
Considerations for DStreams 261
Considerations for Structured Streaming 262
High Availability Mode (or Handling Driver Failure or Checkpointing) 270
GraphX 271
Using Community Packages and Libraries 271
Creating a Spark Package 273
Conclusion 274
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist 275
Index 325