PART 1 Foundations 1
Chapter 1 Introduction 1
The Power and Potential of Parallelism 2
Parallelism, a Familiar Concept 2
Parallelism in Computer Programs 3
Multi-Core Computers, an Opportunity 4
Even More Opportunities to Use Parallel Hardware 5
Parallel Computing versus Distributed Computing 6
System Level Parallelism 6
Convenience of Parallel Abstractions 8
Examining Sequential and Parallel Programs 8
Parallelizing Compilers 8
A Paradigm Shift 9
Parallel Prefix Sum 13
Parallelism Using Multiple Instruction Streams 15
The Concept of a Thread 15
A Multithreaded Solution to Counting 3s 15
The Goals: Scalability and Performance Portability 25
Scalability 25
Performance Portability 26
Principles First 27
Chapter Summary 27
Historical Perspective 28
Exercises 28
Chapter 2 Understanding Parallel Computers 30
Balancing Machine Specifics with Portability 30
A Look at Six Parallel Computers 31
Chip Multiprocessors 31
Symmetric Multiprocessor Architectures 34
Heterogeneous Chip Designs 36
Clusters 39
Supercomputers 40
Observations from Our Six Parallel Computers 43
An Abstraction of a Sequential Computer 44
Applying the RAM Model 44
Evaluating the RAM Model 45
The PRAM: A Parallel Computer Model 46
The CTA: A Practical Parallel Computer Model 47
The CTA Model 47
Communication Latency 49
Properties of the CTA 52
Memory Reference Mechanisms 53
Shared Memory 53
One-Sided Communication 54
Message Passing 54
Memory Consistency Models 55
Programming Models 56
A Closer Look at Communication 57
Applying the CTA Model 58
Chapter Summary 59
Historical Perspective 59
Exercises 59
Chapter 3 Reasoning about Performance 61
Motivation and Basic Concepts 61
Parallelism versus Performance 61
Threads and Processes 62
Latency and Throughput 62
Sources of Performance Loss 64
Overhead 64
Non-Parallelizable Code 65
Contention 67
Idle Time 67
Parallel Structure 68
Dependences 68
Dependences Limit Parallelism 70
Granularity 72
Locality 73
Performance Trade-Offs 73
Communication versus Computation 74
Memory versus Parallelism 75
Overhead versus Parallelism 75
Measuring Performance 77
Execution Time 77
Speedup 78
Superlinear Speedup 78
Efficiency 79
Concerns with Speedup 79
Scaled Speedup versus Fixed-Size Speedup 81
Scalable Performance 81
Scalable Performance Is Difficult to Achieve 81
Implications for Hardware 82
Implications for Software 83
Scaling the Problem Size 83
Chapter Summary 84
Historical Perspective 84
Exercises 85
PART 2 Parallel Abstractions 87
Chapter 4 First Steps Toward Parallel Programming 88
Data and Task Parallelism 88
Definitions 88
Illustrating Data and Task Parallelism 89
The Peril-L Notation 89
Extending C 90
Parallel Threads 90
Synchronization and Coordination 91
Memory Model 92
Synchronized Memory 94
Reduce and Scan 95
The Reduce Abstraction 96
Count 3s Example 97
Formulating Parallelism 97
Fixed Parallelism 97
Unlimited Parallelism 98
Scalable Parallelism 99
Alphabetizing Example 100
Unlimited Parallelism 101
Fixed Parallelism 102
Scalable Parallelism 104
Comparing the Three Solutions 109
Chapter Summary 110
Historical Perspective 110
Exercises 110
Chapter 5 Scalable Algorithmic Techniques 112
Blocks of Independent Computation 112
Schwartz' Algorithm 113
The Reduce and Scan Abstractions 115
Example of Generalized Reduces and Scans 116
The Basic Structure 118
Structure for Generalized Reduce 119
Example of Components of a Generalized Scan 122
Applying the Generalized Scan 124
Generalized Vector Operations 125
Assigning Work to Processes Statically 125
Block Allocations 126
Overlap Regions 128
Cyclic and Block Cyclic Allocations 129
Irregular Allocations 132
Assigning Work to Processes Dynamically 134
Work Queues 134
Variations of Work Queues 137
Case Study: Concurrent Memory Allocation 137
Trees 139
Allocation by Sub-Tree 139
Dynamic Allocations 140
Chapter Summary 141
Historical Perspective 142
Exercises 142
PART 3 Parallel Programming Languages 143
Chapter 6 Programming with Threads 145
POSIX Threads 145
Thread Creation and Destruction 146
Mutual Exclusion 150
Synchronization 153
Safety Issues 163
Performance Issues 167
Case Study: Successive Over-Relaxation 174
Case Study: Overlapping Synchronization with Computation 179
Case Study: Streaming Computations on a Multi-Core Chip 187
Java Threads 187
Synchronized Methods 189
Synchronized Statements 189
The Count 3s Example 190
Volatile Memory 192
Atomic Objects 192
Lock Objects 193
Executors 193
Concurrent Collections 193
OpenMP 193
The Count 3s Example 194
Semantic Limitations on parallel for 195
Reduction 196
Thread Behavior and Interaction 197
Sections 199
Summary of OpenMP 199
Chapter Summary 200
Historical Perspective 200
Exercises 200
Chapter 7 MPI and Other Local View Languages 202
MPI: The Message Passing Interface 202
The Count 3s Example 203
Groups and Communicators 211
Point-to-Point Communication 212
Collective Communication 214
Example: Successive Over-Relaxation 219
Performance Issues 222
Safety Issues 228
Partitioned Global Address Space Languages 229
Co-Array Fortran 230
Unified Parallel C 231
Titanium 232
Chapter Summary 233
Historical Perspective 234
Exercises 234
Chapter 8 ZPL and Other Global View Languages 236
The ZPL Programming Language 236
Basic Concepts of ZPL 237
Regions 237
Array Computation 240
Life, an Example 242
The Problem 242
The Solution 242
How It Works 243
The Philosophy of Life 245
Distinguishing Features of ZPL 245
Regions 245
Statement-Level Indexing 245
Restrictions Imposed by Regions 246
Performance Model 246
Addition by Subtraction 247
Manipulating Arrays of Different Ranks 247
Partial Reduce 248
Flooding 249
The Flooding Principle 250
Data Manipulation, an Example 251
Flood Regions 252
Matrix Multiplication 253
Reordering Data with Remap 255
Index Arrays 255
Remap 256
Ordering Example 258
Parallel Execution of ZPL Programs 260
Role of the Compiler 260
Specifying the Number of Processes 261
Assigning Regions to Processes 261
Array Allocation 262
Scalar Allocation 263
Work Assignment 263
Performance Model 264
Applying the Performance Model: Life 265
Applying the Performance Model: SUMMA 266
Summary of the Performance Model 266
NESL Parallel Language 267
Language Concepts 267
Matrix Product Using Nested Parallelism 268
NESL Complexity Model 269
Chapter Summary 269
Historical Perspective 269
Exercises 270
Chapter 9 Assessing the State of the Art 271
Four Important Properties of Parallel Languages 271
Correctness 271
Performance 273
Scalability 274
Portability 274
Evaluating Existing Approaches 275
POSIX Threads 275
Java Threads 276
OpenMP 276
MPI 276
PGAS Languages 277
ZPL 278
NESL 278
Lessons for the Future 279
Hidden Parallelism 279
Transparent Performance 280
Locality 280
Constrained Parallelism 280
Implicit versus Explicit Parallelism 281
Chapter Summary 282
Historical Perspective 282
Exercises 282
PART 4 Looking Forward 283
Chapter 10 Future Directions in Parallel Programming 284
Attached Processors 284
Graphics Processing Units 285
Cell Processors 288
Attached Processors Summary 288
Grid Computing 290
Transactional Memory 291
Comparison with Locks 292
Implementation Issues 293
Open Research Issues 295
MapReduce 296
Problem Space Promotion 298
Emerging Languages 299
Chapel 300
Fortress 300
X10 302
Chapter Summary 304
Historical Perspective 304
Exercises 304
Chapter 11 Writing Parallel Programs 305
Getting Started 305
Access and Software 305
Hello,World 306
Parallel Programming Recommendations 307
Incremental Development 307
Focus on the Parallel Structure 307
Testing the Parallel Structure 308
Sequential Programming 309
Be Willing to Write Extra Code 309
Controlling Parameters during Testing 310
Functional Debugging 310
Capstone Project Ideas 311
Implementing Existing Parallel Algorithms 311
Competing with Standard Benchmarks 312
Developing New Parallel Computations 313
Performance Measurement 314
Comparing against a Sequential Solution 315
Maintaining a Fair Experimental Setting 315
Understanding Parallel Performance 316
Performance Analysis 317
Experimental Methodology 318
Portability and Tuning 319
Chapter Summary 319
Historical Perspective 319
Exercises 320
Glossary 321
References 325
Index 328