当前位置:首页 > 工业技术
大规模并行处理器程序设计  第2版  英文版
大规模并行处理器程序设计  第2版  英文版

大规模并行处理器程序设计 第2版 英文版PDF电子书下载

工业技术

  • 电子书积分:15 积分如何计算积分?
  • 作 者:(美)柯克,(美)胡文美著
  • 出 版 社:北京:机械工业出版社
  • 出版年份:2013
  • ISBN:9787111416296
  • 页数:496 页
图书介绍:本书介绍了学生和专业人员都适合的并行编程与GPU体系结构的基本概念,详细剖析了编写并行程序所需的各种技术,用案例研究说明了并行程序设计的整个开发过程,即从计算机思想开始,直到最终实现高效可行的并行程序。
《大规模并行处理器程序设计 第2版 英文版》目录

CHAPTER 11 Introduction 1

1.1 Heterogeneous Parallel Computing 2

1.2 Architecture of a Modern GPU 8

1.3 Why More Speed or Parallelism? 10

1.4 Speeding Up Real Applications 12

1.5 Parallel Programming Languages and Models 14

1.6 Overarching Goals 16

1.7 Organization of the Book 17

References 21

CHAPTER 2 History of GPU Computing 23

2.1 Evolution of Graphics Pipelines 23

The Era of Fixed-Function Graphics Pipelines 24

Evolution of Programmable Real-Time Graphics 28

Unified Graphics and Computing Processors 31

2.2 GPGPU:An Intermediate Step 33

2.3 GPU Computing 34

Scalable GPUs 35

Recent Developments 36

Future Trends 37

References and Further Reading 37

CHAPTER 3 Introduction to Data Parallelism and CUDA C 41

3.1 Data Parallelism 42

3.2 CUDA Program Structure 43

3.3 A Vector Addition Kernel 45

3.4 Device Global Memory and Data Transfer 48

3.5 Kernel Functions and Threading 53

3.6 Summary 58

Function Declarations 59

Kernel Launch 59

Predefined Variables 59

Runtime API 60

3.7 Exercises 60

References 62

CHAPTER 4 Data-Parallel Execution Model 63

4.1 Cuda Thread Organization 64

4.2 Mapping Threads to Multidimensional Data 68

4.3 Matrix-Matrix Multiplication—A More Complex Kernel 74

4.4 Synchronization and Transparent Scalabilitv 81

4.5 Assigning Resources to Blocks 83

4.6 Querying Device Properties 85

4.7 Thread Scheduling and Latency Tolerance 87

4.8 Summary 91

4.9 Exercises 91

CHAPTER 5 CUDA Memories 95

5.1 Importance of Memory Access Efficiency 96

5.2 CUDA Device Memory Types 97

5.3 A Strategy for Reducing Global Memory Traffic 105

5.4 A Tiled Matrix—Matrix Multiplication Kernel 109

5.5 Memory as a Limiting Factor to Parallelism 115

5.6 Summary 118

5.7 Exercises 119

CHAPTER 6 Performance Considerations 123

6.1 Warps and Thread Execution 124

6.2 Global Memory Bandwidth 132

6.3 Dynamic Partitioning of Execution Resources 141

6.4 Instruction Mix and Thread Granularity 143

6.5 Summary 145

6.6 Exercises 145

References 149

CHAPTER 7 Floating-Point Considerations 151

7.1 Floating-Point Format 152

Normalized Representation of M 152

Excess Encoding of E 153

7.2 Representable Numbers 155

7.3 Special Bit Patterns and Precision in IEEE Format 160

7.4 Arithmetic Accuracy and Rounding 161

7.5 Algorithm Considerations 162

7.6 Numerical Stability 164

7.7 Summary 169

7.8 Exercises 170

References 171

CHAPTER 8 Parallel Patterns:Convolution 173

8.1 Background 174

8.2 1D Parallel Convolution—A Basic Algorithm 179

8.3 Constant Memory and Caching 181

8.4 Tiled 1D Convolution with Halo Elements 185

8.5 A Simpler Tiled 1D Convolution—General Caching 192

8.6 Summary 193

8.7 Exercises 194

CHAPTER 9 Parallel Patterns:Prefix Sum 197

9.1 Background 198

9.2 A Simple Parallel Scan 200

9.3 Work Efficiency Considerations 204

9.4 A Work-Efficient Parallel Scan 205

9.5 Parallel Scan for Arbitrary-Length Inputs 210

9.6 Summary 214

9.7 Exercises 215

Reference 216

CHAPTER 10 Parallel Patterns:Sparse Matrix—Vector Multiplication 217

10.1 Background 218

10.2 Parallel SpMV Using CSR 222

10.3 Padding and Transposition 224

10.4 Using Hybrid to Control Padding 226

10.5 Sorting and Partitioning for Regularization 230

10.6 Summary 232

10.7 Exercises 233

References 234

CHAPTER 11 Application Case Study:Advanced MRI Reconstruction 235

11.1 Application Background 236

11.2 Iterative Reconstruction 239

11.3 Computing FHD 241

Step 1:Determine the Kernel Parallelism Structure 243

Step 2:Getting Around the Memory Bandwidth Limitation 249

Step 3:Using Hardware Trigonometry Functions 255

Step 4:Experimental Performance Tuning 259

11.4 Final Evaluation 260

11.5 Exercises 262

References 264

CHAPTER 12 Application Case Study:Molecular Visualization and Analysis 265

12.1 Application Background 266

12.2 A Simple Kernel Implementation 268

12.3 Thread Granularity Adiustment 272

12.4 Memory Coalescing 274

12.5 Summary 277

12.6 Exercises 279

References 279

CHAPTER 13 Parallel Programming and Computational Thinking 281

13.1 Goals of Parallel Computing 282

13.2 Problem Decomposition 283

13.3 Algorithm Selection 287

13.4 Computational Thinking 293

13.5 Summary 294

13.6 Exercises 294

References 295

CHAPTER 14 An Introduction to OpenCLTM 297

14.1 Background 297

14.2 Data Parallelism Model 299

14.3 Device Architecture 301

14.4 Kernel Functions 303

14.5 Device Management and Kernel Launch 304

14.6 Electrostatic Potential Map in OpenCL 307

14.7 Summary 311

14.8 Exercises 312

References 313

CHAPTER 15 Parallel Programming with OpenACC 315

15.1 OpenACC Versus CUDA C 315

15.2 Execution Model 318

15.3 Memory Model 319

15.4 Basic OpenACC Programs 320

Parallel Construct 320

Loop Construct 322

Kernels Construct 327

Data Management 331

Asynchronous Computation and Data Transfer 335

15.5 Future Directions of OpenACC 336

15.6 Exercises 337

CHAPTER 16 Thrust:A Productivity-Oriented Library for CUDA 339

16.1 Background 339

16.2 Motivation 342

16.3 Basic Thrust Features 343

Iterators and Memory Space 344

Interoperability 345

16.4 Generic Programming 347

16.5 Benefits of Abstraction 349

16.6 Programmer Productivity 349

Robustness 350

Real-World Performance 350

16.7 Best Practices 352

Fusion 353

Structure of Arrays 354

Implicit Ranges 356

16.8 Exercises 357

References 358

CHAPTER 17 CUDA FORTRAN 359

17.1 CUDA FORTRAN and CUDA C Differences 360

17.2 A First CUDA FORTR AN Program 361

17.3 Multidimensional Array in CUDA FORTRAN 363

17.4 Overloading Host/Device Routines With Generic Interfaces 364

17.5 Calling CUDA C Via Iso_C_Binding 367

17.6 Kernel Loop Directives and Reduction Operations 369

17.7 Dynamic Shared Memory 370

17.8 Asynchronous Data Transfers 371

17.9 Compilation and Profiling 377

17.1 0 Calling Thrust from CUDA FORTR AN 378

17.1 1 Exercises 382

CHAPTER 18 An Introduction to C++AMP 383

18.1 Core C++Amp Features 384

18.2 Details of the C++AMP Execution Model 391

Explicit and Implicit Data Copies 391

Asynchronous Operation 393

Section Summary 395

18.3 Managing Accelerators 395

18.4 Tiled Execution 398

18.5 C++AMP Graphics Features 401

18.6 Summary 405

18.7 Exercises 405

CHAPTER 19 Programming a Heterogeneous Computing Cluster 407

19.1 Background 408

19.2 A Running Example 408

19.3 MPI Basics 410

19.4 MPI Point-to-Point Communication Types 414

19.5 Overlapping Computation and Communication 421

19.6 MPI Collective Communication 431

19.7 Summary 431

19.8 Exercises 432

Reference 433

CHAPTER 20 CUDA Dynamic Parallelism 435

20.1 Background 436

20.2 Dynamic Parallelism Overview 438

20.3 Important Details 439

Launch Environment Configuration 439

API Errors and Launch Failures 439

Events 439

Streams 440

Synchronization Scope 441

20.4 Memory Visibility 442

Global Memory 442

Zero-Copy Memory 442

Constant Memory 442

Texture Memory 443

20.5 A Simple Example 444

20.6 Runtime Limitations 446

Memory Footprint 446

Nesting Depth 448

Memory Allocation and Lifetime 448

ECC Errors 449

Streams 449

Events 449

Launch Pool 449

20.7 A More Complex Example 449

Linear Bezier Curves 450

Quadratic Bezier Curves 450

Bezier Curve Calculation(Predynamic Parallelism) 450

Bezier Curve Calculation(with Dynamic Parallelism) 453

20.8 Summary 456

Reference 457

CHAPTER 21 Conclusion and Future Outlook 459

21.1 Goals Revisited 459

21.2 Memory Model Evolution 461

21.3 Kernel Execution Control Evolution 464

21.4 Core Performance 467

21.5 Programming Environment 467

21.6 Future Outlook 468

References 469

Appendix A:Matrix Multiplication Host-Only Version Source Code 471

Appendix B:GPU Compute Capabilities 481

Index 487

相关图书
作者其它书籍
返回顶部