1 High-Performance Computing
1.1 Trends in Computer Design
1.2 Traditional Computers and Their Limitations
1.3 Parallelism within a Single Processor
1.3.1 Multiple Functional Units
1.3.2 Pipelining
1.3.3 Overlapping
1.3.4 RISC
1.3.5 VLIW
1.3.6 Vector Instructions
1.3.7 Chaining
1.3.8 Memory-to-Memory and Register-to-Register Organizations
1.3.9 Register Set
1.3.10 Stripmining
1.3.11 Reconfigurable Vector Registers
1.3.12 Memory Organization
1.4 Data Organization
1.4.1 Main Memory
1.4.2 Cache
1.4.3 Local Memory
1.5 Memory Management
1.6 Parallelism through Multiple Pipes or Multiple Processors
1.7 Message Passing
1.8 Virtual Shared Memory
1.8.1 Routing
1.9 Interconnection Topology
1.9.1 Crossbar Switch
1.9.2 Timeshared Bus
1.9.3 Ring Connection
1.9.4 Mesh Connection
1.9.5 Hypercube
1.9.6 Multi-staged Network
1.10 Programming Techniques
1.11 Trends: Network-Based Computing
2 Overview of Current High-Performance Computers
2.1 Supercomputers
2.2 RISC-Based Processors
2.3 Parallel Processors
3 Implementation Details and Overhead
3.1 Parallel Decomposition and Data Dependency Graphs
3.2 Synchronization
3.3 Load Balancing
3.4 Recurrence
3.5 Indirect Addressing
3.6 Message Passing
3.6.1 Performance Prediction
3.6.2 Message-Passing Standards
3.6.3 Routing
4 Performance: Analysis, Modeling, and Measurements
4.1 Amdahl's Law
4.1.1 Simple Case of Amdahl's Law
4.1.2 General Form of Amdahl's Law
4.2 Vector Speed and Vector Length
4.3 Amdahl's Law—Parallel Processing
4.3.1 A Simple Model
4.3.2 Gustafson's Model
4.4 Examples of (r∞, n1/2)-values for Various Computers
4.4.1 CRAY J90 and CRAY T90 (One Processor)
4.4.2 General Observations
4.5 LINPACK Benchmark
4.5.1 Description of the Benchmark
4.5.2 Calls to the BLAS
4.5.3 Asymptotic Performance
5 Building Blocks in Linear Algebra
5.1 Basic Linear Algebra Subprograms
5.1.1 Level 1 BLAS
5.1.2 Level 2 BLAS
5.1.3 Level 3 BLAS
5.2 Levels of Parallelism
5.2.1 Vector Computers
5.2.2 Parallel Processors with Shared Memory
5.2.3 Parallel-Vector Computers
5.2.4 Cluster Computing
5.3 Basic Factorizations of Linear Algebra
5.3.1 Point Algorithm: Gaussian Elimination with Partial Pivoting
5.3.2 Special Matrices
5.4 Blocked Algorithms: Matrix-Vector and Matrix-Matrix Versions
5.4.1 Right-Looking Algorithm
5.4.2 Left-Looking Algorithm
5.4.3 Crout Algorithm
5.4.4 Typical Performance of Blocked LU Decomposition
5.4.5 Blocked Symmetric Indefinite Factorization
5.4.6 Typical Performance of Blocked Symmetric Indefinite Factorization
5.5 Linear Least Squares
5.5.1 Householder Method
5.5.2 Blocked Householder Method
5.5.3 Typical Performance of the Blocked Householder Factorization
5.6 Organization of the Modules
5.6.1 Matrix-Vector Product
5.6.2 Matrix-Matrix Product
5.6.3 Typical Performance for Parallel Processing
5.6.4 Benefits
5.7 LAPACK
5.8 ScaLAPACK
5.8.1 The Basic Linear Algebra Communication Subprograms (BLACS)
5.8.2 PBLAS
5.8.3 ScaLAPACK Sample Code
6 Direct Solution of Sparse Linear Systems
6.1 Introduction to Direct Methods for Sparse Linear Systems
6.1.1 Four Approaches
6.1.2 Description of Sparse Data Structure
6.1.3 Manipulation of Sparse Data Structures
6.2 General Sparse Matrix Methods
6.2.1 Fill-in and Sparsity Ordering
6.2.2 Indirect Addressing—Its Effect and How to Avoid It
6.2.3 Comparison with Dense Codes
6.2.4 Other Approaches
6.3 Methods for Symmetric Matrices and Band Systems
6.3.1 The Clique Concept in Gaussian Elimination
6.3.2 Further Comments on Ordering Schemes
6.4 Frontal Methods
6.4.1 Frontal Methods—Link to Band Methods and Numerical Pivoting
6.4.2 Vector Performance
6.4.3 Parallel Implementation of Frontal Schemes
6.5 Multifrontal Methods
6.5.1 Performance on Vector Machines
6.5.2 Performance on RISC Machines
6.5.3 Performance on Parallel Machines
6.5.4 Exploitation of Structure
6.5.5 Unsymmetric Multifrontal Methods
6.6 Other Approaches for Exploitation of Parallelism
6.7 Software
6.8 Brief Summary
7 Krylov Subspaces: Projection
7.1 Notation
7.2 Basic Iteration Methods: Richardson Iteration, Power Method
7.3 Orthogonal Basis (Arnoldi, Lanczos)
8 Iterative Methods for Linear Systems
8.1 Krylov Subspace Solution Methods: Basic Principles
8.1.1 The Ritz-Galerkin Approach: FOM and CG
8.1.2 The Minimum Residual Approach: GMRES and MINRES
8.1.3 The Petrov-Galerkin Approach: Bi-CG and QMR
8.1.4 The Minimum Error Approach: SYMMLQ and GMERR
8.2 Iterative Methods in More Detail
8.2.1 The CG Method
8.2.2 Parallelism in the CG Method: General Aspects
8.2.3 Parallelism in the CG Method: Communication Overhead
8.2.4 MINRES
8.2.5 Least Squares CG
8.2.6 GMRES and GMRES(m)
8.2.7 GMRES with Variable Preconditioning
8.2.8 Bi-CG and QMR
8.2.9 CGS
8.2.10 Bi-CGSTAB
8.2.11 Bi-CGSTAB(ℓ) and Variants
8.3 Other Issues
8.4 How to Test Iterative Methods
9 Preconditioning and Parallel Preconditioning
9.1 Preconditioning and Parallel Preconditioning
9.2 The Purpose of Preconditioning
9.3 Incomplete LU Decompositions
9.3.1 Efficient Implementations of ILU(0) Preconditioning
9.3.2 General Incomplete Decompositions
9.3.3 Variants of ILU Preconditioners
9.3.4 Some General Comments on ILU
9.4 Some Other Forms of Preconditioning
9.4.1 Sparse Approximate Inverse (SPAI)
9.4.2 Polynomial Preconditioning
9.4.3 Preconditioning by Blocks or Domains
9.4.4 Element by Element Preconditioners
9.5 Vector and Parallel Implementation of Preconditioners
9.5.1 Partial Vectorization
9.5.2 Reordering the Unknowns
9.5.3 Changing the Order of Computation
9.5.4 Some Other Vectorizable Preconditioners
9.5.5 Parallel Aspects of Reorderings
9.5.6 Experiences with Parallelism
10 Linear Eigenvalue Problems Ax=λx
10.1 Theoretical Background and Notation
10.2 Single-Vector Methods
10.3 The QR Algorithm
10.4 Subspace Projection Methods
10.5 The Arnoldi Factorization
10.6 Restarting the Arnoldi Process
10.6.1 Explicit Restarting
10.7 Implicit Restarting
10.8 Lanczos' Method
10.9 Harmonic Ritz Values and Vectors
10.10 Other Subspace Iteration Methods
10.11 Davidson's Method
10.12 The Jacobi-Davidson Iteration Method
10.12.1 JDQR
10.13 Eigenvalue Software: ARPACK, P_ARPACK
10.13.1 Reverse Communication Interface
10.13.2 Parallelizing ARPACK
10.13.3 Data Distribution of the Arnoldi Factorization
10.14 Message Passing
10.15 Parallel Performance
10.16 Availability
10.17 Summary
11 The Generalized Eigenproblem
11.1 Arnoldi/Lanczos with Shift-Invert
11.2 Alternatives to Arnoldi/Lanczos with Shift-Invert
11.3 The Jacobi-Davidson QZ Algorithm
11.4 The Jacobi-Davidson QZ Method: Restart and Deflation
11.5 Parallel Aspects
A Acquiring Mathematical Software
A.1 netlib
A.1.1 Mathematical Software
A.2 Mathematical Software Libraries
B Glossary
C Level 1, 2, and 3 BLAS Quick Reference
D Operation Counts for Various BLAS and Decompositions
Bibliography
Index