CHAPTER 1 An Introduction to Embedded Processing 1
1.1 What Is Embedded Computing? 3
1.1.1 Attributes of Embedded Devices 4
1.1.2 Embedded Is Growing 5
1.2 Distinguishing Between Embedded and General-Purpose Computing 6
1.2.1 The "Run One Program Only" Phenomenon 8
1.2.2 Backward and Binary Compatibility 9
1.2.3 Physical Limits in the Embedded Domain 10
1.3 Characterizing Embedded Computing 11
1.3.1 Categorization by Type of Processing Engine 12
Digital Signal Processors 13
Network Processors 16
1.3.2 Categorization by Application Area 17
The Image Processing and Consumer Market 18
The Communications Market 20
The Automotive Market 22
1.3.3 Categorization by Workload Differences 22
1.4 Embedded Market Structure 23
1.4.1 The Market for Embedded Processor Cores 24
1.4.2 Business Model of Embedded Processors 25
1.4.3 Costs and Product Volume 26
1.4.4 Software and the Embedded Software Market 28
1.4.5 Industry Standards 28
1.4.6 Product Life Cycle 30
1.4.7 The Transition to SoC Design 31
Effects of SoC on the Business Model 34
Centers of Embedded Design 35
1.4.8 The Future of Embedded Systems 36
Connectivity: Always-on Infrastructure 36
State: Personal Storage 36
Administration 37
Security 37
The Next Generation 37
1.5 Further Reading 38
1.6 Exercises 40
CHAPTER 2 An Overview of VLIW and ILP 45
2.1 Semantics and Parallelism 46
2.1.1 Baseline: Sequential Program Semantics 46
2.1.2 Pipelined Execution, Overlapped Execution, and Multiple Execution Units 47
2.1.3 Dependence and Program Rearrangement 51
2.1.4 ILP and Other Forms of Parallelism 52
2.2 Design Philosophies 54
2.2.1 An Illustration of Design Philosophies: RISC Versus CISC 56
2.2.2 First Definition of VLIW 57
2.2.3 A Design Philosophy: VLIW 59
VLIW Versus Superscalar 59
VLIW Versus DSP 62
2.3 Role of the Compiler 63
2.3.1 The Phases of a High-Performance Compiler 63
2.3.2 Compiling for ILP and VLIW 65
2.4 VLIW in the Embedded and DSP Domains 69
2.5 Historical Perspective and Further Reading 71
2.5.1 ILP Hardware in the 1960s and 1970s 71
Early Supercomputer Arithmetic Units 71
Attached Signal Processors 72
Horizontal Microcode 72
2.5.2 The Development of ILP Code Generation in the 1980s 73
Acyclic Microcode Compaction Techniques 73
Cyclic Techniques: Software Pipelining 75
2.5.3 VLIW Development in the 1980s 76
2.5.4 ILP in the 1990s and 2000s 77
2.6 Exercises 78
CHAPTER 3 An Overview of ISA Design 83
3.1 Overview: What to Hide 84
3.1.1 Architectural State: Memory and Registers 84
3.1.2 Pipelining and Operational Latency 85
3.1.3 Multiple Issue and Hazards 86
Exposing Dependence and Independence 86
Structural Hazards 87
Resource Hazards 89
3.1.4 Exception and Interrupt Handling 89
3.1.5 Discussion 90
3.2 Basic VLIW Design Principles 91
3.2.1 Implications for Compilers and Implementations 92
3.2.2 Execution Model Subtleties 93
3.3 Designing a VLIW ISA for Embedded Systems 95
3.3.1 Application Domain 96
3.3.2 ILP Style 98
3.3.3 Hardware/Software Tradeoffs 100
3.4 Instruction-set Encoding 101
3.4.1 A Larger Definition of Architecture 101
3.4.2 Encoding and Architectural Style 105
RISC Encodings 107
CISC Encodings 108
VLIW Encodings 109
Why Not Superscalar Encodings? 109
DSP Encodings 110
Vector Encodings 111
3.5 VLIW Encoding 112
3.5.1 Operation Encoding 113
3.5.2 Instruction Encoding 113
Fixed-overhead Encoding 115
Distributed Encoding 115
Template-based Encoding 116
3.5.3 Dispatching and Opcode Subspaces 117
3.6 Encoding and Instruction-set Extensions 119
3.7 Further Reading 121
3.8 Exercises 121
CHAPTER 4 Architectural Structures in ISA Design 125
4.1 The Datapath 127
4.1.1 Location of Operands and Results 127
4.1.2 Datapath Width 127
4.1.3 Operation Repertoire 129
Simple Integer and Compare Operations 131
Carry, Overflow, and Other Flags 131
Common Bitwise Utilities 132
Integer Multiplication 132
Fixed-point Multiplication 133
Integer Division 135
Floating-point Operations 136
Saturated Arithmetic 137
4.1.4 Micro-SIMD Operations 139
Alignment Issues 141
Precision Issues 141
Dealing with Control Flow 142
Pack, Unpack, and Mix 143
Reductions 143
4.1.5 Constants 144
4.2 Registers and Clusters 144
4.2.1 Clustering 145
Architecturally Invisible Clustering 147
Architecturally Visible Clustering 147
4.2.2 Heterogeneous Register Files 149
4.2.3 Address and Data Registers 149
4.2.4 Special Register File Features 150
Indexed Register Files 150
Rotating Register Files 151
4.3 Memory Architecture 151
4.3.1 Addressing Modes 152
4.3.2 Access Sizes 153
4.3.3 Alignment Issues 153
4.3.4 Caches and Local Memories 154
Prefetching 154
Local Memories and Lockable Caches 156
4.3.5 Exotic Addressing Modes for Embedded Processing 156
4.4 Branch Architecture 156
4.4.1 Unbundling Branches 158
Two-step Branching 159
Three-step Branching 159
4.4.2 Multiway Branches 160
4.4.3 Multicluster Branches 161
4.4.4 Branches and Loops 162
4.5 Speculation and Predication 163
4.5.1 Speculation 163
Control Speculation 164
Data Speculation 167
4.5.2 Predication 168
Full Predication 169
Partial Predication 170
Cost and Benefits of Predication 171
Predication in the Embedded Domain 172
4.6 System Operations 173
4.7 Further Reading 174
4.8 Exercises 175
CHAPTER 5 Microarchitecture Design 179
5.1 Register File Design 182
5.1.1 Register File Structure 182
5.1.2 Register Files, Technology, and Clustering 183
5.1.3 Separate Address and Data Register Files 184
5.1.4 Special Registers and Register File Features 186
5.2 Pipeline Design 186
5.2.1 Balancing a Pipeline 187
5.3 VLIW Fetch, Sequencing, and Decoding 191
5.3.1 Instruction Fetch 191
5.3.2 Alignment and Instruction Length 192
5.3.3 Decoding and Dispersal 194
5.3.4 Decoding and ISA Extensions 195
5.4 The Datapath 195
5.4.1 Execution Units 197
5.4.2 Bypassing and Forwarding Logic 200
5.4.3 Exposing Latencies 202
5.4.4 Predication and Selects 204
5.5 Memory Architecture 206
5.5.1 Local Memory and Caches 206
5.5.2 Byte Manipulation 209
5.5.3 Addressing, Protection, and Virtual Memory 210
5.5.4 Memories in Multiprocessor Systems 211
5.5.5 Memory Speculation 213
5.6 The Control Unit 214
5.6.1 Branch Architecture 214
5.6.2 Predication and Selects 215
5.6.3 Interrupts and Exceptions 216
5.6.4 Exceptions and Pipelining 218
Drain and Flush Pipeline Models 218
Early Commit 219
Delayed Commit 220
5.7 Control Registers 221
5.8 Power Considerations 221
5.8.1 Energy Efficiency and ILP 222
System-level Power Considerations 224
5.9 Further Reading 225
5.10 Exercises 227
CHAPTER 6 System Design and Simulation 231
6.1 System-on-a-Chip (SoC) 231
6.1.1 IP Blocks and Design Reuse 232
A Concrete SoC Example 233
Virtual Components and the VSIA Alliance 235
6.1.2 Design Flows 236
Creation Flow 236
Verification Flow 238
6.1.3 SoC Buses 239
Data Widths 240
Masters, Slaves, and Arbiters 241
Bus Transactions 242
Test Modes 244
6.2 Processor Cores and SoC 245
6.2.1 Nonprogrammable Accelerators 246
Reconfigurable Logic 248
6.2.2 Multiprocessing on a Chip 250
Symmetric Multiprocessing 250
Heterogeneous Multiprocessing 251
Example: A Multicore Platform for Mobile Multimedia 252
6.3 Overview of Simulation 254
6.3.1 Using Simulators 256
6.4 Simulating a VLIW Architecture 257
6.4.1 Interpretation 258
6.4.2 Compiled Simulation 259
Memory 262
Registers 263
Control Flow 263
Exceptions 266
Analysis of Compiled Simulation 267
Performance Measurement and Compiled Simulation 268
6.4.3 Dynamic Binary Translation 268
6.4.4 Trace-driven Simulation 270
6.5 System Simulation 271
6.5.1 I/O and Concurrent Activities 272
6.5.2 Hardware Simulation 272
Discrete Event Simulation 274
6.5.3 Accelerating Simulation 275
In-Circuit Emulation 275
Hardware Accelerators for Simulation 276
6.6 Validation and Verification 276
6.6.1 Co-simulation 278
6.6.2 Simulation, Verification, and Test 279
Formal Verification 280
Design for Testability 280
Debugging Support for SoC 281
6.7 Further Reading 282
6.8 Exercises 284
CHAPTER 7 Embedded Compiling and Toolchains 287
7.1 What Is Important in an ILP Compiler? 287
7.2 Embedded Cross-Development Toolchains 290
7.2.1 Compiler 291
7.2.2 Assembler 292
7.2.3 Libraries 294
7.2.4 Linker 296
7.2.5 Post-link Optimizer 297
7.2.6 Run-time Program Loader 297
7.2.7 Simulator 299
7.2.8 Debuggers and Monitor ROMs 300
7.2.9 Automated Test Systems 301
7.2.10 Profiling Tools 302
7.2.11 Binary Utilities 302
7.3 Structure of an ILP Compiler 302
7.3.1 Front End 304
7.3.2 Machine-independent Optimizer 304
7.3.3 Back End: Machine-specific Optimizations 306
7.4 Code Layout 306
7.4.1 Code Layout Techniques 306
DAG-based Placement 308
The "Pettis-Hansen" Technique 310
Procedure Inlining 310
Cache Line Coloring 311
Temporal-order Placement 311
7.5 Embedded-Specific Tradeoffs for Compilers 311
7.5.1 Space, Time, and Energy Tradeoffs 312
7.5.2 Power-specific Optimizations 315
Fundamentals of Power Dissipation 316
Power-aware Software Techniques 317
7.6 DSP-Specific Compiler Optimizations 320
7.6.1 Compiler-visible Features of DSPs 322
Heterogeneous Registers 322
Addressing Modes 322
Limited Connectivity 323
Local Memories 323
Harvard Architecture 324
7.6.2 Instruction Selection and Scheduling 325
7.6.3 Address Computation and Offset Assignment 327
7.6.4 Local Memories 327
7.6.5 Register Assignment Techniques 328
7.6.6 Retargetable DSP and ASIP Compilers 329
7.7 Further Reading 332
7.8 Exercises 333
CHAPTER 8 Compiling for VLIWs and ILP 337
8.1 Profiling 338
8.1.1 Types of Profiles 338
8.1.2 Profile Collection 341
8.1.3 Synthetic Profiles (Heuristics in Lieu of Profiles) 341
8.1.4 Profile Bookkeeping and Methodology 342
8.1.5 Profiles and Embedded Applications 342
8.2 Scheduling 343
8.2.1 Acyclic Region Types and Shapes 345
Basic Blocks 345
Traces 345
Superblocks 345
Hyperblocks 347
Treegions 347
Percolation Scheduling 348
8.2.2 Region Formation 350
Region Selection 351
Enlargement Techniques 353
Phase-ordering Considerations 356
8.2.3 Schedule Construction 357
Analyzing Programs for Schedule Construction 359
Compaction Techniques 362
Compensation Code 365
Another View of Scheduling Problems 367
8.2.4 Resource Management During Scheduling 368
Resource Vectors 368
Finite-state Automata 369
8.2.5 Loop Scheduling 371
Modulo Scheduling 373
8.2.6 Clustering 380
8.3 Register Allocation 382
8.3.1 Phase-ordering Issues 383
Register Allocation and Scheduling 383
8.4 Speculation and Predication 385
8.4.1 Control and Data Speculation 385
8.4.2 Predicated Execution 386
8.4.3 Prefetching 389
8.4.4 Data Layout Methods 390
8.4.5 Static and Hybrid Branch Prediction 390
8.5 Instruction Selection 390
8.6 Further Reading 391
8.7 Exercises 395
CHAPTER 9 The Run-time System 399
9.1 Exceptions, Interrupts, and Traps 400
9.1.1 Exception Handling 400
9.2 Application Binary Interface Considerations 402
9.2.1 Loading Programs 404
9.2.2 Data Layout 406
9.2.3 Accessing Global Data 407
9.2.4 Calling Conventions 409
Registers 409
Call Instructions 409
Call Sites 410
Function Prologues and Epilogues 412
9.2.5 Advanced ABI Topics 412
Variable-length Argument Lists 412
Dynamic Stack Allocation 413
Garbage Collection 414
Linguistic Exceptions 414
9.3 Code Compression 415
9.3.1 Motivations 416
9.3.2 Compression and Information Theory 417
9.3.3 Architectural Compression Options 417
Decompression on Fetch 420
Decompression on Refill 420
Load-time Decompression 420
9.3.4 Compression Methods 420
Hand-tuned ISAs 421
Ad Hoc Compression Schemes 421
RAM Decompression 422
Dictionary-based Software Compression 422
Cache-based Compression 422
Quantifying Compression Benefits 424
9.4 Embedded Operating Systems 427
9.4.1 "Traditional" OS Issues Revisited 427
9.4.2 Real-time Systems 428
Real-time Scheduling 429
9.4.3 Multiple Flows of Control 431
Threads, Processes, and Microkernels 432
9.4.4 Market Considerations 433
Embedded Linux 435
9.4.5 Downloadable Code and Virtual Machines 436
9.5 Multiprocessing and Multithreading 438
9.5.1 Multiprocessing in the Embedded World 438
9.5.2 Multiprocessing and VLIW 439
9.6 Further Reading 440
9.7 Exercises 441
CHAPTER 10 Application Design and Customization 443
10.1 Programming Language Choices 443
10.1.1 Overview of Embedded Programming Languages 444
10.1.2 Traditional C and ANSI C 445
10.1.3 C++ and Embedded C++ 447
Embedded C++ 449
10.1.4 Matlab 450
10.1.5 Embedded Java 452
The Allure of Embedded Java 452
Embedded Java: The Dark Side 455
10.1.6 C Extensions for Digital Signal Processing 456
Restricted Pointers 456
Fixed-point Data Types 459
Circular Arrays 461
Matrix Referencing and Operators 462
10.1.7 Pragmas, Intrinsics, and Inline Assembly Language Code 462
Compiler Pragmas and Type Annotations 462
Assembler Inserts and Intrinsics 463
10.2 Performance, Benchmarking, and Tuning 465
10.2.1 Importance and Methodology 465
10.2.2 Tuning an Application for Performance 466
Profiling 466
Performance Tuning and Compilers 467
Developing for ILP Targets 468
10.2.3 Benchmarking 473
10.3 Scalability and Customizability 475
10.3.1 Scalability and Architecture Families 476
10.3.2 Exploration and Scalability 477
10.3.3 Customization 478
Customized Implementations 479
10.3.4 Reconfigurable Hardware 480
Using Programmable Logic 480
10.3.5 Customizable Processors and Tools 481
Describing Processors 481
10.3.6 Tools for Customization 483
Customizable Compilers 485
10.3.7 Architecture Exploration 487
Dealing with the Complexity 488
Other Barriers to Customization 488
Wrapping Up 489
10.4 Further Reading 489
10.5 Exercises 490
CHAPTER 11 Application Areas 493
11.1 Digital Printing and Imaging 493
11.1.1 Photo Printing Pipeline 495
JPEG Decompression 495
Scaling 496
Color Space Conversion 497
Dithering 499
11.1.2 Implementation and Performance 501
Summary 505
11.2 Telecom Applications 505
11.2.1 Voice Coding 506
Waveform Codecs 506
Vocoders 507
Hybrid Coders 508
11.2.2 Multiplexing 509
11.2.3 The GSM Enhanced Full-rate Codec 510
Implementation and Performance 510
11.3 Other Application Areas 514
11.3.1 Digital Video 515
MPEG-1 and MPEG-2 516
MPEG-4 518
11.3.2 Automotive 518
Fail-safety and Fault Tolerance 519
Engine Control Units 520
In-vehicle Networking 520
11.3.3 Hard Disk Drives 522
Motor Control 524
Data Decoding 525
Disk Scheduling and On-disk Management Tasks 526
Disk Scheduling and Off-disk Management Tasks 527
11.3.4 Networking and Network Processors 528
Network Processors 531
11.4 Further Reading 535
11.5 Exercises 537
APPENDIX A The VEX System 539
A.1 The VEX Instruction-set Architecture 540
A.1.1 VEX Assembly Language Notation 541
A.1.2 Clusters 542
A.1.3 Execution Model 544
A.1.4 Architecture State 545
A.1.5 Arithmetic and Logic Operations 545
Examples 547
A.1.6 Intercluster Communication 549
A.1.7 Memory Operations 550
A.1.8 Control Operations 552
Examples 553
A.1.9 Structure of the Default VEX Cluster 554
Register Files and Immediates 555
A.1.10 VEX Semantics 556
A.2 The VEX Run-time Architecture 558
A.2.1 Data Allocation and Layout 559
A.2.2 Register Usage 560
A.2.3 Stack Layout and Procedure Linkage 560
Procedure Linkage 563
A.3 The VEX C Compiler 566
A.3.1 Command Line Options 568
Output Files 569
Preprocessing 570
Optimization 570
Profiling 572
Language Definition 573
Libraries 574
Passing Options to Compile Phases 574
Terminal Output and Process Control 575
Other Options 575
A.3.2 Compiler Pragmas 576
Unrolling and Profiling 576
Assertions 578
Memory Disambiguation 578
Cache Control 581
A.3.3 Inline Expansion 583
Multiflow-style Inlining 583
C99-style Inlining 584
A.3.4 Machine Model Parameters 585
A.3.5 Custom Instructions 586
A.4 Visualization Tools 588
A.5 The VEX Simulation System 589
A.5.1 gprof Support 591
A.5.2 Simulating Custom Instructions 594
A.5.3 Simulating the Memory Hierarchy 595
A.6 Customizing the VEX Toolchain 596
A.6.1 Clusters 596
A.6.2 Machine Model Resources 597
A.6.3 Memory Hierarchy Parameters 599
A.7 Examples of Tool Usage 599
A.7.1 Compile and Run 599
A.7.2 Profiling 602
A.7.3 Custom Architectures 603
A.8 Exercises 605
APPENDIX B Glossary 607
APPENDIX C Bibliography 631
Index 661