《设计数据密集型应用》PDF下载

  • 购买积分:17 如何计算积分?
  • 作  者:Martin Kleppmann著
  • 出 版 社:南京:东南大学出版社
  • 出版年份:2017
  • ISBN:9787564173852
  • 页数:594 页
图书介绍:今天,数据是系统设计的众多挑战中最核心的部分。我们需要解决许多难题,例如可伸缩性、一致性、可靠性、效率以及可维护性。此外,工具的选择纷繁复杂,包括关系数据库、NoSQL 数据库、流式处理器或批处理器以及消息中间件。对于应用程序来说,哪个才是正确的选择?如何才能搞清楚所有这些时髦词?在这本务实且全面的指导之作中,作者 Martin Kleppmann 会带你领略这一领域的多样性,他会分析各种数据处理工具和数据存储工具的优缺点。软件在不断变化,不过基本的原则没有变。通过本书,软件工程师和架构师会学到如何在实际中应用这些原则,如何在现代应用程序中充分使用数据。

Part Ⅰ.Foundations of Data Systems 3

1.Reliable,Scalable,and Maintainable Applications 3

Thinking About Data Systems 4

Reliability 6

Hardware Faults 7

Software Errors 8

Human Errors 9

How Important Is Reliability? 10

Scalability 10

Describing Load 11

Describing Performance 13

Approaches for Coping with Load 17

Maintainability 18

Operability:Making Life Easy for Operations 19

Simplicity:Managing Complexity 20

Evolvability:Making Change Easy 21

Summary 22

2.Data Models and Query Languages 27

Relational Model Versus Document Model 28

The Birth of NoSQL 29

The Object-Relational Mismatch 29

Many-to-One and Many-to-Many Relationships 33

Are Document Databases Repeating History? 36

Relational Versus Document Databases Today 38

Query Languages for Data 42

Declarative Queries on the Web 44

MapReduce Querying 46

Graph-Like Data Models 49

Property Graphs 50

The Cypher Query Language 52

Graph Queries in SQL 53

Triple-Stores and SPARQL 55

The Foundation:Datalog 60

Summary 63

3.Storage and Retrieval 69

Data Structures That Power Your Database 70

Hash Indexes 72

SSTables and LSM-Trees 76

B-Trees 79

Comparing B-Trees and LSM-Trees 83

Other Indexing Structures 85

Transaction Processing or Analytics? 90

Data Warehousing 91

Stars and Snowflakes:Schemas for Analytics 93

Column-Oriented Storage 95

Column Compression 97

Sort Order in Column Storage 99

Writing to Column-Oriented Storage 101

Aggregation:Data Cubes and Materialized Views 101

Summary 103

4.Encoding and Evolution 111

Formats for Encoding Data 112

Language-Specific Formats 113

JSON,XML,and Binary Variants 114

Thrift and Protocol Buffers 117

Avro 122

The Merits of Schemas 127

Modes of Dataflow 128

Dataflow Through Databases 129

Dataflow Through Services:REST and RPC 131

Message-Passing Dataflow 136

Summary 139

Part Ⅱ.Distributed Data 151

5.Replication 151

Leaders and Followers 152

Synchronous Versus Asynchronous Replication 153

Setting Up New Followers 155

Handling Node Outages 156

Implementation of Replication Logs 158

Problems with Replication Lag 161

Reading Your Own Writes 162

Monotonic Reads 164

Consistent Prefix Reads 165

Solutions for Replication Lag 167

Multi-Leader Replication 168

Use Cases for Multi-Leader Replication 168

Handling Write Conflicts 171

Multi-Leader Replication Topologies 175

Leaderless Replication 177

Writing to the Database When a Node Is Down 177

Limitations of Quorum Consistency 181

Sloppy Quorums and Hinted Handoff 183

Detecting Concurrent Writes 184

Summary 192

6.Partitioning 199

Partitioning and Replication 200

Partitioning of Key-Value Data 201

Partitioning by Key Range 202

Partitioning by Hash of Key 203

Skewed Workloads and Relieving Hot Spots 205

Partitioning and Secondary Indexes 206

Partitioning Secondary Indexes by Document 206

Partitioning Secondary Indexes by Term 208

Rebalancing Partitions 209

Strategies for Rebalancing 210

Operations:Automatic or Manual Rebalancing 213

Request Routing 214

Parallel Query Execution 216

Summary 216

7.Transactions 221

The Slippery Concept of a Transaction 222

The Meaning of ACID 223

Single-Object and Multi-Object Operations 228

Weak Isolation Levels 233

Read Committed 234

Snapshot Isolation and Repeatable Read 237

Preventing Lost Updates 242

Write Skew and Phantoms 246

Serializability 251

Actual Serial Execution 252

Two-Phase Locking(2PL) 257

Serializable Snapshot Isolation(SSI) 261

Summary 266

8.The Trouble with Distributed Systems 273

Faults and Partial Failures 274

Cloud Computing and Supercomputing 275

Unreliable Networks 277

Network Faults in Practice 279

Detecting Faults 280

Timeouts and Unbounded Delays 281

Synchronous Versus Asynchronous Networks 284

Unreliable Clocks 287

Monotonic Versus Time-of-Day Clocks 288

Clock Synchronization and Accuracy 289

Relving on Synchronized Clocks 291

Process Pauses 295

Knowledge,Truth,and Lies 300

The Truth Is Defined by the Majority 300

Byzantine Faults 304

System Model and Reality 306

Summary 310

9.Consistency and Consensus 321

Consistency Guarantees 322

Linearizability 324

What Makes a System Linearizable? 325

Relying on Linearizabillty 330

Implementing Linearizable Systems 332

The Cost of Linearizability 335

Ordering Guarantees 339

Ordering and Causality 339

Sequence Number Ordering 343

Total Order Broadcast 348

Distributed Transactions and Consensus 352

Atomic Commit and Two-Phase Commit(2PC) 354

Distributed Transactions in Practice 360

Fault-Tolerant Consensus 364

Membership and Coordination Services 370

Summary 373

Part Ⅲ.Derived Data 389

10.Batch Processing 389

Batch Processing with Unix Tools 391

Simple Log Analysis 391

The Unix Philosophy 394

MapReduce and Distributed Filesystems 397

MapReduce Job Execution 399

Reduce-Side Joins and Grouping 403

Map-Side Joins 408

The Output of Batch Workflows 411

Comparing Hadoop to Distributed Databases 414

Beyond MapReduce 419

Materialization of Intermediate State 419

Graphs and Iterative Processing 424

High-Level APIs and Languages 426

Summary 429

11.Stream Processing 439

Transmitting Event Streams 440

Messaging Systems 441

Partitioned Logs 446

Databases and Streams 451

Keeping Systems in Sync 452

Change Data Capture 454

Event Sourcing 457

State,Streams,and Immutability 459

Processing Streams 464

Uses of Stream Processing 465

Reasoning About Time 468

Stream Joins 472

Fault Tolerance 476

Summary 479

12.The Future of Data Systems 489

Data Integration 490

Combining Specialized Tools by Deriving Data 490

Batch and Stream Processing 494

Unbundling Databases 499

Composing Data Storage Technologies 499

Designing Applications Around Dataflow 504

Observing Derived State 509

Aiming for Correctness 515

The End-to-End Argument for Databases 516

Enforcing Constraints 521

Timeliness and Integrity 524

Trust,but Verify 528

Doing the Right Thing 533

Predictive Analytics 533

Privacy and Tracking 536

Summary 543

Glossary 553

Index 559