Site Reliability Engineering (网站运维工程, reprint edition)
- Editors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
- Publisher: Nanjing: Southeast University Press
- Year of publication: 2018
- ISBN: 9787564172961
- Pages: 528
Part Ⅰ.Introduction 3
1.Introduction 3
The Sysadmin Approach to Service Management 3
Google’s Approach to Service Management: Site Reliability Engineering 5
Tenets of SRE 7
The End of the Beginning 12
2.The Production Environment at Google, from the Viewpoint of an SRE 13
Hardware 13
System Software That “Organizes” the Hardware 15
Other System Software 18
Our Software Infrastructure 19
Our Development Environment 19
Shakespeare: A Sample Service 20
Part Ⅱ.Principles 25
3.Embracing Risk 25
Managing Risk 25
Measuring Service Risk 26
Risk Tolerance of Services 28
Motivation for Error Budgets 33
4.Service Level Objectives 37
Service Level Terminology 37
Indicators in Practice 40
Objectives in Practice 43
Agreements in Practice 47
5.Eliminating Toil 49
Toil Defined 49
Why Less Toil Is Better 51
What Qualifies as Engineering? 52
Is Toil Always Bad? 52
Conclusion 54
6.Monitoring Distributed Systems 55
Definitions 55
Why Monitor? 56
Setting Reasonable Expectations for Monitoring 57
Symptoms Versus Causes 58
Black-Box Versus White-Box 59
The Four Golden Signals 60
Worrying About Your Tail (or, Instrumentation and Performance) 61
Choosing an Appropriate Resolution for Measurements 62
As Simple as Possible, No Simpler 62
Tying These Principles Together 63
Monitoring for the Long Term 64
Conclusion 66
7.The Evolution of Automation at Google 67
The Value of Automation 67
The Value for Google SRE 70
The Use Cases for Automation 70
Automate Yourself Out of a Job: Automate ALL the Things! 73
Soothing the Pain: Applying Automation to Cluster Turnups 75
Borg: Birth of the Warehouse-Scale Computer 81
Reliability Is the Fundamental Feature 83
Recommendations 84
8.Release Engineering 87
The Role of a Release Engineer 87
Philosophy 88
Continuous Build and Deployment 90
Configuration Management 93
Conclusions 95
9.Simplicity 97
System Stability Versus Agility 97
The Virtue of Boring 98
I Won’t Give Up My Code! 98
The “Negative Lines of Code” Metric 99
Minimal APIs 99
Modularity 100
Release Simplicity 100
A Simple Conclusion 101
Part Ⅲ.Practices 107
10.Practical Alerting from Time-Series Data 107
The Rise of Borgmon 108
Instrumentation of Applications 109
Collection of Exported Data 110
Storage in the Time-Series Arena 111
Rule Evaluation 114
Alerting 118
Sharding the Monitoring Topology 119
Black-Box Monitoring 120
Maintaining the Configuration 121
Ten Years On... 122
11.Being On-Call 125
Introduction 125
Life of an On-Call Engineer 126
Balanced On-Call 127
Feeling Safe 128
Avoiding Inappropriate Operational Load 130
Conclusions 132
12.Effective Troubleshooting 133
Theory 134
In Practice 136
Negative Results Are Magic 144
Case Study 146
Making Troubleshooting Easier 150
Conclusion 150
13.Emergency Response 151
What to Do When Systems Break 151
Test-Induced Emergency 152
Change-Induced Emergency 153
Process-Induced Emergency 155
All Problems Have Solutions 158
Learn from the Past. Don’t Repeat It. 158
Conclusion 159
14.Managing Incidents 161
Unmanaged Incidents 161
The Anatomy of an Unmanaged Incident 162
Elements of Incident Management Process 163
A Managed Incident 165
When to Declare an Incident 166
In Summary 166
15.Postmortem Culture: Learning from Failure 169
Google’s Postmortem Philosophy 169
Collaborate and Share Knowledge 171
Introducing a Postmortem Culture 172
Conclusion and Ongoing Improvements 175
16.Tracking Outages 177
Escalator 178
Outalator 178
17.Testing for Reliability 183
Types of Software Testing 185
Creating a Test and Build Environment 190
Testing at Scale 192
Conclusion 204
18.Software Engineering in SRE 205
Why Is Software Engineering Within SRE Important? 205
Auxon Case Study: Project Background and Problem Space 207
Intent-Based Capacity Planning 209
Fostering Software Engineering in SRE 218
Conclusions 222
19.Load Balancing at the Frontend 223
Power Isn’t the Answer 223
Load Balancing Using DNS 224
Load Balancing at the Virtual IP Address 227
20.Load Balancing in the Datacenter 231
The Ideal Case 232
Identifying Bad Tasks: Flow Control and Lame Ducks 233
Limiting the Connections Pool with Subsetting 235
Load Balancing Policies 240
21.Handling Overload 247
The Pitfalls of “Queries per Second” 248
Per-Customer Limits 248
Client-Side Throttling 249
Criticality 251
Utilization Signals 253
Handling Overload Errors 253
Load from Connections 257
Conclusions 258
22.Addressing Cascading Failures 259
Causes of Cascading Failures and Designing to Avoid Them 260
Preventing Server Overload 265
Slow Startup and Cold Caching 274
Triggering Conditions for Cascading Failures 276
Testing for Cascading Failures 278
Immediate Steps to Address Cascading Failures 280
Closing Remarks 283
23.Managing Critical State: Distributed Consensus for Reliability 285
Motivating the Use of Consensus: Distributed Systems Coordination Failure 288
How Distributed Consensus Works 289
System Architecture Patterns for Distributed Consensus 291
Distributed Consensus Performance 296
Deploying Distributed Consensus-Based Systems 304
Monitoring Distributed Consensus Systems 312
Conclusion 313
24.Distributed Periodic Scheduling with Cron 315
Cron 315
Cron Jobs and Idempotency 316
Cron at Large Scale 317
Building Cron at Google 319
Summary 326
25.Data Processing Pipelines 327
Origin of the Pipeline Design Pattern 327
Initial Effect of Big Data on the Simple Pipeline Pattern 328
Challenges with the Periodic Pipeline Pattern 328
Trouble Caused By Uneven Work Distribution 328
Drawbacks of Periodic Pipelines in Distributed Environments 329
Introduction to Google Workflow 333
Stages of Execution in Workflow 335
Ensuring Business Continuity 337
Summary and Concluding Remarks 338
26.Data Integrity: What You Read Is What You Wrote 339
Data Integrity’s Strict Requirements 340
Google SRE Objectives in Maintaining Data Integrity and Availability 344
How Google SRE Faces the Challenges of Data Integrity 349
Case Studies 360
General Principles of SRE as Applied to Data Integrity 367
Conclusion 368
27.Reliable Product Launches at Scale 369
Launch Coordination Engineering 370
Setting Up a Launch Process 372
Developing a Launch Checklist 375
Selected Techniques for Reliable Launches 380
Development of LCE 384
Conclusion 387
Part Ⅳ.Management 391
28.Accelerating SREs to On-Call and Beyond 391
You’ve Hired Your Next SRE(s), Now What? 391
Initial Learning Experiences: The Case for Structure Over Chaos 394
Creating Stellar Reverse Engineers and Improvisational Thinkers 397
Five Practices for Aspiring On-Callers 400
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education 406
Closing Thoughts 406
29.Dealing with Interrupts 407
Managing Operational Load 408
Factors in Determining How Interrupts Are Handled 408
Imperfect Machines 409
30.Embedding an SRE to Recover from Operational Overload 417
Phase 1: Learn the Service and Get Context 418
Phase 2: Sharing Context 420
Phase 3: Driving Change 421
Conclusion 423
31.Communication and Collaboration in SRE 425
Communications: Production Meetings 426
Collaboration within SRE 430
Case Study of Collaboration in SRE: Viceroy 432
Collaboration Outside SRE 437
Case Study: Migrating DFP to F1 437
Conclusion 440
32.The Evolving SRE Engagement Model 441
SRE Engagement: What, How, and Why 441
The PRR Model 442
The SRE Engagement Model 443
Production Readiness Reviews: Simple PRR Model 444
Evolving the Simple PRR Model: Early Engagement 448
Evolving Services Development: Frameworks and SRE Platform 451
Conclusion 456
Part Ⅴ.Conclusions 459
33.Lessons Learned from Other Industries 459
Meet Our Industry Veterans 460
Preparedness and Disaster Testing 462
Postmortem Culture 465
Automating Away Repetitive Work and Operational Overhead 467
Structured and Rational Decision Making 469
Conclusions 470
34.Conclusion 473
A.Availability Table 477
B.A Collection of Best Practices for Production Services 479
C.Example Incident State Document 485
D.Example Postmortem 487
E.Launch Coordination Checklist 493
F.Example Production Meeting Minutes 497
Bibliography 501
Index 513