Hadoop Application Architectures (Hadoop应用架构)
- Authors: Mark Grover, Ted Malaska, Jonathan Seidman, et al.
- Publisher: Southeast University Press, Nanjing
- Year of publication: 2017
- ISBN: 9787564170011
- Pages: 376
Part I. Architectural Considerations for Hadoop Applications 1
1. Data Modeling in Hadoop 1
Data Storage Options 2
Standard File Formats 4
Hadoop File Types 5
Serialization Formats 7
Columnar Formats 9
Compression 12
HDFS Schema Design 14
Location of HDFS Files 16
Advanced HDFS Schema Design 17
HDFS Schema Design Summary 21
HBase Schema Design 21
Row Key 22
Timestamp 25
Hops 25
Tables and Regions 26
Using Columns 28
Using Column Families 30
Time-to-Live 30
Managing Metadata 31
What Is Metadata? 31
Why Care About Metadata? 32
Where to Store Metadata? 32
Examples of Managing Metadata 34
Limitations of the Hive Metastore and HCatalog 34
Other Ways of Storing Metadata 35
Conclusion 36
2. Data Movement 39
Data Ingestion Considerations 39
Timeliness of Data Ingestion 40
Incremental Updates 42
Access Patterns 43
Original Source System and Data Structure 44
Transformations 47
Network Bottlenecks 48
Network Security 49
Push or Pull 49
Failure Handling 50
Level of Complexity 51
Data Ingestion Options 51
File Transfers 52
Considerations for File Transfers versus Other Ingest Methods 55
Sqoop: Batch Transfer Between Hadoop and Relational Databases 56
Flume: Event-Based Data Collection and Processing 61
Kafka 71
Data Extraction 76
Conclusion 77
3. Processing Data in Hadoop 79
MapReduce 80
MapReduce Overview 80
Example for MapReduce 88
When to Use MapReduce 94
Spark 95
Spark Overview 95
Overview of Spark Components 96
Basic Spark Concepts 97
Benefits of Using Spark 100
Spark Example 102
When to Use Spark 104
Abstractions 104
Pig 106
Pig Example 106
When to Use Pig 109
Crunch 110
Crunch Example 110
When to Use Crunch 115
Cascading 115
Cascading Example 116
When to Use Cascading 119
Hive 119
Hive Overview 119
Example of Hive Code 121
When to Use Hive 125
Impala 126
Impala Overview 127
Speed-Oriented Design 128
Impala Example 130
When to Use Impala 131
Conclusion 132
4. Common Hadoop Processing Patterns 135
Pattern: Removing Duplicate Records by Primary Key 135
Data Generation for Deduplication Example 136
Code Example: Spark Deduplication in Scala 137
Code Example: Deduplication in SQL 139
Pattern: Windowing Analysis 140
Data Generation for Windowing Analysis Example 141
Code Example: Peaks and Valleys in Spark 142
Code Example: Peaks and Valleys in SQL 146
Pattern: Time Series Modifications 147
Use HBase and Versioning 148
Use HBase with a RowKey of RecordKey and StartTime 149
Use HDFS and Rewrite the Whole Table 149
Use Partitions on HDFS for Current and Historical Records 150
Data Generation for Time Series Example 150
Code Example: Time Series in Spark 151
Code Example: Time Series in SQL 154
Conclusion 157
5. Graph Processing on Hadoop 159
What Is a Graph? 159
What Is Graph Processing? 161
How Do You Process a Graph in a Distributed System? 162
The Bulk Synchronous Parallel Model 163
BSP by Example 163
Giraph 165
Read and Partition the Data 166
Batch Process the Graph with BSP 168
Write the Graph Back to Disk 172
Putting It All Together 173
When Should You Use Giraph? 174
GraphX 174
Just Another RDD 175
GraphX Pregel Interface 177
vprog() 178
sendMessage() 179
mergeMessage() 179
Which Tool to Use? 180
Conclusion 180
6. Orchestration 183
Why We Need Workflow Orchestration 183
The Limits of Scripting 184
The Enterprise Job Scheduler and Hadoop 186
Orchestration Frameworks in the Hadoop Ecosystem 186
Oozie Terminology 188
Oozie Overview 188
Oozie Workflow 191
Workflow Patterns 194
Point-to-Point Workflow 194
Fan-Out Workflow 196
Capture-and-Decide Workflow 198
Parameterizing Workflows 201
Classpath Definition 203
Scheduling Patterns 204
Frequency Scheduling 205
Time and Data Triggers 205
Executing Workflows 210
Conclusion 210
7. Near-Real-Time Processing with Hadoop 213
Stream Processing 215
Apache Storm 217
Storm High-Level Architecture 218
Storm Topologies 219
Tuples and Streams 221
Spouts and Bolts 221
Stream Groupings 222
Reliability of Storm Applications 223
Exactly-Once Processing 223
Fault Tolerance 224
Integrating Storm with HDFS 225
Integrating Storm with HBase 225
Storm Example: Simple Moving Average 226
Evaluating Storm 233
Trident 233
Trident Example: Simple Moving Average 234
Evaluating Trident 237
Spark Streaming 237
Overview of Spark Streaming 238
Spark Streaming Example: Simple Count 238
Spark Streaming Example: Multiple Inputs 240
Spark Streaming Example: Maintaining State 241
Spark Streaming Example: Windowing 243
Spark Streaming Example: Streaming versus ETL Code 244
Evaluating Spark Streaming 245
Flume Interceptors 246
Which Tool to Use? 247
Low-Latency Enrichment, Validation, Alerting, and Ingestion 247
NRT Counting, Rolling Averages, and Iterative Processing 248
Complex Data Pipelines 249
Conclusion 250
Part II. Case Studies 250
8. Clickstream Analysis 253
Defining the Use Case 253
Using Hadoop for Clickstream Analysis 255
Design Overview 256
Storage 257
Ingestion 260
The Client Tier 264
The Collector Tier 266
Processing 268
Data Deduplication 270
Sessionization 272
Analyzing 275
Orchestration 276
Conclusion 279
9. Fraud Detection 281
Continuous Improvement 281
Taking Action 282
Architectural Requirements of Fraud Detection Systems 283
Introducing Our Use Case 283
High-Level Design 284
Client Architecture 286
Profile Storage and Retrieval 287
Caching 288
HBase Data Definition 289
Delivering Transaction Status: Approved or Denied? 294
Ingest 295
Path Between the Client and Flume 296
Near-Real-Time and Exploratory Analytics 302
Near-Real-Time Processing 302
Exploratory Analytics 304
What About Other Architectures? 305
Flume Interceptors 305
Kafka to Storm or Spark Streaming 306
External Business Rules Engine 306
Conclusion 307
10. Data Warehouse 309
Using Hadoop for Data Warehousing 312
Defining the Use Case 314
OLTP Schema 316
Data Warehouse: Introduction and Terminology 317
Data Warehousing with Hadoop 319
High-Level Design 319
Data Modeling and Storage 320
Ingestion 332
Data Processing and Access 337
Aggregations 341
Data Export 343
Orchestration 344
Conclusion 345
A. Joins in Impala 347
Index 353