《Mining of massive datasets》PDF下载

购买积分：12 如何计算积分？
作　　者：Anand Rajaraman ; Jeffrey D. Ullman
出版社：Cambridge University Press
出版年份：2012
ISBN：1107015357
页数：316 页

图书介绍：

点击购买此书全本PDF电子书

1 Data Mining 1

1.1 What is Data Mining？ 1

1.2 Statistical Limits on Data Mining 4

1.3 Things Useful to Know 7

1.4 Outline of the Book 15

1.5 Summary of Chapter 1 16

1.6 References for Chapter 1 17

2 Large-Scale File Systems and Map-Reduce 18

2.1 Distributed File Systems 18

2.2 Map-Reduce 21

2.3 Algorithms Using Map-Reduce 26

2.4 Extensions to Map-Reduce 37

2.5 Efficiency of Cluster-Computing Algorithms 42

2.6 Summary of Chapter 2 49

2.7 References for Chapter 2 51

3 Finding Similar Items 53

3.1 Applications of Near-Neighbor Search 53

3.2 Shingling of Documents 57

3.3 Similarity-Preserving Summaries of Sets 60

3.4 Locality-Sensitive Hashing for Documents 67

3.5 Distance Measures 71

3.6 The Theory of Locality-Sensitive Functions 77

3.7 LSH Families for Other Distance Measures 83

3.8 Applications of Locality-Sensitive Hashing 88

3.9 Methods for High Degrees of Similarity 96

3.10 Summary of Chapter 3 104

3.11 References for Chapter 3 106

4 Mining Data Streams 108

4.1 The Stream Data Model 108

4.2 Sampling Data in a Stream 112

4.3 Filtering Streams 115

4.4 Counting Distinct Elements in a Stream 118

4.5 Estimating Moments 122

4.6 Counting Ones in a Window 127

4.7 Decaying Windows 133

4.8 Summary of Chapter 4 136

4.9 References for Chapter 4 137

5 Link Analysis 139

5.1 PageRank 139

5.2 Efficient Computation of PageRank 153

5.3 Topic-Sensitive PageRank 159

5.4 Link Spam 163

5.5 Hubs and Authorities 167

5.6 Summary of Chapter 5 172

5.7 References for Chapter 5 175

6 Frequent Itemsets 176

6.1 The Market-Basket Model 176

6.2 Market Baskets and the A-Priori Algorithm 183

6.3 Handling Larger Datasets in Main Memory 192

6.4 Limited-Pass Algorithms 199

6.5 Counting Frequent Items in a Stream 205

6.6 Summary of Chapter 6 209

6.7 References for Chapter 6 211

7 Clustering 213

7.1 Introduction to Clustering Techniques 213

7.2 Hierarchical Clustering 217

7.3 K-means Algorithms 226

7.4 The CURE Algorithm 234

7.5 Clustering in Non-Euclidean Spaces 237

7.6 Clustering for Streams and Parallelism 241

7.7 Summary of Chapter 7 247

7.8 References for Chapter 7 250

8 Advertising on the Web 252

8.1 Issues in On-Line Advertising 252

8.2 On-Line Algorithms 255

8.3 The Matching Problem 258

8.4 The Adwords Problem 261

8.5 Adwords Implementation 270

8.6 Summary of Chapter 8 273

8.7 References for Chapter 8 275

9 Recommendation Systems 277

9.1 A Model for Recommendation Systems 277

9.2 Content-Based Recommendations 281

9.3 Collaborative Filtering 291

9.4 Dimensionality Reduction 297

9.5 The NetFlix Challenge 305

9.6 Summary of Chapter 9 306

9.7 References for Chapter 9 308

Index 310