1 Data Mining 1
1.1 What is Data Mining? 1
1.2 Statistical Limits on Data Mining 4
1.3 Things Useful to Know 7
1.4 Outline of the Book 15
1.5 Summary of Chapter 1 16
1.6 References for Chapter 1 17
2 Large-Scale File Systems and Map-Reduce 18
2.1 Distributed File Systems 18
2.2 Map-Reduce 21
2.3 Algorithms Using Map-Reduce 26
2.4 Extensions to Map-Reduce 37
2.5 Efficiency of Cluster-Computing Algorithms 42
2.6 Summary of Chapter 2 49
2.7 References for Chapter 2 51
3 Finding Similar Items 53
3.1 Applications of Near-Neighbor Search 53
3.2 Shingling of Documents 57
3.3 Similarity-Preserving Summaries of Sets 60
3.4 Locality-Sensitive Hashing for Documents 67
3.5 Distance Measures 71
3.6 The Theory of Locality-Sensitive Functions 77
3.7 LSH Families for Other Distance Measures 83
3.8 Applications of Locality-Sensitive Hashing 88
3.9 Methods for High Degrees of Similarity 96
3.10 Summary of Chapter 3 104
3.11 References for Chapter 3 106
4 Mining Data Streams 108
4.1 The Stream Data Model 108
4.2 Sampling Data in a Stream 112
4.3 Filtering Streams 115
4.4 Counting Distinct Elements in a Stream 118
4.5 Estimating Moments 122
4.6 Counting Ones in a Window 127
4.7 Decaying Windows 133
4.8 Summary of Chapter 4 136
4.9 References for Chapter 4 137
5 Link Analysis 139
5.1 PageRank 139
5.2 Efficient Computation of PageRank 153
5.3 Topic-Sensitive PageRank 159
5.4 Link Spam 163
5.5 Hubs and Authorities 167
5.6 Summary of Chapter 5 172
5.7 References for Chapter 5 175
6 Frequent Itemsets 176
6.1 The Market-Basket Model 176
6.2 Market Baskets and the A-Priori Algorithm 183
6.3 Handling Larger Datasets in Main Memory 192
6.4 Limited-Pass Algorithms 199
6.5 Counting Frequent Items in a Stream 205
6.6 Summary of Chapter 6 209
6.7 References for Chapter 6 211
7 Clustering 213
7.1 Introduction to Clustering Techniques 213
7.2 Hierarchical Clustering 217
7.3 K-means Algorithms 226
7.4 The CURE Algorithm 234
7.5 Clustering in Non-Euclidean Spaces 237
7.6 Clustering for Streams and Parallelism 241
7.7 Summary of Chapter 7 247
7.8 References for Chapter 7 250
8 Advertising on the Web 252
8.1 Issues in On-Line Advertising 252
8.2 On-Line Algorithms 255
8.3 The Matching Problem 258
8.4 The Adwords Problem 261
8.5 Adwords Implementation 270
8.6 Summary of Chapter 8 273
8.7 References for Chapter 8 275
9 Recommendation Systems 277
9.1 A Model for Recommendation Systems 277
9.2 Content-Based Recommendations 281
9.3 Collaborative Filtering 291
9.4 Dimensionality Reduction 297
9.5 The NetFlix Challenge 305
9.6 Summary of Chapter 9 306
9.7 References for Chapter 9 308
Index 310