"Big Data Analytics: R Language Implementation" (《大数据分析 R语言实现》) PDF download

  • Author: (UK) Simon Walkowiak
  • Publisher: Nanjing: Southeast University Press
  • Year of publication: 2017
  • ISBN: 9787564173616
  • Pages: 490
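About the book: Big Data analytics is the process of examining large, complex data sets that often exceed the computing power available to you. R, a leading programming language for data science, includes many powerful functions for tackling problems related to Big Data processing. The book begins with a brief overview of the Big Data field and its current industry standards. It then introduces the R language, covering its development, structure, real-world applications, and shortcomings, followed by a review of the main R functions for data management and transformation. Readers learn about cloud-based Big Data solutions (such as Amazon EC2 and Amazon RDS, and Microsoft Azure with its HDInsight clusters) and how to connect R to relational and non-relational databases (such as MongoDB and HBase). The book also covers Big Data tools such as Apache Hadoop, HDFS, and MapReduce, along with other R-compatible tools such as Apache Spark, its machine learning library Spark MLlib, and H2O.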

Preface 1

Chapter 1: The Era of Big Data 7

Big Data - The monster re-defined 7

Big Data toolbox - dealing with the giant 11

Hadoop - the elephant in the room 12

Databases 15

Hadoop Spark-ed up 16

R - The unsung Big Data hero 17

Summary 24

Chapter 2: Introduction to R Programming Language and Statistical Environment 25

Learning R 25

Revisiting R basics 28

Getting R and RStudio ready 28

Setting the URLs to R repositories 30

R data structures 32

Vectors 32

Scalars 35

Matrices 35

Arrays 37

Data frames 38

Lists 41

Exporting R data objects 42

Applied data science with R 47

Importing data from different formats 48

Exploratory Data Analysis 50

Data aggregations and contingency tables 53

Hypothesis testing and statistical inference 56

Tests of differences 57

Independent t-test example (with power and effect size estimates) 57

ANOVA example 60

Tests of relationships 63

An example of Pearson's r correlations 63

Multiple regression example 65

Data visualization packages 70

Summary 71

Chapter 3: Unleashing the Power of R from Within 73

Traditional limitations of R 74

Out-of-memory data 74

Processing speed 75

To the memory limits and beyond 76

Data transformations and aggregations with the ff and ffbase packages 76

Generalized linear models with the ff and ffbase packages 87

Logistic regression example with ffbase and biglm 89

Expanding memory with the bigmemory package 97

Parallel R 106

From bigmemory to faster computations 107

An apply() example with the big.matrix object 108

A for() loop example with the ffdf object 108

Using apply() and for() loop examples on a data.frame 109

A parallel package example 110

A foreach package example 113

The future of parallel processing in R 115

Utilizing Graphics Processing Units with R 115

Multi-threading with Microsoft R Open distribution 117

Parallel machine learning with H2O and R 118

Boosting R performance with the data.table package and other tools 118

Fast data import and manipulation with the data.table package 118

Data import with data.table 119

Lightning-fast subsets and aggregations on data.table 120

Chaining, more complex aggregations, and pivot tables with data.table 123

Writing better R code 126

Summary 127

Chapter 4: Hadoop and MapReduce Framework for R 129

Hadoop architecture 130

Hadoop Distributed File System 130

MapReduce framework 131

A simple MapReduce word count example 132

Other Hadoop native tools 134

Learning Hadoop 136

A single-node Hadoop in Cloud 137

Deploying Hortonworks Sandbox on Azure 138

A word count example in Hadoop using Java 159

A word count example in Hadoop using the R language 169

RStudio Server on a Linux RedHat/CentOS virtual machine 169

Installing and configuring RHadoop packages 177

HDFS management and MapReduce in R - a word count example 179

HDInsight - a multi-node Hadoop cluster on Azure 194

Creating your first HDInsight cluster 194

Creating a new Resource Group 195

Deploying a Virtual Network 197

Creating a Network Security Group 200

Setting up and configuring an HDInsight cluster 203

Starting the cluster and exploring Ambari 211

Connecting to the HDInsight cluster and installing RStudio Server 215

Adding a new inbound security rule for port 8787 218

Editing the Virtual Network's public IP address for the head node 221

Smart energy meter readings analysis example - using R on HDInsight cluster 229

Summary 241

Chapter 5: R with Relational Database Management Systems (RDBMSs) 243

Relational Database Management Systems (RDBMSs) 244

A short overview of used RDBMSs 244

Structured Query Language (SQL) 245

SQLite with R 247

Preparing and importing data into a local SQLite database 248

Connecting to SQLite from RStudio 250

MariaDB with R on an Amazon EC2 instance 255

Preparing the EC2 instance and RStudio Server for use 255

Preparing MariaDB and data for use 257

Working with MariaDB from RStudio 266

PostgreSQL with R on Amazon RDS 281

Launching an Amazon RDS database instance 281

Preparing and uploading data to Amazon RDS 290

Remotely querying PostgreSQL on Amazon RDS from RStudio 304

Summary 314

Chapter 6: R with Non-Relational (NoSQL) Databases 315

Introduction to NoSQL databases 315

Review of leading non-relational databases 316

MongoDB with R 319

Introduction to MongoDB 319

MongoDB data models 319

Installing MongoDB with R on Amazon EC2 322

Processing Big Data using MongoDB with R 325

Importing data into MongoDB and basic MongoDB commands 326

MongoDB with R using the rmongodb package 333

MongoDB with R using the RMongo package 346

MongoDB with R using the mongolite package 350

HBase with R 355

Azure HDInsight with HBase and RStudio Server 355

Importing the data to HDFS and HBase 363

Reading and querying HBase using the rhbase package 367

Summary 372

Chapter 7: Faster than Hadoop - Spark with R 373

Spark for Big Data analytics 374

Spark with R on a multi-node HDInsight cluster 375

Launching HDInsight with Spark and R/RStudio 375

Reading the data into HDFS and Hive 383

Getting the data into HDFS 385

Importing data from HDFS to Hive 386

Bay Area Bike Share analysis using SparkR 393

Summary 411

Chapter 8: Machine Learning Methods for Big Data in R 413

What is machine learning? 414

Supervised and unsupervised machine learning methods 415

Classification and clustering algorithms 416

Machine learning methods with R 417

Big Data machine learning tools 418

GLM example with Spark and R on the HDInsight cluster 419

Preparing the Spark cluster and reading the data from HDFS 419

Logistic regression in Spark with R 425

Naive Bayes with H2O on Hadoop with R 437

Running an H2O instance on Hadoop with R 437

Reading and exploring the data in H2O 441

Naive Bayes on H2O with R 446

Neural Networks with H2O on Hadoop with R 458

How do Neural Networks work? 458

Running Deep Learning models on H2O 461

Summary 469

Chapter 9: The Future of R - Big, Fast, and Smart Data 471

The current state of Big Data analytics with R 471

Out-of-memory data on a single machine 471

Faster data processing with R 473

Hadoop with R 475

Spark with R 476

R with databases 477

Machine learning with R 478

The future of R 478

Big Data 479

Fast data 480

Smart data 481

Where to go next 482

Summary 482

Index 483