Contents

BIG DATA

Part: Fundamentals

Chapter 1 Overview of Big Data

1.1 Data and Big Data

1.1.1 The Rapid Growth of Data

1.1.2 Big Data

1.1.3 Paradigms of Science

1.2 Where Big Data Comes From

1.3 Application Scenarios for Big Data

1.4 How Big Data Influences Ways of Thinking

1.5 Data Mining and Machine Learning

1.6 Basic Workflow of a Data Science Project

1.7 Data Security and Big Data Ethics

1.7.1 Data Security

1.7.2 Big Data Ethics

1.8 Big Data Issues at the National Level

1.8.1 Data Sovereignty

1.8.2 Big Data and National Governance

1.8.3 Big Data Reshaping the Global Landscape

1.8.4 China's National Big Data Strategy

1.9 Cloud Computing

1.9.1 Characteristics of Cloud Computing

1.9.2 Typical Service Models of Cloud Computing

1.9.3 Deployment Environments for Cloud Computing Services




A Basic Course in Big Data Applications


1.9.4 The Relationship Between Cloud Computing and Big Data

1.10 The Internet of Things

1.11 The Digital Economy

1.11.1 Big Data and the Digital Economy

1.11.2 Further Advancing the Development of China's Digital Economy

Chapter Summary

Exercises

Chapter 2 Python and Commonly Used Libraries

2.1 Introduction to Python

2.1.1 The Origins of Python

2.1.2 The Python Community

2.1.3 Python Versions

2.1.4 Why Use Python for Data Analysis

2.2 Installing and Running Python

2.2.1 Introduction to Anaconda and Its Installation

2.2.2 Running Python

2.2.3 Summary

2.3 Python Language Basics

2.3.1 Data Structures

2.3.2 Code Structures

2.3.3 Summary

2.4 Common Python Libraries for Data Analysis

2.4.1 Introduction to NumPy

2.4.2 Introduction to pandas

2.4.3 Summary

Chapter Summary

Exercises

Part: Data Analysis

Chapter 3 Data Acquisition

3.1 Data Sources

3.2 Web Data Crawling




3.2.1 Overview of Web Crawlers

3.2.2 Basics of Web Page Access

3.2.3 Crawling Web Page Data

3.2.4 Parsing Web Page Content

3.2.5 Common "Crawling vs. Anti-Crawling" Attack and Defense Strategies

3.3 Web Data Collectors

3.3.1 Common Collectors

3.3.2 Collection Case Study with 八爪鱼 (Octoparse)

3.4 Acquiring Data with Selenium

3.4.1 Installing Selenium

3.4.2 Retrieving Page Elements with Selenium

3.4.3 Selenium Application: Acquiring Second-Hand Housing Data from 链家 (Lianjia)

Chapter Summary

Exercises

Chapter 4 Data Storage

4.1 Files

4.2 Traditional Database Technology

4.2.1 Database Management Systems

4.2.2 Conceptual Models of Databases

4.2.3 Relational Databases

4.2.4 Structured Query Language (SQL)

4.2.5 MySQL Database Management

4.2.6 Basic Database Operations with MySQL monitor

4.2.7 Basic Database Operations with HeidiSQL

4.3 NoSQL Databases

4.3.1 Background of NoSQL's Development

4.3.2 Types of NoSQL Databases

Chapter Summary

Exercises

Chapter 5 Data Preprocessing

5.1 Data Quality Issues

5.1.1 "Dirty" Data in the Real World

5.1.2 Causes of Data Quality Issues




5.1.3 Data Quality Auditing

5.2 Data Preprocessing Techniques

5.2.1 Data Cleaning

5.2.2 Data Integration

5.2.3 Data Transformation

5.2.4 Data Reduction

5.3 A Preprocessing Case Study

Chapter Summary

Exercises

Chapter 6 Data Visualization

6.1 Overview of Data Visualization

6.1.1 What Is Data Visualization

6.1.2 Common Data Visualization Tools

6.1.3 Python Visualization Libraries

6.2 Data Visualization with Matplotlib

6.2.1 Matplotlib Plotting Basics

6.2.2 Common Matplotlib Plots

6.2.3 Drawing 3D Graphics with mplot3d

6.3 Data Visualization with pandas

6.3.1 pandas Plotting Basics

6.3.2 Common pandas Plots

6.4 Data Visualization with seaborn

6.4.1 seaborn Plotting Basics

6.4.2 Common seaborn Plots

6.5 Data Visualization with pyecharts

6.5.1 pyecharts Plotting Basics

6.5.2 Common pyecharts Plots

Chapter Summary

Exercises

Chapter 7 Data Analysis Methods

7.1 Mathematical Foundations of Data Analysis Methods

7.1.1 Understanding Differentiation of Composite Functions

7.1.2 Understanding Partial Derivatives of Multivariable Functions




7.1.3 Understanding the Least Squares Method

7.1.4 Understanding Gradients

7.1.5 Understanding Probability

7.1.6 Understanding Conditional Probability

7.1.7 Understanding Bayes' Formula

7.2 Regression

7.2.1 Basic Concepts and Methods of Regression

7.2.2 Performance Metrics for Regression Prediction

7.2.3 Linear Regression

7.3 Classification

7.3.1 Basic Methods of Classification

7.3.2 Performance Metrics for Classification Tasks

7.3.3 Logistic Regression

7.3.4 Support Vector Machines

7.3.5 Decision Tree Theory

7.3.6 Naive Bayes

7.3.7 The k-Nearest Neighbors (k-NN) Algorithm

7.4 Clustering

7.4.1 Clustering Algorithms

7.4.2 The K-means Clustering Algorithm

7.4.3 K-means Clustering Case Study

7.5 Text Analysis

7.5.1 Basic Steps of Text Analysis

7.5.2 Basic Concepts of Text Analysis

7.5.3 Text Analysis Case Study

Chapter Summary

Exercises

Part: Big Data Platforms

Chapter 8 Linux Operating System Basics

8.1 Introduction to the Linux Operating System

8.1.1 Operating Systems

8.1.2 The Linux Operating System




8.1.3 Why Big Data Platforms Are Built on the Linux Operating System

8.2 Basic Linux Commands

8.2.1 Directory and File Operation Commands

8.2.2 Text Filtering and Processing

8.2.3 Shell Input and Output Commands

8.2.4 Process Management Commands

8.2.5 Everyday Operation Commands

Chapter Summary

Exercises

Chapter 9 Big Data Management Platforms

9.1 Application Scenarios

9.2 Development History

9.3 Technology Architecture

9.3.1 Data Collection Layer

9.3.2 Data Storage Layer

9.3.3 Resource Management Layer

9.3.4 Compute Engine Layer

9.3.5 Data Analysis Layer

9.3.6 Data Visualization Layer

9.3.7 Technology Stack of Big Data Management Platforms

Chapter Summary

Exercises

Chapter 10 Distributed Storage

10.1 Introduction to HDFS

10.2 Basic Architecture of HDFS

10.3 Accessing HDFS via the HDFS Shell

Chapter Summary

Exercises

Chapter 11 Distributed Processing

11.1 The Idea of Distributed Computing

11.2 MapReduce




11.2.1 Introduction to MapReduce

11.2.2 The MapReduce Programming Model

11.2.3 MapReduce Program Case Study

11.3 Spark

11.3.1 Introduction to Spark

11.3.2 The Spark Programming Model

11.3.3 Spark Program Case Study

11.4 Advantages of Spark over Hadoop

Chapter Summary

Exercises

References

Appendix A Installing Linux in a Virtual Machine

A.1 Overview of Virtual Machine Technology

A.2 Installing Virtual Machine Hosting Software

A.3 Installing Linux in the Virtual Machine

Appendix B Installing Hadoop and Spark

B.1 Basic Cluster Configuration

B.2 Installing Hadoop

B.3 Installing Spark