CONTENTS
BIG DATA 大数据应用基础教程 (A Basic Course in Big Data Applications)

Part: Fundamentals

Chapter 1 Overview of Big Data ····· 003
  1.1 Data and Big Data ····· 003
    1.1.1 The Rapid Growth of Data ····· 003
    1.1.2 Big Data ····· 004
    1.1.3 Paradigms of Science ····· 006
  1.2 Where Big Data Comes From ····· 007
  1.3 Application Scenarios of Big Data ····· 008
  1.4 The Impact of Big Data on Ways of Thinking ····· 010
  1.5 Data Mining and Machine Learning ····· 011
  1.6 The Basic Workflow of a Data Science Project ····· 012
  1.7 Data Security and Big Data Ethics ····· 013
    1.7.1 Data Security ····· 013
    1.7.2 Big Data Ethics ····· 015
  1.8 Big Data Issues at the National Level ····· 016
    1.8.1 Data Sovereignty ····· 016
    1.8.2 Big Data and National Governance ····· 017
    1.8.3 Big Data Reshaping the World Order ····· 018
    1.8.4 China's National Big Data Strategy ····· 019
  1.9 Cloud Computing ····· 020
    1.9.1 Characteristics of Cloud Computing ····· 022
    1.9.2 Typical Service Models of Cloud Computing ····· 022
    1.9.3 Deployment Environments for Cloud Computing Services ····· 023
    1.9.4 The Relationship Between Cloud Computing and Big Data ····· 023
  1.10 Internet of Things ····· 023
  1.11 Digital Economy ····· 025
    1.11.1 Big Data and the Digital Economy ····· 026
    1.11.2 Further Advancing the Development of China's Digital Economy ····· 029
  Chapter Summary ····· 030
  Exercises ····· 032

Chapter 2 Python and Common Libraries ····· 033
  2.1 Introduction to Python ····· 033
    2.1.1 The Birth of Python ····· 033
    2.1.2 The Python Community ····· 034
    2.1.3 Python Versions ····· 034
    2.1.4 Reasons for Using Python for Data Analysis ····· 036
  2.2 Installing and Running Python ····· 037
    2.2.1 Introduction to Anaconda and Its Installation ····· 037
    2.2.2 Running Python ····· 041
    2.2.3 Summary ····· 046
  2.3 Python Language Basics ····· 046
    2.3.1 Data Structures ····· 046
    2.3.2 Code Structure ····· 058
    2.3.3 Summary ····· 069
  2.4 Common Python Libraries for Data Analysis ····· 069
    2.4.1 Introduction to NumPy ····· 069
    2.4.2 Introduction to pandas ····· 076
    2.4.3 Summary ····· 095
  Chapter Summary ····· 095
  Exercises ····· 096

Part: Data Analysis

Chapter 3 Data Acquisition ····· 101
  3.1 Data Sources ····· 101
  3.2 Web Data Crawling ····· 103
    3.2.1 Overview of Web Crawlers ····· 103
    3.2.2 Basics of Web Page Access ····· 104
    3.2.3 Crawling Web Page Data ····· 109
    3.2.4 Parsing Web Page Content ····· 111
    3.2.5 Common "Crawling vs. Anti-Crawling" Attack and Defense Strategies ····· 115
  3.3 Web Data Collectors ····· 118
    3.3.1 Common Collectors ····· 118
    3.3.2 Bazhuayu (八爪鱼) Collection Case Study ····· 118
  3.4 Acquiring Data with Selenium ····· 122
    3.4.1 Installing Selenium ····· 122
    3.4.2 Retrieving Page Elements with Selenium ····· 124
    3.4.3 Selenium Application: Acquiring Lianjia (链家) Second-Hand Housing Data ····· 126
  Chapter Summary ····· 130
  Exercises ····· 130

Chapter 4 Data Storage ····· 131
  4.1 Files ····· 131
  4.2 Traditional Database Technology ····· 133
    4.2.1 Database Management Systems ····· 133
    4.2.2 Conceptual Models of Databases ····· 134
    4.2.3 Relational Databases ····· 135
    4.2.4 Structured Query Language (SQL) ····· 136
    4.2.5 MySQL Database Administration ····· 137
    4.2.6 Basic Database Operations with MySQL monitor ····· 141
    4.2.7 Basic Database Operations with HeidiSQL ····· 145
  4.3 NoSQL Databases ····· 148
    4.3.1 Background of NoSQL ····· 148
    4.3.2 Types of NoSQL Databases ····· 149
  Chapter Summary ····· 152
  Exercises ····· 152

Chapter 5 Data Preprocessing ····· 153
  5.1 Data Quality Issues ····· 153
    5.1.1 "Dirty" Data in the Real World ····· 153
    5.1.2 Causes of Data Quality Issues ····· 155
    5.1.3 Data Quality Auditing ····· 156
  5.2 Data Preprocessing Techniques ····· 158
    5.2.1 Data Cleaning ····· 158
    5.2.2 Data Integration ····· 159
    5.2.3 Data Transformation ····· 160
    5.2.4 Data Reduction ····· 161
  5.3 Preprocessing Case Study ····· 162
  Chapter Summary ····· 166
  Exercises ····· 166

Chapter 6 Data Visualization ····· 167
  6.1 Overview of Data Visualization ····· 167
    6.1.1 What Is Data Visualization ····· 167
    6.1.2 Common Data Visualization Tools ····· 168
    6.1.3 Python Visualization Libraries ····· 169
  6.2 Data Visualization with Matplotlib ····· 170
    6.2.1 Matplotlib Plotting Basics ····· 170
    6.2.2 Common Matplotlib Plots ····· 172
    6.2.3 Drawing 3D Graphics with mplot3d ····· 180
  6.3 Data Visualization with pandas ····· 185
    6.3.1 pandas Plotting Basics ····· 185
    6.3.2 Common pandas Plots ····· 186
  6.4 Data Visualization with seaborn ····· 191
    6.4.1 seaborn Plotting Basics ····· 191
    6.4.2 Common seaborn Plots ····· 197
  6.5 Data Visualization with pyecharts ····· 201
    6.5.1 pyecharts Plotting Basics ····· 201
    6.5.2 Common pyecharts Plots ····· 201
  Chapter Summary ····· 208
  Exercises ····· 208

Chapter 7 Data Analysis Methods ····· 211
  7.1 Mathematical Foundations of Data Analysis Methods ····· 211
    7.1.1 Understanding Differentiation of Composite Functions ····· 211
    7.1.2 Understanding Partial Derivatives of Multivariable Functions ····· 212
    7.1.3 Understanding the Least Squares Method ····· 212
    7.1.4 Understanding Gradients ····· 213
    7.1.5 Understanding Probability ····· 213
    7.1.6 Understanding Conditional Probability ····· 214
    7.1.7 Understanding Bayes' Formula ····· 214
  7.2 Regression ····· 215
    7.2.1 Basic Concepts and Methods of Regression ····· 215
    7.2.2 Performance Metrics for Regression Prediction ····· 217
    7.2.3 Linear Regression ····· 218
  7.3 Classification ····· 227
    7.3.1 Basic Methods of Classification ····· 227
    7.3.2 Performance Metrics for Classification Tasks ····· 228
    7.3.3 Logistic Regression ····· 229
    7.3.4 Support Vector Machines ····· 240
    7.3.5 Decision Tree Theory ····· 254
    7.3.6 Naive Bayes ····· 258
    7.3.7 The k-Nearest Neighbors (k-NN) Algorithm ····· 262
  7.4 Clustering ····· 266
    7.4.1 Clustering Algorithms ····· 266
    7.4.2 The K-means Clustering Algorithm ····· 267
    7.4.3 K-means Clustering Case Study ····· 268
  7.5 Text Analysis ····· 276
    7.5.1 Basic Steps of Text Analysis ····· 277
    7.5.2 Basic Concepts of Text Analysis ····· 277
    7.5.3 Text Analysis Case Study ····· 278
  Chapter Summary ····· 286
  Exercises ····· 286

Part: Big Data Platforms

Chapter 8 Linux Operating System Basics ····· 289
  8.1 Introduction to the Linux Operating System ····· 289
    8.1.1 Operating Systems ····· 289
    8.1.2 The Linux Operating System ····· 290
    8.1.3 Why Big Data Platforms Are Based on the Linux Operating System ····· 293
  8.2 Basic Linux Commands ····· 293
    8.2.1 Directory and File Operation Commands ····· 293
    8.2.2 Text Filtering and Processing ····· 298
    8.2.3 Shell Input and Output Commands ····· 300
    8.2.4 Process Management Commands ····· 301
    8.2.5 Everyday Operation Commands ····· 303
  Chapter Summary ····· 306
  Exercises ····· 306

Chapter 9 Big Data Management Platforms ····· 307
  9.1 Application Scenarios ····· 307
  9.2 Development History ····· 309
  9.3 Technology Architecture ····· 311
    9.3.1 Data Collection Layer ····· 312
    9.3.2 Data Storage Layer ····· 313
    9.3.3 Resource Management Layer ····· 315
    9.3.4 Compute Engine Layer ····· 315
    9.3.5 Data Analysis Layer ····· 317
    9.3.6 Data Visualization Layer ····· 317
    9.3.7 Technology Stack of Big Data Management Platforms ····· 318
  Chapter Summary ····· 319
  Exercises ····· 319

Chapter 10 Distributed Storage ····· 321
  10.1 Introduction to HDFS ····· 321
  10.2 Basic Architecture of HDFS ····· 323
  10.3 Accessing HDFS via the Shell ····· 325
  Chapter Summary ····· 328
  Exercises ····· 328

Chapter 11 Distributed Processing ····· 329
  11.1 The Idea of Distributed Computing ····· 329
  11.2 MapReduce ····· 333
    11.2.1 Introduction to MapReduce ····· 333
    11.2.2 The MapReduce Programming Model ····· 334
    11.2.3 MapReduce Program Case Study ····· 335
  11.3 Spark ····· 341
    11.3.1 Introduction to Spark ····· 341
    11.3.2 The Spark Programming Model ····· 342
    11.3.3 Spark Program Case Study ····· 345
  11.4 Advantages of Spark over Hadoop ····· 352
  Chapter Summary ····· 353
  Exercises ····· 353

References ····· 355

Appendix A Installing a Linux System on a Virtual Machine ····· 359
  A.1 Overview of Virtual Machine Technology ····· 359
  A.2 Installing Virtual Machine Hosting Software ····· 360
  A.3 Installing Linux on the Virtual Machine ····· 362

Appendix B Installing Hadoop and Spark ····· 371
  B.1 Basic Cluster Configuration ····· 371
  B.2 Installing Hadoop ····· 375
  B.3 Installing Spark ····· 380