MachineLearning 2019-02-26

这里仅仅是编程的基础，和机器学习算法的简单理解和接口调用
随后我会慢慢的对每个机器学习算法的原理性内容进行补充
精力有限，这里直接分享了我朋友（郭伟|Welian）汇总的整篇笔记．

数据分析

数据分析DAY01

什么是数据分析？

数据分析是指用适当的统计分析方法对收集来的大量数据进行分析，提取有用的信息形成结论而对结论加以详细研究、概括、总结的过程。

使用python做数据分析的常用库

numpy 基础数值算法
scipy 科学计算
matplotlib 数据可视化
pandas 序列高级函数

numpy概述

Numerical Python, 数值的python，补充了Python语言所欠缺的数值计算能力。
Numpy是其他数据分析及机器学习库的底层库。
Numpy完全标准C语言实现，效率充分优化。
Numpy开源免费。

numpy的历史

1995年， Numeric发布（numpy的前身）。
2001年，Scipy->Numarray，提供多维数组运算。
2005年，Numeric+Numarray -> Numpy
2006年，Numpy脱离Scipy称为独立项目。

numpy的核心：多维数组

import numpy as np
array = np.array([1,2,3,4,5])
a = array + 10

numpy基础

ndarray数组

内存中的ndarray对象

元数据（metadata）

存储对目标数组的描述信息，如：dim count, dimensions, dtype, data等等。

实际数据

完整的数组内容

将实际数据与元数据分开存放，一方面提高了内存空间的使用效率，另一方面减少对实际数据的访问频率，提高效率。

Numpy数组是同质数组，所有元素的数据类型必须相同。

ndarray对象的创建

np.array(任何可以被解释为Numpy数组的逻辑结构)

import numpy as np
array = np.array([1,2,3,4,5])

np.arange(起始值(0)，终止值，步长(1))

print(np.arange(0, 10, 1))
print(np.arange(0, 10))
print(np.arange(10))

np.zeros(数组元素个数, dtype=’类型’)

np.ones(数组元素格式, dtype=’类型’)

a = np.zeros(10, dtype='int32')
print(a.dtype, a)
a = np.ones(10, dtype='float32')
print(a.dtype, a)
a = np.ones(5, dtype='bool')
print(a.dtype, a)

ndarray对象属性的基本操作

数组的维度：array.shape

a = np.arange(1, 9)
print(a.shape, a)

# 修改数组对象的维度
a.shape = (2, 4)
print(a.shape, a, sep='\n')
a.shape = (2, 2, 2)
print(a.shape, a, sep='\n')

元素的类型：array.dtype

# 测试元素的类型相关操作
b = np.arange(10)
print(b.dtype, b)
# 把b中的每个元素当做float来看
# 返回新类型的数组
c = b.astype('float32')
print(b.dtype, b)
print(c.dtype, c)

数组元素个数：array.size

# 测试数组元素的个数
a = np.arange(8).reshape(2, 4)
print(a)
print(a.size)
print(len(a))

数组元素索引（下标）

# 测试数组索引操作
c = np.arange(1, 28).reshape(3, 3, 3)
print(c)
print(c[0])  # 0页
print(c[0][0])  # 0页 0行
print(c[0][0][0])  # 0页 0行 0列
print(c[0, 0, 0])  # 0页 0行 0列

for i in range(c.shape[0]):
    for j in range(c.shape[1]):
        for k in range(c.shape[2]):
            print(c[i, j, k], end=' ')

ndarray对象属性操作详解

Numpy的内部基本数据类型

类型名	类型表示符
布尔型	bool_
有符号整数型	int8(-128~127)/int16/int32/int64
无符号整型	uint8/uint16/uint32/uint64
浮点型	float16/float32/float64
复数型	complex64/complex128
字符串型	str_ (每个字符32位Unicode编码)

自定义复合类型

"""
demo03_cls.py
测试ndarray的自定义复合类型
"""
import numpy as np

data = [
    ('zs', [90, 85, 80], 16),
    ('ls', [91, 81, 81], 17),
    ('ww', [92, 82, 82], 18)]

# 第一种设置dtype的方式
# 3个Unicode字符,3个32位整型组成的列表,整数
a = np.array(data, dtype='U3, 3int32, int32')
print(a)
# 获取a数组中第一个元素的第一个字段的值
print(a[0]['f0'])
print(a[-1]['f1'])

# 第二种设置dtype的方式
b = np.array(data, dtype={
    'names': ['name', 'scores', 'age'],
    'formats': ['U3', '3int32', 'int32']})
print(b)
print(b[0]['scores'])

# 第三种设置dtype的方式
c = np.array(data, dtype=[
    ('name', 'str_', 2),
    ('scores', 'int32', 3),
    ('age', 'int32', 1)])
print(c)
print(c[2]['name'])

# 第四种设置dtype的方式
# 存储数据时，
# name字段占用8字节，从0字节的位置开始输出
# scores字段占用12字节，从16字节的位置开始输出
# 虽然name字段没有把前16个字节充分利用，但是
# 在内存中字段的位置偏移量是一致的，访问
# 效率将会得到提高。
d = np.array(data, dtype={
    'name': ('U2', 0),
    'scores': ('3int32', 16),
    'age': ('int32', 28)})
print(d)
print(d[0]['name'])

# 测试日期类型数组
e = np.array(['2011', '2012-01-01',
              '2013-01-01 01:01:01',
              '2012-02-01'])
f = e.astype('M8[D]')
print(f.dtype, f)
print(f[2] + 1)

类型字符码

类型	字符码
np.bool_	?
np.int8/16/32/64	i1/i2/i4/i8
np.uint8/16/32/64	u1/u2/u4/u8
np.float16/32/64	f2/f4/f8
np.complex64/128	c8/c16
np.str_	U<字符数>
np.datetime64	M8[Y / M / D / h / m / s]

ndarray数组对象维度操作

视图变维(数据共享): reshape() ravel()

a = np.arange(1, 9)
print(a)
a = a.reshape(2, 4)
print(a)
b = a.reshape(2, 2, 2)
print(b)
a[0, 0] = 999
print(a)
print(b)
b = b.ravel()
print(b)

复制变维(数据独立): flatten()

# 复制变维 数据独立
c = a.flatten()
print(a)
print(c)
c[0] = 1000
print(a)
print(c)

就地变维: 使用shape属性直接改变原数组对象的维度，不返回新数组。 resize(2, 2)

# 就地变维
c.shape = (2, 4)
print(c)
c.resize(2, 2, 2)
print(c)

ndarray数组对象的切片操作

# 数组切片的参数与列表切片基本一致
# step+（步长+）：默认从头向后切
# step-（步长-）：默认从后向前切
array[start:end:step, ...]

import numpy as np
a = np.arange(1, 10)
print(a)
print(a[:3])
print(a[3:6])
print(a[6:])
print(a[::-1])
print(a[:-4:-1])
print(a[-4:-7:-1])
print(a[-7::-1])
print(a[::])
print(a[::3])
print(a[1::3])
print(a[2::3])

多维数组的切片操作

a = a.reshape(3, 3)
print(a)
print(a[:2, :2])
print(a[:, 2])  # 所有行的最后一列

ndarray数组的掩码操作

import numpy as np

a = np.array([43, 70, 34, 76, 34, 57, 23, 45])
print(a)
b = a >= 60
print(b)
print(a[b])
indices = [1, 7, 6, 5, 4, 3, 2, 0]
print(a[indices])

# 输出100以内3与7的倍数
b = np.arange(100)
mask = (b % 3 == 0) & (b % 7 == 0)
print(b[mask])

多维数组的组合与拆分

垂直方向的操作：

# 在垂直方向a与b摞起来
np.vstack((a, b))
# 把c数组在垂直方向拆成2个
np.vsplit(c, 2)

水平方向的操作：

# 在水平方向a与b并起来
np.hstack((a, b))
# 把c数组在水平方向拆成2个
np.hsplit(c, 2)

深度方向的操作：

# 在深度方向a与b并起来
np.dstack((a, b))
# 把c数组在深度方向拆成2个
np.dsplit(c, 2)

多维数组组合与拆分的通用方法：

# 实现a与b数组的组合操作
# axis：轴向
# 如果待组合数组都是二维数组
# axis：0 垂直方向组合
# axis：1 水平方向组合
# 如果待组合数组都是三维数组
# axis：0 垂直方向组合
# axis：1 水平方向组合
# axis：2 深度方向组合
np.concatenate((a, b), axis=0)
# 把c数组拆成2分  axis为轴向
np.split(c, 2, axis=0)

长度不等的数组的组合：

# 长度不等的数组的组合
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9])
# 把b补成5个元素, 头部补0个，尾部补1个，
# 新增元素的默认值为-1
b = np.pad(
    b, pad_width=(3, 1), mode='constant',
    constant_values=-1)
print(b)

简单的一维数组的组合方案

# 把a与b两个一维数组摞在一起成两行
np.row_stack((a, b))
# 把a与b两个一维数组组合在一起成两列
np.column_stack((a, b))

ndarray对象的其他常用属性

ndim 维数（n维数组的n）
itemsize 元素所占用字节数
nbytes 数组所占用的总字节数
real 复数的实部
imag 复数的虚部
T 二维数组的转置数组
flat 多维数组的扁平迭代器

图片资料

数据分析DAY02

matplotlib概述

matplotlib是python的一个绘图库。使用它可以很方便的绘制出版质量级别的图形。

matplotlib的基本功能

基本绘图
1. 设置线型、线宽、颜色
2. 设置坐标轴范围及刻度
3. 设置坐标轴属性
4. 图例
5. 绘制特殊点
6. 备注
高级图形绘制
1. 绘制子图
2. 刻度定位器、刻度网格线
3. 半对数坐标
4. 散点图
5. 填充图
6. 条形图、饼状图
7. 等高线图
8. 热成像图
9. 极坐标系
10. 三维曲面
11. 简单动画

matplotlib基本功能详解

基本绘图

绘图核心API

绘制一条线：

import numpy as np
import matplotlib.pyplot as mp
# xarray: x坐标值数组
# yarray: y坐标值数组
mp.plot(xarray, yarray)
# 显示图像
mp.show()

绘制水平线与垂直线：

import matplotlib.pyplot as mp
# 绘制垂直线 给出x坐标值，给出y的范围即可
mp.vlines(val, ymin, ymax)
# 绘制水平线 给出y坐标值，给出x的范围即可
mp.hlines(val, xmin, xmax)

案例：绘制一条正弦曲线

# 从-π ~ π拆1000个点
x = np.linspace(-np.pi, np.pi, 1000)
sinx = np.sin(x)
# 绘制[-π, π]的1/2*cos(x)曲线
cosx = np.cos(x) / 2

mp.plot(x, sinx)
mp.plot(x, cosx)
mp.show()

线型、线宽、颜色

# linestyle: 线型  "-"  "--"  ":" 
# linewidth: 线宽
# color: 颜色  
#    常见颜色的英文单词  red  blue  green...
#    常见英文单词的首字母  r   b ..
#    #abcdab 
#    (1, 1, 1)  或  (0.8, 0.6, 1, 0.5)
# alpha: 透明度
mp.plot(x, y, linestyle='', linewidth=1, color='', alpha=0.5)

设置坐标刻度

把横坐标的刻度显示为 -π, -π/2, 0, π/2, π

# x_val_list: x轴刻度值列表
# x_text_list: x轴相应值的刻度文本列表
mp.xticks(x_val_list, x_text_list)
# y_val_list: y轴刻度值列表
# y_text_list: y轴相应值的刻度文本列表
mp.yticks(y_val_list, y_text_list)

matplotlib支持的latex排版语法

r'$a^2 + b^2 = c^2$'  
r'$[a_1 + a_2 + ... + a_n]$'  
r'$\frac{\pi}{2}$'

$$
a^2 + b^2 = c^2 \quad \quad
[a_1 + a_2 + … + a_n] \quad\quad
\frac{\pi}{2}
$$

设置坐标轴

# 获取当前坐标轴  'left' 'right' 'top' 'bottom'
ax = mp.gca()
# 获取其中某一个坐标轴
ax_left = ax.spines['left']
# 设置坐标轴的位置
# type: 移动坐标轴的参照类型 一般为'data' （以数据坐标轴作参照）
# val: 参照值
ax_left.set_position((type, val))
# 设置坐标轴的颜色
# color: 颜色值   若为'none',则隐藏当前坐标轴
ax_left.set_color(color)

图例

显示两条曲线的图例。

# 在plot绘制曲线时，给出label描述当前曲线的标签。
mp.plot(....,  label='')
mp.legend(loc='best')

特殊点

# xarray, yarray 散点的坐标
mp.scatter(xarray, yarray, 
          marker='',    # 点型
          s=60,            # 大小
          edgecolor='',    # 边缘色
          facecolor='',    # 填充色
          zorder=3        # 绘制图层编号（越大越靠顶层）
)

备注

为特殊点添加备注。

mp.annotate(
    r'$....$',                     # 备注文本
    xycoords='data',            # 定位目标点使用的参照坐标系
    xy=(1, 2),                    # 目标点的坐标
    textcoords='offset points',    # 定位文本位置使用的坐标系
    xytext=(10, -30),            # 备注文本的位置坐标
    fontsize=14,                # 字体大小
    arrowprops=dict()            # 箭头的样式
)

arrowprops=dict(
    arrowstyle='->',            # 箭头样式
    connectionstyle='angle3'    # 箭头连接线的样式
)

设置坐标轴的范围

通过设置坐标轴的范围区间，可以显示图像中的一部分内容。

# 设置x轴的可视区域
mp.xlim(x_min, x_max)
# 设置y轴的可视区域
mp.ylim(y_min, y_max)

matplotlib图像窗口高级操作

绘制两个窗口：

# title：窗口的标题
# figsize： 窗口的大小
# dpi：像素密度
# facecolor：图表的背景色
mp.figure('title', figsize=(4,3), dpi=120, facecolor='')
mp.show()

mp.ffigure方法可以创建一个新的窗口，但是如果新窗口的title已经被创建过了，那么就把该title的旧窗口置为当前窗口，以后的绘图操作，都将针对当前窗口生效。

设置当前窗口的参数

mp.title('title', fontsize=16)    # 设置图表标题
mp.xlabel('x', fontsize=12)        # x轴的描述文本
mp.ylabel('y', fontsize=12)        # y轴的描述文本
mp.tick_params(labelsize=8)        # 设置刻度参数
mp.grid(linestyle=':')            # 绘制网格线
mp.tight_layout()                # 紧凑布局

窗口子图

矩阵式布局

绘制矩阵式布局子图的API：

mp.figure('xx', ...)
# rows: 行数
# columns: 列数
# num: 子图编号 
mp.subplot(rows, columns, num)
mp.subplot(3, 3, 5)  # 把3行3列的5号图设置为当前绘图区域
mp.subplot(3, 3, 1)  # 把3行3列的1号图设置为当前绘图区域
mp.show()

案例：绘制九宫格

import matplotlib.pyplot as mp

mp.figure('Subplot', facecolor='lightgray')

for i in range(1, 10):
    mp.subplot(3, 3, i)
    # 在窗前绘图区域写文字
    mp.text(0.5, 0.5, i, size=36, alpha=0.8,
            ha='center', va='center')
    mp.xticks([])
    mp.yticks([])

mp.tight_layout()
mp.show()

网格式布局

网格式布局支持单元格的合并。

绘制网格式子图相关API：

import matplotlib.gridspec as mg
mp.figure(..)
# 创建GridSpec对象，（3行3列）
gs = mg.GridSpec(3, 3)
# 把GridSpec中第一行的前两列合并为一个子图
mp.subplot(gs[0, :2])
mp.show()

案例：

import matplotlib.pyplot as mp
import matplotlib.gridspec as mg

mp.figure('Grid', facecolor='lightgray')

gs = mg.GridSpec(3, 3)

mp.subplot(gs[0, :2])
mp.text(0.5, 0.5, 1, ha='center', va='center',
        size=36)
mp.xticks([])
mp.yticks([])

mp.subplot(gs[:2, 2])
mp.text(0.5, 0.5, 2, ha='center', va='center',
        size=36)
mp.xticks([])
mp.yticks([])

mp.subplot(gs[1, 1])
mp.text(0.5, 0.5, 3, ha='center', va='center',
        size=36)
mp.xticks([])
mp.yticks([])

mp.subplot(gs[1:, 0])
mp.text(0.5, 0.5, 4, ha='center', va='center',
        size=36)
mp.xticks([])
mp.yticks([])

mp.subplot(gs[2, 1:])
mp.text(0.5, 0.5, 5, ha='center', va='center',
        size=36)
mp.xticks([])
mp.yticks([])


mp.tight_layout()
mp.show()

自由布局

自由布局相关API：

mp.figure()
# 设置图表的位置，需要给出左下角点坐标与宽、高即可
# 0.03: 左下角点的横坐标
# 0.04: 左下角点的纵坐标
# 0.94: 图表的宽度
# 0.92: 图表的高度
mp.axes([0.03, 0.04, 0.94, 0.92])
mp.text(....)
mp.show()

案例：

import matplotlib.pyplot as mp

mp.figure('FlowLayou', facecolor='lightgray')
mp.axes([0.1, 0.2, 0.5, 0.3])
mp.text(0.5, 0.5, 1, ha='center',
        va='center', size=36)
mp.xticks([])
mp.yticks([])

mp.axes([0.2, 0.6, 0.5, 0.3])
mp.text(0.5, 0.5, 2, ha='center',
        va='center', size=36)
mp.xticks([])
mp.yticks([])

mp.show()

刻度定位器

设置坐标轴刻度定位器相关的API：

# 获取当前坐标轴
ax = mp.gca()
# 设置横轴的主刻度定位器为多点定位器，主刻度间隔为1
ax.xaxis.set_major_locator(mp.MultipleLocator(1))
# 设置横轴的次刻度定位器为NullLocator()（将不会显示次刻度）
ax.xaxis.set_minor_locator( mp.NullLocator() )

案例：

import numpy as np
import matplotlib.pyplot as mp

locators = ['mp.NullLocator()',
            'mp.MaxNLocator(nbins=6)',
            'mp.MultipleLocator(1)',
            'mp.AutoLocator()',
            'mp.FixedLocator(locs=[0,2.5,7.5,10])']

mp.figure('Locator', facecolor='lightgray')

for i in range(len(locators)):
    mp.subplot(len(locators), 1, i + 1)
    ax = mp.gca()
    ax.spines['left'].set_color('none')
    ax.spines['top'].set_color('none')
    ax.spines['right'].set_color('none')
    mp.xlim(0, 10)
    mp.ylim(-1, 1)
    mp.yticks([])
    ax.spines['bottom'].set_position(('data', 0))
    # 设置x轴的主刻度定位器
    ax.xaxis.set_major_locator(eval(locators[i]))
    # 设置x轴的次刻度定位器
    ax.xaxis.set_minor_locator(mp.MultipleLocator(0.1))

mp.tight_layout()
mp.show()

刻度网格线

ax = mp.gca()
# 绘制刻度网格线
ax.grid(
    which='',    # 'major''minor' 绘制主刻度/次刻度网格线
    axis='',    # 'x'/'y'/'both' 绘制x轴/y轴的刻度网格线
    linewidth=1, 
    linestyle='',
    color='',
    alpha=0.5
)

案例：[1,10,100,1000,100,10,1]

import numpy as np
import matplotlib.pyplot as mp

y = np.array([1, 10, 100, 1000, 100, 10, 1])
mp.figure('GridLine', facecolor='lightgray')
mp.title('GridLine', fontsize=16)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
# 设置刻度定位器与刻度网格线
ax = mp.gca()
ax.xaxis.set_major_locator(
    mp.MultipleLocator(1))
ax.xaxis.set_minor_locator(
    mp.MultipleLocator(0.1))
ax.yaxis.set_major_locator(
    mp.MultipleLocator(250))
ax.yaxis.set_minor_locator(
    mp.MultipleLocator(50))
# 设置x与y轴的主刻度网格线
ax.grid(which='major', axis='both',
        linewidth=0.75, linestyle='-',
        color='orange')
# 设置x与y轴的次刻度网格线
ax.grid(which='minor', axis='both',
        linewidth=0.25, linestyle='-',
        color='orange')


mp.plot(y, 'o-', c='dodgerblue', label='plot')
mp.legend()
mp.show()

半对数坐标轴

y轴将会以指数方式递增。

mp.figure()
# mp.plot()改为mp.semilogy()方法绘制曲线即可
mp.semilogy(x, y, .....)
mp.show()

散点图

numpy.random模块提供了normal函数用于产生符合正态分布的随机数。

n = 100
# 产生n个符合正态分布规律的随机数
# 172：期望
# 20：标准差（标准差越小，越集中；标准差越大，越分散）
x = np.random.normal(172, 20, n)
y = np.random.normal(65, 10, n)

绘制散点图的相关API：

mp.scatter(
    x, y,
    marker='',
    s=60,
    edgecolor='',
    facecolor='',
    zorder=3
)

案例：

import numpy as np
import matplotlib.pyplot as mp

n = 500
x = np.random.normal(172, 20, n)
y = np.random.normal(65, 10, n)

mp.figure('Students', facecolor='lightgray')
mp.title('Students List', fontsize=16)
mp.xlabel('Height', fontsize=14)
mp.ylabel('Weight', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=':')
mp.scatter(x, y, marker='o', s=50,
           color='dodgerblue', alpha=0.5,
           zorder=3, label='Stduent')
mp.legend()
mp.show()

数据分析DAY03

散点图

散点图颜色映射的使用。

# d 为大于0的整数，这样计算越靠近中心的样本
# d的值越小，越靠近外周的样本d的值越大
d = (x - 172)**2 + (y - 65)**2
mp.scatter(x, y, marker='o', s=50, alpha=0.5,
           c=d, cmap='jet',
           zorder=3, label='Stduent')

填充图

以某种颜色自动填充两条曲线的闭合区域。

mp.fill_between(
    x, sin_x, cos_x,    # 通过这3个参数，给出两条曲线
    sin_x < cos_x,        # 填充条件，若为True时填充
    color='',
    alpha=0.5
)

绘制两条曲线：sin(x) cos(x/2)/2 [0, 8π]

import numpy as np
import matplotlib.pyplot as mp

n = 1000
x = np.linspace(0, 8 * np.pi, n)
sinx = np.sin(x)
cosx = np.cos(x / 2) / 2

mp.figure('Fill', facecolor='lightgray')
mp.title('Fill', fontsize=18)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(x, sinx, c='dodgerblue',
        label=r'$y=sin(x)$')
mp.plot(x, cosx, c='orangered',
        label=r'$y=\frac{1}{2}cos(\frac{x}{2})$')

mp.fill_between(
    x, sinx, cosx, sinx < cosx,
    color='dodgerblue', alpha=0.5)
mp.fill_between(
    x, sinx, cosx, sinx > cosx,
    color='orangered', alpha=0.5)

mp.legend()
mp.show()

条形图（柱状图）

mp.figure()
mp.bar(
    x,             # 水平坐标数组
    y,            # 每个柱子绘制的高度
    width,        # 每个柱子的宽度
    color='',    # 颜色
    label='',    # 标签名（供legend使用）
    alpha=0.5
)

案例：

import numpy as np
import matplotlib.pyplot as mp

apples = [9, 1, 23, 64, 89, 12, 36, 48, 93, 24, 85, 86]
oranges = [19, 23, 64, 92, 37, 46, 28, 54, 28, 46, 93, 43]

mp.figure('Bar', facecolor='lightgray')
mp.title('Bar', fontsize=18)
mp.xlabel('Date', fontsize=14)
mp.ylabel('Price', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=':')
x = np.arange(len(apples))
# 绘制条形图
mp.bar(x - 0.2, apples, 0.4, color='dodgerblue',
       label='Apple')
mp.bar(x + 0.2, oranges, 0.4, color='orangered',
       label='Orange')

mp.xticks(x, [
    'Jan', 'Feb', 'Mar', 'Apr', 'May',
    'Jun', 'Jul', 'Aug', 'Sep', 'Oct',
    'Nov', 'Dec'])

mp.legend()
mp.show()

饼状图

绘制饼状图相关API：

mp.pie(
    values,            # 值列表
    spaces,         # 扇形之间的间距列表
    labels,            # 标签列表
    colors,            # 颜色列表
    '%d%%',            # 标签所占比例的格式化字符串
    shadow=True,    # 是否添加阴影
    startangle=90,    # 起始角度
    radius=1        # 大饼半径
)

案例：

import matplotlib.pyplot as mp

# 整理数据
values = [26, 17, 21, 29, 11]
spaces = [0.05, 0.01, 0.01, 0.01, 0.01]
labels = ['Python', 'Javascript',
          'C++', 'Java', 'PHP']
colors = ['dodgerblue', 'orangered',
          'limegreen', 'violet', 'gold']

mp.figure('Pie', facecolor='lightgray')
mp.title('Pie', fontsize=18)
mp.axis('equal')  # 设置等轴比例绘制
mp.pie(values, spaces, labels, colors,
       '%d%%', shadow=True, startangle=90,
       radius=1.2)
mp.show()

等高线图

整理等高线数据需要网格点坐标矩阵，也需要每个点的高度，所以等高线属于3D数学模型的范畴。

绘制等高线的相关API：

cntr = mp.contour(
    x, y,         # 网格点坐标
    z,             # z提供每个坐标点的高度
    8,             # 把整体的高度分8份
    colors='',    # 每条等高线的颜色
    linewidths=0.5    # 每条等高线的宽度
)
# 为等高线添加标签
mp.clabel(cntr, inline_spacing=1, fmt='%.1f', 
          fontsize=10)
# 为等高线每一部分设置颜色
mp.contourf(x, y, z, 8, cmap='jet')

案例：

import numpy as np
import matplotlib.pyplot as mp

n = 1000
# 获取网格点坐标矩阵数组
x, y = np.meshgrid(np.linspace(-3, 3, n),
                   np.linspace(-3, 3, n))

# 通过x与y计算每个坐标点的高度（编的）
z = (1 - x / 2 + x**5 + y**3) * \
    np.exp(-x**2 - y**2)

# 绘制等高线
mp.figure('Contour', facecolor='lightgray')
mp.title('Contour', fontsize=18)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=':')
cntr = mp.contour(x, y, z, 8, colors='black',
                  linewidths=0.5)
# 为等高线添加标签
mp.clabel(cntr, inline_spacing=1,
          fmt='%.1f', fontsize=10)
# 为等高线每一部分设置颜色
mp.contourf(x, y, z, 8, cmap='jet')
mp.show()

热成像图

# 把矩阵z图形化，使用cmap的颜色表示矩阵中每个元素的热度值
# origin：y轴方向
#     'lower'  y轴的0在下方
#     'upper'  y轴的0在上方
mp.imshow(z, cmap='jet', origin='lower')

案例：

import numpy as np
import matplotlib.pyplot as mp

n = 1000
# 获取网格点坐标矩阵数组
x, y = np.meshgrid(np.linspace(-3, 3, n),
                   np.linspace(-3, 3, n))

# 通过x与y计算每个坐标点的高度（编的）
z = (1 - x / 2 + x**5 + y**3) * \
    np.exp(-x**2 - y**2)

# 绘制
mp.figure('Hot', facecolor='lightgray')
mp.title('Hot', fontsize=18)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=':')
# 绘制热成像图
mp.imshow(z, cmap='jet', origin='lower')
mp.show()

3D图像的绘制

使用matplotlib绘制3D图像，需要先获取3D坐标轴，调用ax3d对象的方法绘制3D图形。

from mpl_toolkits.mplot3d import axes3d
ax3d = mp.gca(projection='3d')

ax3d.plot_wireframe()
ax3d.plot_surface()
ax3d.scatter()

3D线框图

ax3d.plot_wireframe(
    x, y, z,         # 网格点坐标矩阵，和每个点的z轴值
    rstride=30,        # 行跨距
    cstride=30,        # 列跨距
    linewidth=1,    # 线宽
    color=''        # 颜色
)

案例：

import numpy as np
import matplotlib.pyplot as mp
from mpl_toolkits.mplot3d import axes3d
n = 1000
# 获取网格点坐标矩阵数组
x, y = np.meshgrid(np.linspace(-3, 3, n),
                   np.linspace(-3, 3, n))
# 通过x与y计算每个坐标点的高度（编的）
z = (1 - x / 2 + x**5 + y**3) * \
    np.exp(-x**2 - y**2)

# 绘制
mp.figure('WireFrame', facecolor='lightgray')
ax = mp.gca(projection='3d')
ax.set_xlabel('x', fontsize=14)
ax.set_ylabel('y', fontsize=14)
ax.set_zlabel('z', fontsize=14)
ax.plot_wireframe(
    x, y, z, rstride=30, cstride=30,
    color='dodgerblue', linewidth=1)

mp.show()

3D曲面图

ax3d.plot_surface(
    x, y, z, 
    rstride=30, cstride=30, cmap='jet'
)

3D散点图

ax3d.scatter(
    x, y, z,    # 样本点的(x,y,z)坐标数组
    marker='',    # 点型
    s=10,        # 点的大小
    zorder='',
    color='',
    edgecolor='',
    facecolor='',
    c=d,        # 颜色值数组
    cmap=''        # 所使用的颜色映射
)

案例：

import numpy as np
import matplotlib.pyplot as mp
from mpl_toolkits.mplot3d import axes3d
n = 500
x = np.random.normal(0, 1, n)
y = np.random.normal(0, 1, n)
z = np.random.normal(0, 1, n)

d = x**2 + y**2 + z**2
mp.figure('3D Scatter')
ax = mp.gca(projection='3d')
ax.scatter(x, y, z, s=30, alpha=0.5,
           c=d, cmap='jet')
mp.show()

极坐标系

某些情况下极坐标系适合显示与角度有关的图像，例如雷达。它可以处理极径ρ(rho)与极角θ(theta)之间的线性关系。

mp.gca(projecction='polar')
r = 0.8*t

案例：

import numpy as np
import matplotlib.pyplot as mp

t = np.linspace(0, 4 * np.pi, 1000)
r = 0.8 * t

x = np.linspace(0, 6 * np.pi, 1000)
y = 3 * np.sin(6 * x)
mp.figure('Polar')
mp.gca(projection='polar')
mp.plot(t, r)
mp.plot(x, y)
mp.show()

简单动画

动画即是在一段时间内快速连续重新绘制图像的过程。

matplotlib提供了方法可以实现简单动画，定义一个update函数用于即时更新图像。

import matplotlib.animation as ma
# update中编写更新图像的代码，每隔一段时间将会自动调用
def update(number):
    pass
# 每隔30毫秒，执行一次update函数，针对当前窗体绘制图像
anim = ma.FuncAnimation(mp.gcf(), update, interval=30)
mp.show()

案例：

import numpy as np
import matplotlib.pyplot as mp
import matplotlib.animation as ma

n = 100
# 初始化100个球，每个属性值都是0.0
balls = np.zeros(n, dtype=[
                ('position', float, 2),
                ('size', float, 1),
                ('growth', float, 1),
                ('color', float, 4)])

# 初始化所有球的属性
# 为每个球的position属性赋值
# 获取n行2列的二维数组，每个元素从0到1平均分布
balls['position'] =    \
    np.random.uniform(0, 1, (n, 2))
balls['size'] = \
    np.random.uniform(40, 70, 1)
balls['growth'] = \
    np.random.uniform(10, 20, 1)
balls['color'] = \
    np.random.uniform(0, 1, (n, 4))


mp.figure('Animation')
# 把100个球都先画出来
sc = mp.scatter(balls['position'][:, 0],
                balls['position'][:, 1],
                balls['size'],
                color=balls['color'],
                alpha=0.5)


def update(number):
    # 让每个球都不断增大
    balls['size'] += balls['growth']
    # 每次让一个气泡破裂， 随机生成一个新的
    boom_ind = number % n
    balls[boom_ind]['size'] = \
        np.random.uniform(40, 70, 1)
    balls[boom_ind]['position'] = \
        np.random.uniform(0, 1, (1, 2))

    # 重新设置属性
    sc.set_sizes(balls['size'])
    sc.set_offsets(balls['position'])

anim = ma.FuncAnimation(
    mp.gcf(), update,  interval=30)
mp.show()

使用生成器函数提供数据，实现动画绘制

很多情况下，绘制动画的参数是动态获取的，matplotlib支持定义生成生成器函数，用于生成数据，然后自动把数据交给update函数更新图像：

def update(data):
    pass·

def generator():
    yield data
# 每10毫秒，调用生成器，把生成器yield出的数据交给update函数执行
anim = ma.FuncAnimation(mp.gcf(), update, generator,
                           interval=10)

案例：

import numpy as np
import matplotlib.pyplot as mp
import matplotlib.animation as ma

mp.figure("Signal", facecolor='lightgray')
mp.title("Signal", fontsize=18)
mp.xlim(0, 10)
mp.ylim(-3, 3)
mp.grid(linestyle=':')
pl = mp.plot([], [], color='dodgerblue',
             label='Signal')[0]
x = 0
def update(data):
    t, v = data
    x, y = pl.get_data()
    x = np.append(x, t)
    y = np.append(y, v)
    pl.set_data(x, y)
    # 动态的移动可视区域
    if x[-1] > 10:
        mp.xlim(x[-1] - 10, x[-1])

def generator():
    global x
    y = np.sin(2 * np.pi * x) * \
        np.exp(np.sin(0.2 * np.pi * x))
    yield x, y
    x += 0.05

# 使用动画向plot绘制的曲线中添加x，y坐标
anim = ma.FuncAnimation(
    mp.gcf(), update, generator, interval=20)

mp.legend()
mp.show()

案例应用

加载文件

numpy提供了加载逻辑上可以被解释为二维数组的文本文件：

np.loadtxt(
    '../data/aapl.csv',        # 文本路径
    delimiter=',',            # 列之间的分隔符
    usecols=(0,1,2),        # 读取哪些列
    unpack=False,            # 是否拆包 False将会返回二维数组
    dtype="U10, U10, f8",    # 元素的类型
    converters={1, func}    # 转换器函数字典
)

案例：绘制收盘价折线图

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
loadtxt.py 加载文件
"""
import numpy as np
import matplotlib.pyplot as mp
import datetime as dt
import matplotlib.dates as md


def dmy2ymd(dmy):
    # 28-10-2011  -> 2011-10-28
    # 把字符串转成日期对象，在按照需要的
    # 格式把日期对象转成字符串
    dmy = str(dmy, encoding='utf-8')
    time = dt.datetime.strptime(
        dmy, '%d-%m-%Y').date()
    t = time.strftime('%Y-%m-%d')
    print(t)
    return t

dates, opening_prices, highest_prices, \
    lowest_prices, closing_prices = \
    np.loadtxt('../da_data/aapl.csv',
               delimiter=',',
               usecols=(1, 3, 4, 5, 6),
               unpack=True,
               dtype='M8[D], f8, f8, f8, f8',
               converters={1: dmy2ymd})

# 把收盘价绘制为折线图
mp.figure('APPL', facecolor='lightgray')
mp.title('APPL', fontsize=18)
mp.xlabel('Dates', fontsize=14)
mp.ylabel('Prices', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=':')
# 为了x轴显示日期，设置刻度定位器
ax = mp.gca()
ax.xaxis.set_major_locator(
    md.WeekdayLocator(byweekday=md.MO))
ax.xaxis.set_major_formatter(
    md.DateFormatter('%d %b %Y'))
ax.xaxis.set_minor_locator(md.DayLocator())

# 修改dates的数据类型 从M8[D] ->
dates = dates.astype(md.datetime.datetime)

mp.plot(dates, closing_prices,
        color='dodgerblue', linestyle='--',
        linewidth=3, label='Closing Prices')
mp.legend()
mp.tight_layout()
mp.gcf().autofmt_xdate()
mp.show()

数据分析DAY04

加载文件

案例：绘制AAPL的K线图.

# 绘制每一天的蜡烛图：
rise = closing_prices >= opening_prices
# 整理所有的填充色
color = np.array(
    [('white' if x else 'limegreen')
     for x in rise])
# 整理所有的边缘色
edgecolor = np.array(
    [('red' if x else 'limegreen')
     for x in rise])

# 绘制影线
mp.bar(dates, highest_prices - lowest_prices,
       0.1, lowest_prices, color=edgecolor)

# 绘制实体
mp.bar(dates, closing_prices - opening_prices,
       0.8, opening_prices, color=color,
       edgecolor=edgecolor)

算数平均数

算数平均数表示对真值的无偏估计。

A = [a1, a2, a3, ... an]
mean = (a1 + a2 + a3 + ... + an) / n

numpy提供了方法求一组数据的均值：

# 1.
m = np.mean(array)
# 2.
m = array.mean()

案例：

# 绘制收盘价的均值水平线
mean = np.mean(closing_prices)
mp.hlines(mean, dates[0], dates[-1],
          color='orangered', linewidth=2,
          label='average(closing_prices)')

加权平均值

S = [s1, s2, s3, .. sn]
W = [w1, w2, w3, .. wn]
# 加权均值：
aw = (s1w1 + s2w2 + ... sn*wn)/(w1+w2+...wn)

最值

np.max(a)    # 返回一组数据的最大值
np.min(a)    # 返回一组数据的最小值
np.ptp(a)    # 返回一组数据的极差 (max-min)

np.argmax(a)    # 返回一组数据最大值的下标
np.argmin(a)     # 返回一组数据最小值的下标

案例：

# 求最值
max_highest = np.max(highest_prices)
min_lowest = np.min(lowest_prices)
print(max_highest - min_lowest)

# 求最高点、最低点的日期
print(dates[np.argmax(highest_prices)])
print(dates[np.argmin(lowest_prices)])

# 将两个同维数组中对应位置的元素进行比较，把最大的保留。返回新数组
np.maximum(a, b)
# 将两个同维数组中对应位置的元素进行比较，把最小的保留。返回新数组
np.minimum(a, b)

案例：

a = np.arange(1, 10).reshape(3, 3)
b = np.arange(1, 10)[::-1].reshape(3, 3)
print(np.maximum(a, b))
print(np.minimum(a, b))

中位数

将多个样本按照大小排序，取中间位置的元素。

如果样本个数为奇数，则为中间的数字。如果样本个数为偶数，则求两个中间数字的均值即可。

a = [..排序过的数组..]
median = (a[(a.size-1)/2] + a[a.size/2]) / 2

# 对数组排序
sc = np.msort(closing_prices)
m2 = (sc[int((sc.size - 1) / 2)] +
      sc[int(sc.size / 2)]) / 2
print(median, m2)

# 求a数组的中位数
np.median(a)

标准差

一组数据的标准差可以衡量这组数据的离散程度。标准差越大代表这组样本越散。反之，则代表这组样本越集中。

样本：S = [s1, s2, s3 …, sn]

均值：m = (s1 + s2 + … + sn) / n

离差：D = [s1-m, s2-m, s3-m …. , sn-m]

离差方： D² -> [q1, q2, q3 …. qn] 每个元素求平方

总体方差：(q1+q2+…qn) / n 离差方数组的均值

总体标准差：np.sqrt(总体方差)

样本方差： (q1+q2+…qn) / n-1 离差方数组的均值

样本标准差：np.sqrt(样本方差)

# 求array数组的总体标准差
np.std(array)
# 求array数组的样本标准差，分母的修正值：n-1
np.std(array, ddof=1)

案例：

import numpy as np

closing_prices = np.loadtxt(
    '../da_data/aapl.csv', delimiter=',',
    usecols=(6,), unpack=True)
# 求这组数据的标准差
m = np.mean(closing_prices)
s = np.mean((closing_prices - m) ** 2)
std = np.sqrt(s)
print(std)

std = np.std(closing_prices)
print(std)

std = np.std(closing_prices, ddof=1)
print(std)

案例应用

案例：轴向统计

统一每个周一、周二、.. 周五的收盘价的平均值，并放入一个数组。

import numpy as np
import datetime as dt


def dmy2day(dmy):
    # 日月年字符串转成星期
    dmy = str(dmy, encoding='utf-8')
    date = dt.datetime.strptime(
        dmy, '%d-%m-%Y').date()
    wday = date.weekday()
    return wday

wdays, closing_prices = np.loadtxt(
    '../da_data/aapl.csv',
    delimiter=',', usecols=(1, 6),
    unpack=True, converters={1: dmy2day})

ave_closing_prices = np.zeros(5)
for wday in range(ave_closing_prices.size):
    ave_closing_prices[wday] = \
        np.mean(closing_prices[wdays == wday])

print(ave_closing_prices)

案例：统一每个周一、周二、.. 周五的收盘价的平均值，最大值，最小值，并放入一个数组。

numpy提供了一个方法可以方便的实现二维数据的轴向汇总。

def func(data):
    pass
# 根据轴向的设定，把每一行（列）依次传递给func函数实现统计业务
# func： 处理函数
# axis： 轴向  0  1
# array： 原二维数组
np.apply_along_axis(func, axis, array)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo08_axis.py  轴向统计
"""
import numpy as np
import datetime as dt


def dmy2day(dmy):
    # 日月年字符串转成星期
    dmy = str(dmy, encoding='utf-8')
    date = dt.datetime.strptime(
        dmy, '%d-%m-%Y').date()
    wday = date.weekday()
    return wday

wdays, closing_prices = np.loadtxt(
    '../da_data/aapl.csv',
    delimiter=',', usecols=(1, 6),
    unpack=True, converters={1: dmy2day})

# 第一个周一的下标， 最后一个周五的下标
# np.where方法返回一个元组，数据存储在[0]中
first_mon = np.where(wdays == 0)[0][0]
last_fri = np.where(wdays == 4)[0][-1]
wdays = wdays[first_mon: last_fri + 1]

indices = np.arange(first_mon, last_fri + 1)
# 整理这些数据为一个二维数组
mon_inds = indices[wdays == 0]
tue_inds = indices[wdays == 1]
wen_inds = indices[wdays == 2]
thu_inds = indices[wdays == 3]
fri_inds = indices[wdays == 4]

# 把不足6个元素的数组补齐
mon_inds = np.pad(
    mon_inds, pad_width=(0, 1),
    mode='constant', constant_values=-1)

# 把这些数据在垂直方向摞在一起，执行轴向汇总
data = np.vstack(
    (mon_inds, tue_inds, wen_inds,
     thu_inds, fri_inds))

print(data)


def func(row):
    row = row[row != -1]
    return np.max(closing_prices[row]), \
        np.min(closing_prices[row]),    \
        np.mean(closing_prices[row])

r = np.apply_along_axis(func, 1, data)
print(r)

案例：移动平均线

收盘价5日平均线：从第五天开始，每天计算最近5天的收盘价的平均值所构成的一条线。

移动均线的算法：

(a + b + c + d + e) / 5     均线的第1个点
(b + c + d + e + f) / 5     均线的第2个点
....

案例：绘制5日均线图

# 求5日均线
sma5 = np.zeros(closing_prices.size - 4)
for i in range(sma5.size):
    sma5[i] = closing_prices[i:i + 5].mean()

mp.plot(dates[4:], sma5, color='orangered',
        linewidth=2,
        linestyle='-', label='SMA-5')

卷积运算

a = [1, 2, 3, 4, 5] 原数组

b = [8, 7, 6] 卷积核数组

使用b数组作为卷积核对a数组执行卷积运算的过程如下：

                44    65    86            - 有效卷积(valid)
            23    44    65    86    59        - 同维卷积(same)
        8    23    44    65    86    59    30    - 完全卷积(full)
0    0    1    2    3    4    5    0    0
6    7    8
    6    7    8
        6    7    8
            6    7    8
                6    7    8
                    6    7    8
                        6    7    8

卷积运算相关API：

# a: 原数组
# b: 卷积核数组
# 卷积类型： 
#     'valid'    有效卷积
#     'same'    同维卷积
#     'full'    完全卷积
np.convole(a, b, 卷积类型)

a b c d e f g h i j k l ....
b = [1/5, 1/5, 1/5, 1/5, 1/5]

案例：使用卷积实现5日均线

# 使用卷积 实现5日均线
core = np.ones(5) / 5
sma52 = np.convolve(closing_prices, core,
                    'valid')
mp.plot(dates[4:], sma52, color='orangered',
        linewidth=7, alpha=0.3,
        linestyle='-', label='SMA-52')

# 使用卷积 实现10日均线
core = np.ones(10) / 10
sma510 = np.convolve(closing_prices, core,
                     'valid')
mp.plot(dates[9:], sma510, color='limegreen',
        linewidth=2, alpha=0.8,
        linestyle='-', label='SMA-10')

加权卷积

5日均线中计算的平均值为算数平均值，但是每个样本应该有不同的权重。按照时间加权均值的算法，时间越晚权重越高。所以可以选出一组卷积核，使得在卷积运算中实现加权卷积。

# 实现加权5日均线
weights = np.exp(np.linspace(-1, 0, 5))
weights = weights[::-1]
# 处理卷积核数组，使得卷积核中元素之和为1
weights /= weights.sum()
sma53 = np.convolve(closing_prices,
                    weights, 'valid')
mp.plot(dates[4:], sma53,
        linewidth=2, alpha=0.8,
        linestyle='-', label='SMA-53')

布林带

布林带由三条线组成：

中轨：移动均线

上轨：中轨 + 2*5日标准差

下轨：中轨 - 2*5日标准差

布林带收窄代表股价趋于稳定，布林带张开代表有较大的波动空间。

数据分析DAY05

案例应用

布林带

布林带由三条线组成：

中轨：移动均线

上轨：中轨 + 2*5日标准差

下轨：中轨 - 2*5日标准差

布林带收窄代表股价趋于稳定，布林带张开代表有较大的波动空间。

案例：

# 绘制5日均线的布林带
stds = np.zeros(sma53.size)
for i in range(stds.size):
    stds[i] = closing_prices[i:i + 5].std()

# 计算上轨和下轨
lowers = sma53 - 2 * stds
uppers = sma53 + 2 * stds

mp.plot(dates[4:], lowers, color='gold',
        linewidth=1, label='Lower Line')
mp.plot(dates[4:], uppers, color='gold',
        linewidth=1, label='Upper Line')

线性预测

若想使用线性方程预测业务结果，则必须首先假设数据样本符合线性规律。这样预测的结果才会相对准确。

假设这组数据符合线性规律：

a    b    c    d    e    f    ?

ax + by + cz = d
bx + cy + dz = e
cx + dy + ez = f
...
dx + ey + fz = ?

基于矩阵运算解方程组：
$$
\left[
\begin{matrix}
a & b & c \
b & c & d \
c & d & e \
\end{matrix}
\right]
\times
\left[
\begin{matrix}
x \
y \
z \
\end{matrix}
\right]
=
\left[
\begin{matrix}
d \
e \
f \
\end{matrix}
\right] \ \
\quad \quad a \quad \quad \quad \quad x \quad \quad \quad b
$$
numpy提供了方法可以求取上述矩阵运算中的x,y,z.

# a、b矩阵参数同上图所示
x = np.linalg.lstsq(a, b)[0]

案例：基于线性预测，预测下一天的收盘价。

# 整理三元一次方程组， 实现线性预测
N = 5
pred_prices = np.zeros(
    closing_prices.size - 2 * N + 1)
# 把每一个预测值都求出来
for i in range(pred_prices.size):
    # 整理矩阵a与矩阵b，使用lstsq方法求出x
    a = np.zeros((N, N))
    for j in range(N):
        a[j, ] = closing_prices[i + j:i + j + N]
    b = closing_prices[i + N:i + N * 2]
    x = np.linalg.lstsq(a, b)[0]
    # b与x执行点积运算
    pred_prices[i] = b.dot(x)

# 画出预测股价曲线
mp.plot(dates[2 * N:], pred_prices[:-1],
        'o-', color='orangered',
        linewidth=2, label='Predict Line')

print(pred_prices[-1])

线性拟合

线性拟合可以寻求与一组散点走向趋势规律相适应的线性表达式方程。

[x1, y1]
[x2, y2]
[x3, y3]
..
[xn, yn]

若这组点符合线性规律，那么就可以使用二元线性方程拟合这组数据。

kx + b = y
kx1 + b = y1
kx2 + b = y2
kx3 + b = y3
...
kxn + b = yn

如何求k与b：
$$
\left[
\begin{matrix}
x1 & 1 \
x2 & 1 \
x3 & 1 \
xn & 1 \
\end{matrix}
\right]
\times
\left[
\begin{matrix}
k \
b \
\end{matrix}
\right]
=
\left[
\begin{matrix}
y1 \
y2 \
y3 \
yn \
\end{matrix}
\right]
$$
案例：利用线性拟合画出股价的趋势线。

股价的趋势线 = (最高价+最低价+收盘价) / 3

# 计算所有的趋势点
trend_points = (
    highest_prices + lowest_prices +
    closing_prices) / 3

mp.plot(dates, trend_points, color='dodgerblue',
        linewidth=2, linestyle=':',
        label='Trend Points')
# 一组x坐标
days = dates.astype('M8[D]').astype(int)
a = np.column_stack((days, np.ones(days.size)))
b = trend_points
# 通过a与b求x
x = np.linalg.lstsq(a, b)[0]
# 求该直线的所有x对应的函数值
y = x[0] * days + x[1]
mp.plot(dates, y, color='orangered',
        linestyle='--', linewidth=3,
        label='Trend Line')

# 绘制顶部压力线  底部支撑线
spreads = highest_prices - lowest_prices
resistance_points = y + spreads
support_points = y - spreads

mp.plot(dates, resistance_points,
        color='red', linewidth=2,
        label='Resistance Line')
mp.plot(dates, support_points,
        color='limegreen', linewidth=2,
        label='Support Line')

数组的裁剪、压缩、累乘

数组的裁剪

# clip方法将会把数组中所有小于下限的元素改为下限值。把所有大于上限
# 的元素改为上限值。
ndarray.clip(min=下限, max=上限)

数组的压缩

# 返回由调用数组中满足条件的元素组成的新数组
ndarray.compress(条件)
a.compress(a>5)

案例：

import numpy as np

a = np.arange(1, 10).reshape(3, 3)
b = a.clip(min=3, max=7)
print(b)

a = a.reshape(9, )
c = a.compress(
    np.all([a > 3, a < 8], axis=0))
print(c)

协方差、相关性系数

通过两组统计数据计算而得的协方差可以评估这两组统计数据的相似程度。

样本：

A = [a1, a2, a3 ... an]
B = [b1, b2, b3 ... bn]

平均值：

ave_A = (a1 + a2 + a3 + ... an) / n
ave_B = (b1 + b2 + b3 + ... bn) / n

离差：

dev_A = [a1, a2 .. an] - ave_A
dev_B = [b1, b2 .. bn] - ave_B

协方差：

cov_ab = np.mean(dev_A * dev_B)

协方差可以简单反映两组统计样本的相关性：值为正，则为正相关；值为负则为负相关。绝对值越大相关性越强。

案例：计算bhp与vale两只股票的协方差。

# 计算两组样本的协方差
# 两组样本的均值
ave_bhp = bhp_closing_prices.sum()    \
    / bhp_closing_prices.size
ave_vale = vale_closing_prices.sum()    \
    / vale_closing_prices.size
# 两组样本的离差
dev_bhp = bhp_closing_prices - ave_bhp
dev_vale = vale_closing_prices - ave_vale
# 两组样本的协方差
cov_array = dev_bhp * dev_vale
cov = cov_array.sum() / cov_array.size
print('COV:', cov)

相关性系数

协方差除以两组统计样本的标准差的乘积是一个[-1, 1]之间的数字。该结果成为两组统计样本的相关性系数。

通过相关性系数可以分析两组数据的相关程度：

若相关系数越接近于1， 则表示两组样本正相关程度越强。
若相关系数越接近于-1， 则表示两组样本负相关程度越强。
若相关系数越接近于0， 则表示两组样本没啥关系。

numpy提供了求相关性系数的API：

np.corrcoef(a, b)

该方法将会返回相关性矩阵:
$$
\left[ \begin{matrix}
a与a的相关性系数 & a与b的相关性系数 \
b与a的相关性系数 & b与b的相关性系数 \
\end{matrix}\right]
$$
该矩阵的算法：
$$
\left[ \begin{matrix}
\frac{a与a的协方差}{std(a) std(a)} &
\frac{a与b的协方差}{std(a) std(b)} \
\frac{b与a的协方差}{std(b) std(a)} &
\frac{b与b的协方差}{std(b) std(b)} \
\end{matrix}\right]
$$
求a与b的协方差矩阵：

np.cov(a, b)   # 返回的协方差矩阵即是相关性矩阵的分子矩阵

# 输出两组数据的协方差矩阵
print(np.cov(bhp_closing_prices, vale_closing_prices))

多项式拟合

多项式的一般形式：
$$
y = p_0x^n + p_1x^{n-1} + p_2x^{n-2} + … + p_n
$$
多项式拟合的目的是为了找到一组p₀~p_n , 使得拟合方程尽可能的域实际样本数据相符合（误差最小）。

在numpy中，一组p₀~p_n即可以表示一个多项式。

# 根据一组离散样本，给出最高次幂，求出拟合多项式结果的系数数组。
# P = [p0, p1 ... pn]
P = np.polyfit(x, y, 最高次幂)

# 给出自变量x，该方法将把x代入多项式，求出函数值。
Y = np.polyval(P, x) 

# 求多项式函数的导函数  （Q = [q1, q2 ... qm]）
Q = np.polyder(P)

# 求多项式的根（y等于0是x的取值）
x = np.roots(P)

案例：使用多项式函数拟合两只股票bhp、vale收盘价的差价函数。

# 修改dates的数据类型 从M8[D] ->
dates = dates.astype(md.datetime.datetime)

# 求出两支股票的差价
diff_prices = bhp_closing_prices - vale_closing_prices
mp.plot(dates, diff_prices, color='dodgerblue',
        linewidth=2, linestyle='--',
        label='DIFF LINE')

# 多项式拟合
days = dates.astype('M8[D]').astype(int)
P = np.polyfit(days, diff_prices, 4)
y = np.polyval(P, days)
mp.plot(dates, y, color='orangered',
        linestyle='-', linewidth=2,
        label='Polyfit Line')

# 多项式求导，求导函数的根从而获取驻点坐标
Q = np.polyder(P)
xs = np.roots(Q)         # 驻点的x坐标
print(xs)
ys = np.polyval(P, xs)  # 驻点的y坐标
xs_date = np.floor(xs).astype(
    'M8[D]').astype(md.datetime.datetime)
mp.scatter(xs_date, ys, color='limegreen',
           s=60, marker='D', label='Points',
           zorder=3)

mp.legend()
mp.tight_layout()
mp.gcf().autofmt_xdate()
mp.show()

数据分析DAY06

数据平滑

数据平滑处理通常包含有降噪、拟合等操作。降噪的功能在于去除额外的影响因素，拟合的目的在于数据数学模型化，可以通过更多数学工具识别曲线的特征。

符号数组

np.sign()函数可以把样本数组编程对应的符号数组，正数变为1，负数变为-1，0还是0。

array = np.sign(原数组)

OBV能量潮

成交量可以反映市场对某只股票的人气，成交量是一支股票上涨的能量。一支股票上涨往往需要较大的成交量，而下跌时则不然。

若相比上一天的收盘价上涨，则为正成交量；若相比上一天的收盘价下跌，则为负成交量。正成交量为红色柱状图，负成交量为绿色柱状图。

案例：绘制OBV柱状图

# 绘制OBV能量潮
# 从第二天开始，减去前一天的收盘价得到新数组
diff_prices = np.diff(closing_prices)
# 对diff_prices符号化
sign_prices = np.sign(diff_prices)
obvs = volumns[1:] * sign_prices
# 正成交量为红色  负成交量为红色
dates = dates[1:]
mp.bar(dates[obvs >= 0], obvs[obvs >= 0], 0.9,
       color='red', label='OBV+')
mp.bar(dates[obvs < 0], -obvs[obvs < 0], 0.9,
       color='green', label='OBV-')

数据处理函数

# 针对原数组的每个元素做条件判断，取相应的值
# 条件序列：[array<60, array>=60]
# 取值序列：[-1, 1]
ary = np.piecewise(原数组，条件序列，取值序列)

案例：

sign_prices = np.piecewise(
    diff_prices,
    [diff_prices < 0, diff_prices == 0,
     diff_prices > 0],
    [-1, 0, 1])

函数矢量化

把一个只能处理标量的函数矢量化后，将可以直接处理数组参数。

numpy提供了vectorize函数，可以把处理标量的函数矢量化，返回一个新的函数（矢量函数），该矢量函数可以直接处理ndarray数组。

import numpy as np
import math as m


def foo(a, b):
    return m.sqrt(a**2 + b**2)

print(foo(3, 4))
a = np.array([3, 4, 5])
b = np.array([4, 5, 6])

# 调用vectorize方法把foo函数矢量化
vec_foo = np.vectorize(foo)
print(vec_foo(a, b))

numpy还提供了frompyfunc函数，也可以完成与vectorize相同的功能。

# 调用frompyfunc函数把foo函数矢量化
# 矢量化foo函数，接收2个参数，返回1个结果
fpf_foo = np.frompyfunc(foo, 2, 1)
print(fpf_foo(a, b))

矩阵对象

numpy中的matrix类型的对象即是矩阵对象。该类继承自numpy.ndarray所以任何针对多维数组的操作，对矩阵对象同样有效。作为子类，矩阵也结合了自身的特点，做了必要的扩充，比如：乘法计算、求逆等。

矩阵对象的创建

# 创建矩阵对象
# ary：任何可以被解释为矩阵的二维容器
# copy：是否复制数据（默认为True）
numpy.matrix(ary, copy=True)

# 等价于 numpy.matrix(ary, copy=False)
# 创建出的矩阵对象与原数组数据共享
numpy.mat(任何可以被解释为矩阵的二维容器)

# 矩阵字符串拼块规则：'1 2 3; 4 5 4; 6 4 3'
numpy.mat(矩阵字符串拼块规则)

矩阵的乘法(点积)

print('-' * 45)
a = np.mat('1 2 3; 5 6 4; 3 4 5')
b = np.mat('1 2 3; 5 6 4; 3 4 5')
print(a * b)
print(a.dot(b))

矩阵的逆

一个矩阵乘以该矩阵的逆矩阵得到的结果为一个单位矩阵(E)。一个单位矩阵对角线上的元素都是1，其他元素都为0。

#A矩阵的逆矩阵：
A.I
np.linalg.inv(A)

print('-' * 45)
a = np.mat(np.arange(1, 10).reshape(3, 3))
a = np.mat(np.array([1, 3, 5, 7, 9, 11, 13, 15, 17]).reshape(3, 3))
print(a)
print(a.I)
print(a * a.I)

案例：假设一帮人去旅游，去程做的是大巴，小孩票价3元，大人3.2元。共花了118.4；回来时坐火车，小孩票价3.5元，大人3.6元。共花了135.2. 求大人和小孩的人数。
$$
\left[
\begin{matrix}
3 & 3.2 \
3.5 & 3.6 \
\end{matrix}
\right]
\times
\left[
\begin{matrix}
x\
y \
\end{matrix}
\right]
=
\left[
\begin{matrix}
118.4 \
135.2 \
\end{matrix}
\right]
$$

prices = np.mat('3 3.2; 3.5 3.6')
totals = np.mat('118.4; 135.2')
persons = prices.I * totals
print(persons)

斐波那契数列

1 1 2 3 5 8 13 21 ….

     1 1    1 1        1 1        1 1
     1 0    1 0        1 0        1 0
----------------------------------------------
1 1     2 1    3 2        5 3        8 5
1 0     1 1    2 1        3 2        5 3

案例：

A = np.mat('1 1; 1 0')
n = 35
for i in range(1, n + 1):
    print((A**i)[0, 1])

numpy通用函数

加法通用函数

a = np.array([...])
np.add(a, b)            # a与b两数组对应位置元素相加
np.add.reduce(a)        # a所有元素的累加和
np.add.accumulate(a)    # 返回累加和的过程数组
np.add.outer([10,20,30], a)    # 列表与a数组的外和

案例：

a = np.arange(1, 10)
b = np.arange(1, 10)
print(np.add(a, b))
print(np.add.reduce(a))
print(np.add.accumulate(a))
print(np.add.outer([10, 20, 30], a))

乘法通用函数

ndarray.prod()        # 返回a数组中所有元素累乘的结果
np.prod(a)
ndarray.cumprod()    # 返回a数组中所有元素累乘的过程
np.cumprod(a)
np.outer([10, 20, 30], a)    # 外积

除法通用函数

np.true_divide(a, b)     #  a除以b
np.divide(a, b)            #  等价于上一个API
np.floor_divide(a, b)    #  地板除
np.ceil(a / b)            #  取天花板整数
np.trunc(a / b)            #  取截断整数
np.round(a / b)

案例：

import numpy as np

a = np.array([20, 20, -20, -20])
b = np.array([3, -3, 6, -6])

print(a / b)
print(np.divide(a, b))
print(np.floor_divide(a, b))
print(np.ceil(a / b))
print(np.trunc(a / b))
print(np.round(a / b))

位运算通用函数

位异或

c = a ^ b
c = np.bitwise_xor(a, b)

按位异或操作可以方便的判断两个数据是否同号。

import numpy as np

a = np.array([-1, 2, -3, -4, 5])
b = np.array([1, -2, 3, -4, 5])
print(a ^ b)
c = np.bitwise_xor(a, b)
print(c)
print(np.where(c < 0))

位与

e = a & b
e = np.bitwise_and(a, b)

利用位与运算计算某个数字是否是2的幂

# 1  2^0  00001        0    00000
# 2  2^1  00010        1    00001
# 4  2^2  00100        3    00011
# 8  2^3  01000        7    00111
# 16 2^4  10000        15    01111

print('-' * 45)
for i in range(1, 201):
    if i & (i - 1) == 0:
        print(i, end=' ')

位或

f = a | b
f = np.bitwise_or(a, b)

位反

g = ~a 
g = np.bitwise_not(a)

移位

<<    np.left_shift() 
>>    np.right_shift()

三角函数通用函数

np.sin() 、np.cos()等三角函数都属于矢量函数，都可以处理数组数据。
$$
y = Asin(wx + \phi) + C
$$
傅里叶定理

法国科学家傅里叶说过：所有周期性的函数都是有n多个不同振幅、不同频率、不同初相位的正弦函数叠加而成。
$$
y = 4\pi \times sin(x) \
y = \frac{4}{3}\pi \times sin(3x) \
y = \frac{4}{5}\pi \times sin(5x) \
y = \frac{4}{7}\pi \times sin(7x) \
… \
y = \frac{4}{2n-1}\pi \times sin((2n-1)x) \
$$
案例：

import numpy as np
import matplotlib.pyplot as mp

x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
y1 = 4 * np.pi * np.sin(x)
y2 = 4 * np.pi / 3 * np.sin(3 * x)
y3 = 4 * np.pi / 5 * np.sin(5 * x)
y4 = 4 * np.pi / 7 * np.sin(7 * x)

# 叠加1000个波形
n = 1000
y = 0
for i in range(1, n + 1):
    y += 4 * np.pi / (2 * i - 1) * \
        np.sin((2 * i - 1) * x)

mp.figure('Sin', facecolor='lightgray')
mp.title('Sin', fontsize=16)
mp.xlabel('X', fontsize=14)
mp.ylabel('Y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.plot(x, y1, label='sin(x)')
mp.plot(x, y2, label='sin(3x)')
mp.plot(x, y3, label='sin(5x)')
mp.plot(x, y4, label='sin(7x)')
mp.plot(x, y, label='total')
mp.legend()
mp.tight_layout()
mp.show()

矩阵的广义逆矩阵

如果一个方阵A与另一个方阵B的乘积是一个单位矩阵，A与B互为逆矩阵。

np.linalg.inv(A) -> B
A.I

将以上有关逆矩阵的定义推广到非方阵，则成为广义逆矩阵。

import numpy as np

a = np.mat('11 12 11 14;20 21 21 23;19 18 11 16')
print(a)
b = a.I
#b = np.linalg.inv(a)
print(b)
print(a * b)

解线性方程组

# 使用最小二乘法求取结果
x = np.linalg.lstsq(a, b)[0]
# 解线性方程组
x = np.linalg.solve(a, b)

$$
\begin{cases}
x-2y+z=0 \
2y-8z-8=0 \
-4x+5y+9z+9=0
\end{cases}
$$

整理矩阵：
$$
\left[\begin{matrix}
1 & -2 & 1\
0 & 2 & -8\
-4 & 5 & 9\
\end{matrix}\right]
\times
\left[\begin{matrix}
x\
y\
z\
\end{matrix}\right]
=
\left[\begin{matrix}
0\
8\
-9\
\end{matrix}\right]
$$

a = np.mat('1 -2 1; 0 2 -8; -4 5 9')
b = np.mat('0; 8; -9')
c = np.linalg.solve(a, b)
print(c)

特征值与特征向量

对于n阶方阵A，如果存在数a与非零n维列向量x，使得Ax=ax，则称a是矩阵A的一个特征值，x是矩阵A属于特征值a的特征向量。

提取特征值与特征向量的API：

# 已知n阶方阵A， 求特征值与特征向量
# eigvals：特征值数组
# eigvecs：对应的特征向量数组
eigvals, eigvecs = np.linalg.eig(A)
# 已知特征值与特征向量，求方阵
S = eigvecs * np.diag(eigvals) * eigvecs.I

数据分析DAY07

特征值与特征向量

对于n阶方阵A，如果存在数a与非零n维列向量x，使得Ax=ax，则称a是矩阵A的一个特征值，x是矩阵A属于特征值a的特征向量。

提取特征值与特征向量的API：

# 已知n阶方阵A， 求特征值与特征向量
# eigvals：特征值数组
# eigvecs：对应的特征向量数组
eigvals, eigvecs = np.linalg.eig(A)
# 已知特征值与特征向量，求方阵
S = eigvecs * np.diag(eigvals) * eigvecs.I

案例：

import numpy as np

A = np.mat('1 6 5; 9 7 3; 1 5 6')
print(A)
# 特征值提取
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals, eigvecs, sep='\n')

# 通过特征值与特征向量， 求方阵
S = np.mat(eigvecs) * np.mat(np.diag(eigvals))\
    * np.mat(eigvecs).I
print(S)

案例：读取图片的亮度矩阵，提取特征值与特征向量，保留部分特征值，重新生成新的亮度矩阵，绘制图片。

import numpy as np
import scipy.misc as sm
import matplotlib.pyplot as mp

img = sm.imread('../da_data/lily.jpg', True)
# 特征值提取
eigvals, eigvecs = np.linalg.eig(img)
# 把靠后的特征值与特征向量置为0（抹掉特征）
eigvals[50:] = 0
# 重新生成新的图片
img2 = np.mat(eigvecs) * \
    np.mat(np.diag(eigvals)) *    \
    np.mat(eigvecs).I
img2 = img2.real
mp.figure('Image Features')
mp.subplot(1, 2, 1)
mp.xticks([])
mp.yticks([])
mp.imshow(img, cmap='gray')
mp.subplot(1, 2, 2)
mp.xticks([])
mp.yticks([])
mp.imshow(img2, cmap='gray')
mp.tight_layout()
mp.show()

奇异值分解

有一个矩阵M，可以分解为3个矩阵U、S、V，使得U x S x V等于m。U与V都是正交矩阵（乘以自身的转置矩阵结果为单位矩阵）。那么S矩阵主对角线上的元素则成为矩阵M的奇异值，其他元素均为0。

奇异值分解相关API：

# 奇异值分解
# 得到3个矩阵： U S V。 U与V都是正交矩阵，S矩阵保存着奇异值
U, S, V = np.linalg.svd(M)

案例：

# 奇异值分解
U, sv, V = np.linalg.svd(img)
sv[50:] = 0
img3 = U * np.mat(np.diag(sv)) * V

傅里叶变换

什么傅里叶变换？

傅里叶定理支出任何一条周期曲线，无论多么跳跃与不规则，都能表示成一组光滑的正弦曲线叠加之和。

傅里叶变换即是从一条周期曲线中分解出这一组光滑的正弦曲线的过程。傅里叶变换最终的结果将会得到这一组光滑正弦曲线的振幅、频率、初相位等数据。

傅里叶变换的目的是可以将时域（时间域）上的信号，转变为频域上的信号，随着域的不同，对一个事物的了解角度也就发生改变。因此某些在时域中不好处理的地方，放在频域中就可以较为简单的处理。（数据存储、数据降噪）
$$
y= A_1sin(\omega_1 x+\phi_1) +
A_2sin(\omega_2 x+\phi_2) +
A_3sin(\omega_3 x+\phi_3)+
C
$$

傅里叶变换相关函数

# 导入快速傅里叶变换模块
import numpy.fft as nf
# 通过采样数量、采样周期，求得傅里叶变换分解出的所有曲线的频率
freqs = nf.fft.fftfreq(采样数量， 采样周期)
# 基于原函数值序列求出傅里叶变换分解出的所有曲线的振幅和初相位
# complex_array: 复数数组
#     复数的 辅角  即是相位角
#    复数的 模    即是振幅.
complex_array = nf.fft.fft(原函数值序列)

# 逆向傅里叶变换 通过复数数组，逆向生成原函数
y2 = nf.fft.ifft(complex_array)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo05_fft.py  傅里叶变换
"""
import numpy as np
import matplotlib.pyplot as mp
import numpy.fft as nf

x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
y1 = 4 * np.pi * np.sin(x)
y2 = 4 * np.pi / 3 * np.sin(3 * x)
y3 = 4 * np.pi / 5 * np.sin(5 * x)
y4 = 4 * np.pi / 7 * np.sin(7 * x)
y = y1 + y2 + y3 + y4

mp.figure('FFT', facecolor='lightgray')

mp.subplot(1, 2, 1)
mp.title('Time Domain', fontsize=16)
mp.xlabel('X', fontsize=14)
mp.ylabel('Y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
# mp.plot(x, y1, label='sin(x)')
# mp.plot(x, y2, label='sin(3x)')
# mp.plot(x, y3, label='sin(5x)')
# mp.plot(x, y4, label='sin(7x)')
mp.plot(x, y, label='total')
mp.legend()

# 对该合成波执行FFT操作，得到分解波的信息。
# 绘制频域图像。
# 通过采样数量，采样周期返回频率数组
freqs = nf.fftfreq(x.size, x[1] - x[0])
# 通过原函数图形，得到分解结果的复数数组
# 复数的模就是振幅，复数的辅角为相位角
ffts = nf.fft(y)
pows = np.abs(ffts)
# 逆向傅里叶变换
y2 = nf.ifft(ffts)
# 绘制频域图像
mp.subplot(1, 2, 2)
mp.title('Frequency Domain', fontsize=16)
mp.xlabel('Frequency', fontsize=14)
mp.ylabel('Pow', fontsize=14)
mp.grid(linestyle=':')
mp.plot(freqs[freqs > 0], pows[freqs > 0],
        color='orangered',
        label='Frequency Line')
mp.subplot(1, 2, 1)
mp.plot(x, y2, color='red', linewidth=5,
        alpha=0.5,  label='IFFT LINE')

mp.legend()
mp.tight_layout()
mp.show()

基于傅里叶变换的频域滤波

含噪信号是高能信号与低能噪声叠加的信号，可以通过傅里叶变换的频域滤波实现降噪。

通过FFT使含噪信号转换为含噪频谱，取出低能噪声，留下高能频谱后再通过IFFT合成高能信号。

案例：基于傅里叶变换为音频文件去除噪声。

读取音频文件，获取音频文件的基本信息：采样个数、采样周期，每个采样声音的信号值。绘制音频时域：时间、位移图像。

import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp
# 读取音频文件 获取采样率，与每个采样点的信号值
sample_rate, noised_sigs = \
    wf.read('../da_data/noised.wav')
# print(sample_rate, type(sample_rate))
# print(noised_sigs, noised_sigs.shape)
# 整理图像的y
noised_sigs = noised_sigs / 2 ** 15
# 整理图像的x
times = np.arange(len(noised_sigs)) / sample_rate

mp.figure('Filter', facecolor='lightgray')
mp.subplot(2, 2, 1)
mp.title('Time Domain', fontsize=16)
mp.xlabel('Time', fontsize=12)
mp.ylabel('Sigs', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(times[:150], noised_sigs[:150],
        color='dodgerblue', label='Noised')
mp.legend()
mp.show()

基于傅里叶变换，获取音频频域信息，绘制频域：频率、能量图像。

# 绘制频率图像
freqs = nf.fftfreq(times.size, 1 / sample_rate)
# 获取每个频率波的能量
noised_fft = nf.fft(noised_sigs)
noised_pow = np.abs(noised_fft)
mp.subplot(222)
mp.title('Frequecy Domain', fontsize=16)
mp.xlabel('Freqs', fontsize=12)
mp.ylabel('Pows', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.semilogy(freqs[freqs > 0],
            noised_pow[freqs > 0],
            color='orangered', label='Noised')
mp.legend()
mp.tight_layout()

将低频噪声去除后绘制音频频域：频率、能量图像。

# 请去除低频噪声  找到需要保留的高能频率
fund_freq = freqs[noised_pow.argmax()]
# 所有噪声频率的下标
noised_indices = np.where(freqs != fund_freq)
filter_ffts = noised_fft.copy()
filter_ffts[noised_indices] = 0
filter_pows = np.abs(filter_ffts)

mp.subplot(224)
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pows', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(freqs[freqs > 0],
        filter_pows[freqs > 0],
        color='orangered', label='Filter')
mp.legend()
mp.tight_layout()

基于逆向傅里叶变换，生成新的音频信号，绘制音频时域的：时间、位移图像。

# 生成新的音频图像
filter_sigs = nf.ifft(filter_ffts).real
mp.subplot(223)
mp.xlabel('Time', fontsize=12)
mp.ylabel('Sigs', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(times[:150], filter_sigs[:150],
        color='dodgerblue', label='Filter')

mp.legend()
mp.tight_layout()

生成文件

# 重新生成音频文件
wf.write('../da_data/out.wav', sample_rate,
    (filter_sigs * 2**15).astype(np.int16))

随机数模块

np.random可以生成服从特定统计规律的随机数序列。

二项分布（binomial）

二项分布就是重复n次独立事件的伯努利实验。在每次实验中只有两种可能的结果，而且两种结果发生与否相互对立，相互独立，事件发生与否的概率在每一次独立实验中都保持不变。

# 产生size个随机数， 每个随机数来自n次尝试中成功的次数
# 其中每次尝试成功的概率为p。
np.random.binomial(n, p, size)
np.random.binomial(100, 0.5, 100)

案例：某人投篮命中率为0.3，投10次进5个球的概率。

r = np.random.binomial(10, 0.3, 100000)
r = (r == 5).sum() / 100000
print(r)

# 客服接通率0.6， 打三次电话都没人接的概率
r = np.random.binomial(3, 0.6, 100000)
r = (r == 0).sum() / 100000
print(r)

超几何分布(hypergeometric)

# 产生size个随机数
# 每个随机数为在总样本中随机抽取nsample个样本后，好样本的个数
# 总样本由ngood个好样本和nbad个坏样本组成
np.random.hypergeometric(ngood, nobad, nsample, size)
np.random.hypergeometric(7, 3, 3, 1)

案例：

# 超几何分布
# 7个好苹果、3个坏苹果中随机选个
# 抽中好苹果的数量
r = np.random.hypergeometric(7, 3, 3, 1)
print(r)

正态分布（normal）

# 产生size个随机数，服从标准正态分布(期望=0，标准差=1)
np.random.normal(size)
# 产生size个随机数，服从正态分布(期望=1，标准差=10)
np.random.normal(loc=1, scale=10, size)

杂项功能

排序

联合间接排序

联合间接排序支持为待排序列排序，若待排序列值相同，则利用参考序列作为参考继续排序。最终返回排序过后的有序索引。

indices = np.lexsort((参考序列， 待排序列))

案例：

import numpy as np

products = np.array(
    ['Apple', 'Huawei', 'Mi', 'Samsung', 'OPPO'])
prices = np.array(
    [8000, 4500, 2999, 4999, 2999])
volumns = np.array(
    [100, 180, 80, 20, 100])

indices = np.lexsort((-volumns, prices))
print(indices)

复数数组排序

# 按照实部升序排列, 对于实部相同的元素，参考虚部的升序，返回排序后
# 的结果数组。
np.sort_complex(复数数组)

插入排序

如果有需求向有序数组中插入元素，使数组依然有序，numpy提供了方法：searchsorted方法查询并返回可插入位置数组，然后调用insert插入元素。

indices = np.searchsorted(有序序列，待插入序列)
np.insert(有序序列，位置序列，待插入序列)

案例：

# 插入排序
a = np.array([1, 2, 3, 4, 7, 9])
b = np.array([8, 5])
indices = np.searchsorted(a, b)
print(indices)
r = np.insert(a, indices, b)
print(r)

插值

scipy提供了常见的插值算法，可以通过一系列的离散点设计出符合一定规律的插值器函数。若我们给插值器函数更多的x，该函数将返回对应的函数值。

（插值常用于数据的补充）

import scipy.interpolate as si

# 返回插值器函数
func = si.interp1d(
    离散点x坐标，
    离散点y坐标，
    kind = 插值算法(默认'linear': 线性插值)
)

案例：

import numpy as np
import scipy.interpolate as si
import matplotlib.pyplot as mp

# 搞出15个散点
min_x, max_x = -50, 50
dis_x = np.linspace(min_x, max_x, 15)
dis_y = np.sinc(dis_x)

# 把散点交给插值器，生成插值器函数，绘制图像
linear = si.interp1d(dis_x, dis_y)
x = np.linspace(min_x, max_x, 1000)
y = linear(x)

# 三次样条插值器(Cubic Spline Interpolation)
cu = si.interp1d(dis_x, dis_y, kind='cubic')
y2 = cu(x)

mp.figure('Interpolate')
mp.grid(linestyle=':')
mp.scatter(dis_x, dis_y, s=60, color='red')
##mp.plot(x, y, color='dodgerblue')
mp.plot(x, y2, color='orangered')
mp.show()

插值器会在游戏领域使用，在机器学习领域我们收集到的数据有很多是不完整的数据，这样就需要根据已有数据对空白数据进行预测，对样本数据进行补充完善，这时可以利用机器学习相关算法与插值器对空白数据进行预测与补充，完善训练样本。

积分

直观的说，对于一个给定的正实值函数，在一个实数区间上的定积分可以理解为坐标平面上由曲线、直线、轴围城的曲边梯形的面积。（一个确定值）

案例：利用微元法求定积分: y=2x²+3x+4曲线在[-5,5]区间的定积分。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo10_int.py  积分案例
"""
import numpy as np
import matplotlib.pyplot as mp
import matplotlib.patches as mc


def f(x):
    return 2 * x**2 + 3 * x + 4

a, b = -5, 5
x1 = np.linspace(a, b, 1000)
y1 = f(x1)

mp.figure('Integral', facecolor='lightgray')
mp.title('Integral', fontsize=18)
mp.xlabel('X', fontsize=12)
mp.ylabel('Y', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(x1, y1, color='dodgerblue',
        linewidth=6, label='f(x)')

# 微元法画梯形
n = 50
x2 = np.linspace(a, b, n + 1)
y2 = f(x2)
area = 0  # 保存积分值
for i in range(n):
    area += (y2[i] + y2[i + 1]) * \
        (x2[i + 1] - x2[i]) / 2
print(area)
# 绘制图像
for i in range(n):
    mp.gca().add_patch(
        mc.Polygon([
            [x2[i], 0],    [x2[i], y2[i]],
            [x2[i + 1], y2[i + 1]],
            [x2[i + 1], 0]],
            fc='deepskyblue',
            ec='dodgerblue',
            alpha=0.5))

mp.legend()
mp.show()

scipy.integrate模块提供了quad方法计算定积分：

import scipy.integrate as si
# 求定积分：给出函数f，积分下限a，积分上限b
# 返回：(积分值， 最大误差)
area = si.quad(f, a, b)[0]

基于scipy的图像处理

scipy.ndimage中提供了一些简单的图像处理API，高斯模糊、图像旋转、边缘识别等功能。

机器学习

机器学习DAY01

机器学习概述

什么是机器学习

机器学习是一门能够让编程计算机从数据中学习的计算机科学。

一个计算机程序在完成任务T之后，获得经验E，其表现效果为P，如果任务T的性能表现（P），随着E增加而增加，那么这样的计算机程序就被称为机器学习系统。拥有自我完善，自我增进，自我适应的特点。

为什么需要机器学习

自动化的升级和维护
解决那些算法过于复杂，甚至根本就没有算法的问题
在机器学习的过程中协助人类对事物进行深刻理解

机器学习的问题

建模问题

所谓机器学习，形式上可以这么理解：在数据中通过统计或推理的方法，寻找一个接收特定输入x，并给出预期输出y的功能函数f(x).
评估问题

针对已知输入，函数给出的输出（预测值）与实际输出（目标值）之间存在一定的误差，因此需要构建一个评估体系，根据误差的大小判定函数的优劣。
优化问题

学习的核心在于改善自身的性能，通过数据对算法的反复锤炼，不断提升函数预测的准确性，直到获得能够满足实际需求的最优解决方案。

机器学习的种类

监督学习、无监督学习、半监督学习、强化学习

监督学习：用已知的输出评估模型的性能，引导模型优化。
无监督学习：在没有已知输出的情况下，仅仅根据输入信息的相关性，进行类别的划分。
半监督学习：先通过无监督学习划分类别，再根据人工标记，通过有监督学习预测输出。
强化学习：通过对不同决策结果的奖励和惩罚，使机器学习系统经过长时间训练以后，越来越倾向于给出接近期望结果的输出。

批量学习和增量学习

批量学习：将学习的过程和应用的过程分开，用全部的训练数据训练模型，然后在应用场景中实现预测，当预测结果不够理想时，重新回到学习过程，如此循环。
增量学习：将学习的过程和应用的而过程统一起来，在应用的同时以增量的方式不断学习新的内容，边训练变预测，边预测边训练。

基于实例的学习和基于模型的学习

基于实例的学习：以历史数据作为经验，寻找与待预测输入最接近的样本，以其输出作为预测结果。
基于模型的学习：以历史数据作为经验，建立联系输入与输出的某种数学模型，将待预测输入代入该模型，预测结果。

机器学习的一般过程

数据清理

数据收集（数据检索、数据挖掘、爬虫）
数据清洗

机器学习

选择模型（算法）
训练模型（算法）
评估模型（工具、框架、算法）
测试模型

业务运维

应用模型
维护模型

机器学习的典型应用

股价预测、推荐引擎、自然语言识别（NLP）、语音、图像、人脸识别等。

机器学习的基本问题

回归问题：根据已知的输入和输出寻找某种性能最佳的模型，将未知输出的输入代入模型，得到连续的输出。
分类问题：根据已知的输入和输出寻找某种性能最佳的模型，将未知输出的输入代入模型，得到离散的输出。
聚类问题：根据已知输入的相似程度，将其划分成不同的群落。
升维、降维问题：在性能损失尽可能小的情况下，实现模型业务。

数据预处理

数据样本矩阵格式

年龄	学历	经验	性别	月薪
27	硕士	2	女	12000
24	本科	1	男	8000
…	…	…	…	…

一行一样本，一列一特征。

数据预处理相关库

# sklearn是解决机器学习问题的科学计算工具包
# preprocessing用于数据预处理操作
import sklearn.preprocessing as sp

均值移除(标准化)

由于一个样本的不同特征值差异较大，不利于使用现有机器学习算法进行样本处理。所以可以使用均值移除让样本矩阵中每组特征值的平均值为0，标准差为1。

如何使样本矩阵中每组特征平均值为0？

年龄： 17  20  23
mean(年龄) = 20
a = -3
b = 0 
c = 3

如何使样本矩阵中每组特征标准差为1？

a = -3
b = 0 
c = 3
s = std(a, b, c)
结果：[a/s, b/s, c/s]

sklearn提供了均值移除的相关API：

import sklearn.preprocessing as sp
A = sp.scale(array)

案例：

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [17., 100., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

std_samples = sp.scale(raw_samples)
print(std_samples)
# 输出每一列的均值与标准差
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))

范围缩放

将样本矩阵中的每一列的最小值与最大值设定为相同的区间，统一各列特征值的范围。一般情况下会把特征值缩放至[0,1]区间。

如何使一组特征值的最小值为0呢？

原特征： [17, 20, 23]
每个元素减去特征值的最小值： [0, 3, 6]

如何使一组特征值的最大值为1呢？

[0, 3, 6]
把特征值数组每个元素除以最大值： [0, 0.5, 1]

sklearn提供的范围缩放相关API：

# 获取MinMax范围缩放器， 定义缩放范围[0, 1]
mms = sp.MinMaxScaler(feature_range=(0, 1))
# 返回缩放结果
result = mms.fit_transform(原始样本矩阵)

案例：

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [17., 100., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

print(raw_samples)
# 执行范围缩放
mms = sp.MinMaxScaler(feature_range=(0, 1))
result = mms.fit_transform(raw_samples)
print(result)

该范围缩放过程也可以这么理解：

当x=17时，y=0；当x=23时，y=1；问：当x=20时，y=几？
$$
\left[\begin{matrix}
17 & 1 \
23 & 1 \
\end{matrix}\right]
\times
\left[\begin{matrix}
k \
b \
\end{matrix}\right]
=
\left[\begin{matrix}
0 \
1 \
\end{matrix}\right]
$$

# 使用解方程的方式求解k与b，实现线性范围缩放
mms_samples = raw_samples.copy()
for col in mms_samples.T:
    col_min = col.min()
    col_max = col.max()
    a = np.array([[col_min, 1],
                  [col_max, 1]])
    b = np.array([0, 1])
    x = np.linalg.solve(a, b)
    # 计算 kx + b
    col = col * x[0]
    col += x[1]
print(mms_samples)

归一化

有些情况每个样本的每个特征值具体是多少并不重要，但是每个样本特征值的占比更加重要。

	Python	Java	PHP
2017	10	20	5
2018	8	5	0
…	…	…	…

归一化即是用每个样本的每个特征值除以该样本各个特征值绝对值的总和。变换后样本矩阵，每个样本的特征值绝对值之和为1。

sklearn提供的归一化相关API：

# array：原始样本矩阵
# norm：范数
#    l1：l1范数，这组数据每个元素绝对值之和为1
#   l2：l2范数，这组数据每个元素平方之和为1
sp.normalize(array, norm='l1')

案例：

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [17., 100., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

print(raw_samples)
# 归一化处理
nor_samps = sp.normalize(raw_samples, norm='l1')
print(nor_samps)
print(abs(nor_samps).sum(axis=1))

# 自己算：
nor_samps = raw_samples.copy()
for row in nor_samps:
    row /= abs(row).sum()
print(nor_samps)
print(abs(nor_samps).sum(axis=1))

二值化

有些业务并不需要分析矩阵的详细完整数据（比如图像边缘识别只需要分析图像边缘即可），可以根据一个事先给定的阈值，用0和1对特征值进行转换，分别表示不高于或高于阈值。二值化后的数组中每个元素非0即1，达到简化数学模型的目的。

sklearn提供二值化相关API：

# 给出阈值，获取二值化器
bin = sp.Binarizer(threshold=阈值)
# 对原始样本矩阵执行二值化预处理操作
result = bin.transform(原始样本矩阵)

案例：

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [17., 100., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

print(raw_samples)
# 二值化处理
bin = sp.Binarizer(threshold=80)
bin_samples = bin.transform(raw_samples)
print(bin_samples)
# 自己做：
bin_samples = raw_samples.copy()
bin_samples[bin_samples <= 80] = 0
bin_samples[bin_samples > 80] = 1
print(bin_samples)

独热编码

为样本特征的每个值建立一个由一个1与若干个0组成的序列，用该序列对所有的特征值进行编码。

2个数      3个数    4个数
1        3        2
7        5        4
1        8        6
7        3        9
为每一个数组进行编码：
1-10    3-100    2-1000
7-01    5-010    4-0100
        8-001    6-0010
                9-0001
编码完毕后得到的最终样本矩阵：
101001000
010100100
100010010
011000001

独热编码相关的API：

# 创建独热编码器对象
# sparse：是否采用紧缩格式
# dtype:数据类型
ohe=sp.OneHotEncoder(sparse=是否采用紧缩格式，dtype=数据类型)
# 对原始样本矩阵进行处理，返回度热编码后的样本矩阵
result=ohe.fit_transform(原始样本矩阵)

# 获取ohe的编码字典
ohe_dict = ohe.fit(raw_samples)
ohe_samples = ohe_dict.transform(raw_samples)
print(ohe_samples)

案例：

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [17., 100., 4000.],
    [20., 80., 5000.],
    [23., 75., 5500.]])

print(raw_samples)
# 独热编码
ohe = sp.OneHotEncoder(sparse=False, dtype=int)
ohe_samples = ohe.fit_transform(raw_samples)
print(ohe_samples)
# 获取ohe的编码字典
ohe_dict = ohe.fit(raw_samples)
ohe_samples = ohe_dict.transform(raw_samples)
print(ohe_samples)

标签编码

根据字符串形式的特征值在特征序列中的位置，为其制定一个数字标签，用于提供给基于数值算法的学习模型。

标签编码相关的API：

# 获取标签编码器
lbe = sp.LabelEncoder()
# 对原始样本矩阵执行标签编码，返回编码完毕后的结果
result = lbe.fit_transform(原始样本矩阵)
# 给出编码矩阵，反向逆推出原始样本矩阵
samples = lbe.inverse_transform(result)

案例：

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    'audi', 'ford', 'audi', 'toyota',
    'ford', 'bmw', 'toyota', 'byd',
    'audi'])
# 执行标签编码
lbe = sp.LabelEncoder()
lbe_samples = lbe.fit_transform(raw_samples)
print(lbe_samples)

inv_sampls = lbe.inverse_transform(lbe_samples)
print(inv_sampls)

线性回归

线性回归的本质为针对符合线性数学模型的一组数据，可以找到一个线性方程拟合样本数据。从而当给出自变量后，通过线性方程实现预测输出的目的。

输入       输出
0.5        5.0
0.6        5.5
0.8        6.0
1.1        6.8
1.4        7.0
....
y = kx + b

预测函数： y = w₀ + w₁x

x：输入

y：输出

w₀ w₁: 模型参数

所谓模型的训练，就是根据已知的x与y，找到最佳的模型参数 w₀ w₁，使得尽可能的精确描述所有输出和输出的关系（误差最小）。

单样本误差：

根据预测函数求出输入为x时的预测值： y’ = w₀ + w₁x。

单样本误差则为：1/2 (y’ - y)²

总样本误差：

把所有单样本误差相加即是总样本误差： 1/2 Σ (y’ - y)²

损失函数：

loss = 1/2 Σ (w₀ + w₁x - y)²

所以最终的结果即是需要找到一组w₀ w₁使得loss函数值最小。

机器学习DAY02

线性回归

输入       输出
0.5        5.0
0.6        5.5
0.8        6.0
1.1        6.8
1.4        7.0
....
y = kx + b

预测函数： y = w₀ + w₁x

x：输入

y：输出

w₀ w₁: 模型参数

所谓模型的训练，就是根据已知的x与y，找到最佳的模型参数 w₀ w₁，使得尽可能的精确描述所有输入和输出的关系（误差最小）。

单样本误差：

根据预测函数求出输入为x时的预测值： y’ = w₀ + w₁x。

单样本误差则为：1/2 (y’ - y)²

总样本误差：

把所有单样本误差相加即是总样本误差： 1/2 Σ (y’ - y)²

损失函数：

loss = 1/2 Σ (w₀ + w₁x - y)²

所以最终的结果即是需要找到一组w₀ w₁使得loss函数值最小。

案例：基于梯度下降理论实现线性回归，求出w₀ w₁。

整理训练集数据，自定义梯度下降算法，求出w₀ w₁，绘制回归线。

import numpy as np
import matplotlib.pyplot as mp

train_x = np.array([0.5, 0.6, 0.8, 1.1, 1.4])
train_y = np.array([5.0, 5.5, 6.0, 6.8, 7.0])
test_x = np.array([0.45, 0.55, 1.0, 1.3, 1.5])
test_y = np.array([4.8, 5.3, 6.4, 6.9, 7.3])
# 基于梯度下降理论，完成线性回归，求w0与w1
times = 1000  # 梯度下降次数
lrate = 0.01  # 每次梯度下降参数的变化率
epoches = []  # 记录每次梯度下降的索引
w0, w1, losses = [1], [1], []  # 参数
# 做1000次梯度下降
for i in range(1, times + 1):
    epoches.append(i)
    loss = ((w0[-1] + w1[-1] * train_x
             - train_y)**2).sum() / 2
    losses.append(loss)
    # 使用偏导数取出w0与w1梯度下降值
    d0 = ((w0[-1] + w1[-1] * train_x)
          - train_y).sum()
    d1 = ((w0[-1] + w1[-1] * train_x - train_y)
          * train_x).sum()
    print('{: 4} > w0={: .8f}, w1={: .8f}, \
          loss={: .8f}'.format(
        epoches[-1], w0[-1], w1[-1],
        losses[-1]))
    # w0, w1梯度下降：
    w0.append(w0[-1] - lrate * d0)
    w1.append(w1[-1] - lrate * d1)

mp.figure('Linear Regression')
mp.title('Linear Regression', fontsize=18)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(train_x, train_y, marker='s',
           c='dodgerblue', alpha=0.5, s=80,
           label='Training Points')

# 通过w0与w1，绘制回归线
pred_train_y = w1[-1] * train_x + w0[-1]
mp.plot(train_x, pred_train_y,
        color='orangered', linewidth=2,
        label='Regression Line')

mp.legend()
mp.show()

绘制每次梯度下降过程中w₀ w₁ loss值的变化。

# 绘制每次梯度下降过程中w0 w1 loss值的变化。
mp.figure('Training Progress')
mp.subplot(311)
mp.title('Training Progress', fontsize=16)
mp.ylabel('w0', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=":")
mp.plot(epoches, w0[1:], label='w0')
mp.legend()
mp.subplot(312)
mp.title('Training Progress', fontsize=16)
mp.ylabel('w1', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=":")
mp.plot(epoches, w1[1:], label='w1')
mp.legend()
mp.subplot(313)
mp.title('Training Progress', fontsize=16)
mp.ylabel('loss', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=":")
mp.plot(epoches, losses, label='loss')
mp.legend()
mp.tight_layout()

基于三维曲面绘制梯度下降过程中的每一个空间点。

import mpl_toolkits.mplot3d as axes3d

# 绘制空间曲面，观察梯度下降过程
grid_w0, grid_w1 = np.meshgrid(
    np.linspace(0, 9, 500),
    np.linspace(0, 3.5, 500))
# 构建一个数组，结构与grid_w0一致，元素全是0
grid_loss = np.zeros_like(grid_w0)
for x, y in zip(train_x, train_y):
    grid_loss += \
        (grid_w0 + x * grid_w1 - y)**2 / 2
# 绘制曲面
mp.figure("Loss Function")
ax = mp.gca(projection='3d')
mp.title('Loss Function')
ax.set_xlabel('w0', fontsize=12)
ax.set_ylabel('w1', fontsize=12)
ax.set_zlabel('loss', fontsize=12)
ax.plot_surface(
    grid_w0, grid_w1, grid_loss,
    rstride=10, cstride=10, cmap='jet')
ax.plot(w0[:-1], w1[:-1], losses,
        'o-', color='orangered')

以等高线的方式绘制梯度下降的过程。

# 以等高线的方式绘制梯度下降的过程。
mp.figure('Contour')
mp.xlabel('w0', fontsize=14)
mp.ylabel('w1', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.contourf(grid_w0, grid_w1, grid_loss,
            10, cmap='jet')
cntr = mp.contour(grid_w0, grid_w1, grid_loss,
                  10, colors='black',
                  linewidths=0.5)
mp.clabel(cntr, inline_spacing=0.1,
          fmt='%.2f', fontsize=8)
mp.plot(w0, w1, 'o-', c='orangered',
        label='BGD')
mp.legend()

线性回归模型

sklearn提供了线性回归相关的API：

import sklearn.linear_model as lm
# 创建线性回归模型
model = lm.LinearRegression()
# 模型训练
# 输入：一个二维数组表示的样本矩阵
# 输出：每个样本最终的结果
model.fit(输入，输出)
# 模型预测  给出与训练时格式相同的输入， 返回预测输出
result = model.predict(测试输入)

案例：基于线性回归训练single.txt中的样本，使用模型预测样本。

import numpy as np
import matplotlib.pyplot as mp
import sklearn.linear_model as lm

# 读取数据
x, y = np.loadtxt(
    '../ml_data/single.txt', delimiter=',',
    usecols=(0, 1),  unpack=True)
x = x.reshape(-1, 1)  # x变为n行1列
# 创建模型->训练模型->使用模型
model = lm.LinearRegression()
model.fit(x, y)
pred_y = model.predict(x)
# 把这些点画出来
mp.figure('Linear Regression', facecolor='lightgray')
mp.title('Linear Regression', fontsize=14)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(x, y, c='dodgerblue', alpha=0.5,
           s=60, label='Sample Points')
mp.plot(x, pred_y, c='orangered',
        linewidth=3, label='Regression Line')
mp.legend()
mp.show()

评估训练结果的误差

线性回归模型训练完毕后，可以利用测试集评估训练结果误差。sklearn.metrics 提供了计算模型误差的几个常用算法：

import sklearn.metrics as sm

# 平均绝对值误差:  1/m*∑|实际输出-预测输出|
sm.mean_absolute_error(y， pred_y)
# 平均平方误差:  sqrt(1/m*∑(实际输出-预测输出)^2)
sm.mean_squared_error(y, pred_y)
# 中位数绝对值误差: median(|实际输出-预测输出|)
sm.median_absolute_error(y, pred_y)
# R2得分 (0, 1]区间上的分值，分数越高，模型越好, 误差越小
sm.r2_score(y, pred_y)

案例：

# 输出误差指标
print(sm.mean_absolute_error(y, pred_y))
print(sm.mean_squared_error(y, pred_y))
print(sm.median_absolute_error(y, pred_y))
print(sm.r2_score(y, pred_y))

模型的保存与加载

模型训练是一个耗时的过程，一个优秀的机器学习模型是很宝贵的。可以把模型保存到磁盘中，也可以在需要的时候从磁盘中重新加载模型。不需要重新训练。

模型保存和加载相关的API：

import pickle
pickle.dump(内存对象, 磁盘文件)  # 保存模型
model = pickle.load(磁盘文件) # 加载模型

案例：把训练好的模型保存到磁盘中，在合适的时间读取模型对象。

# 保存模型
with open('../ml_data/linear.pkl', 'wb') as f:
    pickle.dump(model, f)

# 加载模型
with open('../ml_data/linear.pkl', 'rb') as f:
    model = pickle.load(f)
pred_y = model.predict(x)

岭回归

普通线性回归模型使用基于梯度下降的最小二乘法，在损失函数最小化的前提下，寻找最优的模型参数。在此过程中，包括少数异常样本在内的全部训练数据会对模型参数造成相当程度的一些影响，而异常样本无法在训练过程中被识别出来。为此，岭回归在模型迭代过程中增加了正则项，以限制模型参数对异常样本的匹配程度。进而提高模型对多数正常样本的拟合精度。

岭回归相关API：

import sklearn.linear_model as lm
# 正则强度: 该数字越大，越忽略异常样本
# fit_intercept: 是否训练截距 
# max_iter: 最大迭代次数
m = lm.Ridge(正则强度, fit_intercept=True, max_iter=1000)
# 训练模型
m.fit(输入， 输出)
# 使用模型
pred_y = m.predict(test_x)

案例：加载abnormal.txt

# 使用岭回归
model = lm.Ridge(150, fit_intercept=True,
                 max_iter=10000)
model.fit(x, y)
pred_y = model.predict(x)
print(sm.r2_score(y, pred_y))
mp.plot(x, pred_y, c='limegreen',
        linewidth=3, label='Ridge Regression')

多项式回归

若希望回归模型更好的拟合训练样本的数据，可以用多项式回归器。

一元多项式回归
$$
y = w_0 + w_1x + w_2x^2 + w_3x^3 + … + w_dx^d
$$
将高次项看做对1次项的扩展:
$$
y = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + … + w_dx_d
$$
那么一元多项式回归即可以看做为多元线性回归。可以直接使用LinearRegression模型对样本数据进行模型训练。

所以一元多项式回归的实现需要两个步骤：

将一元多项式回归问题转换为多元线性回归问题（只需给出高次数）
将1步骤得到的多项式结果中w₁ w₂ .. 当做样本特征，交给线性回归器训练多元线性模型。

使用sklearn提供的数据管线实现两个步骤的顺序执行：

import sklearn.pipeline as pl
import sklearn.preprocessing as sp
import sklearn.linear_model as lm
# 该管线将会组织两个步骤，首先使用多项式特征扩展器扩展出多项式，
# 而后把数据交给线性回归器训练多元线性模型。
model = pl.make_pipeline(
    sp.PolynomialFeatures(5), # 多项式特征扩展器
    lm.LinearRegression()      # 线性回归器    
)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo06_poly.py 多项式回归 single.txt
"""
import numpy as np
import matplotlib.pyplot as mp
import sklearn.linear_model as lm
import sklearn.preprocessing as sp
import sklearn.metrics as sm
import sklearn.pipeline as pl

# 读取数据
x, y = np.loadtxt(
    '../ml_data/single.txt', delimiter=',',
    usecols=(0, 1),  unpack=True)
x = x.reshape(-1, 1)  # x变为n行1列

# 创建管线
model = pl.make_pipeline(
    sp.PolynomialFeatures(6),  # 多项式特征扩展
    lm.LinearRegression()  # 线性回归器
)
model.fit(x, y)
pred_y = model.predict(x)
print(sm.r2_score(y, pred_y))

# 把这些点画出来
mp.figure('Linear Regression', facecolor='lightgray')
mp.title('Linear Regression', fontsize=14)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=":")
mp.scatter(x, y, c='dodgerblue', alpha=0.5,
           s=60, label='Sample Points')
# 绘制多项式曲线
x = np.linspace(x.min(), x.max(), 1000)
x = x.reshape(-1, 1)
y = model.predict(x)
mp.plot(x, y, color='orangered',
        linewidth=2, label='PolyFit Line')
mp.legend()
mp.show()

过于简单的模型，无论对于训练数据还是测试数据都无法给出足够高的预测精度，这种现象叫做欠拟合。

过于复杂的模型，对于训练数据可以得到较高的预测精度，但是对于预测数据通常精度较低。这种现象叫做过拟合。

一个性能可以接收的学习模型，应该对训练数据和测试数据都有接近的预测精度，而且精度不能太低。

决策树

基本算法原理

核心思想：相似的输入必会产生相似的输出。比如预测薪资：

年龄：1-青年， 2-中年， 3 -老年

学历：1-本科， 2-硕士，3-博士

经验：1-出道，2-一般，3-老手， 4-骨灰

性别：1-男性，2-女性

年龄	学历	工作经验	性别	==>	薪资
1	1	1	1	==>	6000
2	1	3	1	==>	10000
3	3	4	1	==>	50000
…	…	…	…	…	…
1	2	2	1	==>	?

为了提高搜索效率，使用树形数据结构处理样本数据：
$$
年龄1\left{
\begin{aligned}
学历1 \
学历2 \
学历3 \
\end{aligned}
\right.
\quad\quad
年龄2\left{
\begin{aligned}
学历1 \
学历2 \
学历3 \
\end{aligned}
\right.
\quad\quad
年龄3\left{
\begin{aligned}
学历1 \
学历2 \
学历3 \
\end{aligned}
\right.
$$
首先从训练样本矩阵中选择第一个特征正进行子表划分，使每个子表中该特征的值全部相同。然后在每个子表中选择下一个特征按照同样的规则继续划分更小的子表，不断重复，直到所有的特征全部使用完为止。

此时便得到了叶级子表，其中所有样本的特征值全部相同。对于待预测样本，根据其每一个特征的值，选择对应的子表，逐一匹配，直到找到域值完全匹配的叶级子表，利用该子表中样本的输出，通过平均（回归问题）或投票（分类问题）为待预测样本提供预测输出。

随着子表的划分，信息熵（信息的混乱度）将会越来越小，信息越来越纯，数据越来越精准。

决策树回归器模型相关API：

import sklearn.tree as st
# 创建决策树回归器模型
# max_depth: 树的最大深度
model = st.DecisionTreeRegressor(max_depth=4)
# 训练模型
model.fit(train_x, train_y)
# 使用模型
pred_y=model.predict(test_x)

案例：预测波士顿地区房屋价格。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo07_dt.py 决策树预测房屋价格
"""
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.metrics as sm

boston = sd.load_boston()
print(boston.data.shape)  # 数据量
print(boston.target.shape)
print(boston.feature_names)  # 13个特征名

# 打乱原始输入、输出数据集
x, y = su.shuffle(
    boston.data, boston.target, random_state=7)
# 划分测试集、训练集
train_size = int(len(x) * 0.8)
train_x, train_y, test_x, test_y = \
    x[:train_size], y[:train_size], \
    x[train_size:], y[train_size:]

# 创建模型， 使用训练集数据进行训练
model = st.DecisionTreeRegressor(max_depth=10)
model.fit(train_x, train_y)
# 测试集数据预测
pred_test_y = model.predict(test_x)
# 输出R2得分
print(sm.r2_score(test_y, pred_test_y))

集合算法

根据多个不同模型给出的预测结果，利用平均（回归问题）或者投票（分类问题）的方法，得出最终预测的结果。

基于决策树的集合算法，就是按照某种规则构建多颗彼此不同的决策树模型，分别给出针对未知样本的预测结果，最后通过平均或投票的方式得到相对综合的结论。

正向激励

首先为样本矩阵中的每行样本随机分配初始权重，由此构建一棵带有权重的决策树。在使用该决策树提供预测输出时，通过加权平均或加权投票的方式产生预测值。将训练样本代入模型，预测输出，对那些预测失败的样本，提高其权重，由此形成第二棵树。重复以上过程，构建出不同权重的n棵树。

正向激励相关API：

import sklearn.tree as st
import sklearn.ensemble as se
# 普通决策树模型
model = st.DecisionTreeRegressor(max_depth=4)
# 使用正向激励集合算法构建400棵树
model = se.AdaBoostRegressor(
    model,     # 基础模型
    n_estimators=400,  # 决策树的数量
    random_state=7    # 随机种子
)
model.fit(train_x, train_y)
test_y = model.predict(test_x)

案例：

机器学习DAY03

机器学习

集合算法

根据多个不同模型给出的预测结果，利用平均（回归问题）或者投票（分类问题）的方法，得出最终预测的结果。

正向激励

正向激励相关API：

import sklearn.tree as st
import sklearn.ensemble as se
# 普通决策树模型
model = st.DecisionTreeRegressor(max_depth=4)
# 使用正向激励集合算法构建400棵树
model = se.AdaBoostRegressor(
    model,     # 基础模型
    n_estimators=400,  # 决策树的数量
    random_state=7    # 随机种子
)
model.fit(train_x, train_y)
test_y = model.predict(test_x)

特征重要性

作为决策树模型训练过程的副产品，根据每个特征划分子表前后的信息熵减少量就标志了该特征的重要程度，此即为该特征的特征重要性指标。通过模型训练得到的model对象提供了属性：feature_importances_来存储每个特征的特征重要性指标值。

获取特征重要性相关API:

model.fit(train_x, train_y)
fi = model.feature_importances_

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo01_fi.py 特征重要性
"""
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.metrics as sm
import sklearn.ensemble as se
import sklearn.linear_model as lm
import matplotlib.pyplot as mp
import numpy as np

boston = sd.load_boston()
print(boston.data.shape)  # 数据量
print(boston.target.shape)
print(boston.feature_names)  # 13个特征名

# 打乱原始输入、输出数据集
x, y = su.shuffle(
    boston.data, boston.target, random_state=7)
# 划分测试集、训练集
train_size = int(len(x) * 0.8)
train_x, train_y, test_x, test_y = \
    x[:train_size], y[:train_size], \
    x[train_size:], y[train_size:]

# 创建模型， 使用训练集数据进行训练
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
# 测试集数据预测
pred_test_y = model.predict(test_x)
# 输出R2得分
print(sm.r2_score(test_y, pred_test_y))
# 单颗决策树模型，输出特征重要性
fi = model.feature_importances_
# 把特征重要性绘制成柱状图
mp.figure('Feature Importance')
mp.subplot(211)
mp.title('Decision Tree', fontsize=14)
mp.ylabel('Importance', fontsize=12)
mp.grid(linestyle=":")
indices = fi.argsort()[::-1]
pos = np.arange(indices.size)
mp.bar(pos, fi[indices], label='DT',
       facecolor='deepskyblue',
       edgecolor='steelblue')
mp.xticks(pos, boston.feature_names[indices])
mp.legend()

# 正向激励
model = se.AdaBoostRegressor(
    model,     # 基础模型
    n_estimators=400,  # 决策树的数量
    random_state=7  # 随机种子
)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
# 输出R2得分
print(sm.r2_score(test_y, pred_test_y))
# 基于决策树的正向激励，输出特征重要性
fi2 = model.feature_importances_
# 把特征重要性绘制成柱状图
mp.subplot(212)
mp.title('AdaBoostRegressor', fontsize=14)
mp.ylabel('Importance', fontsize=12)
mp.grid(linestyle=":")
indices = fi2.argsort()[::-1]
pos = np.arange(indices.size)
mp.bar(pos, fi2[indices], label='AB',
       facecolor='deepskyblue',
       edgecolor='steelblue')
mp.xticks(pos, boston.feature_names[indices])
mp.legend()

mp.show()

自助聚合

每次从总样本矩阵中以有放回抽样的方式随机抽取部分样本构建决策树，这样形成多颗包含不同训练样本的决策树。以削弱某些强势样本对模型预测结果的影响，提高模型的普适性。

随机森林

在自主聚合的基础上，每次构建决策树模型时，不仅随机选择部分样本，而且还随机选择部分特征，这样的集合算法，不仅规避了强势样本对预测结果的影响，而且也削弱了强制特征的影响，使模型的预测能力更加泛化。

随机森林相关API：

import sklearn.ensemble as se
# 随机森林回归器
model = se.RandomForestRegressor(
    max_depth=10,         #  数的最大深度
    n_estimators=1000,    #  决策树的数量
    min_samples_split=2    #  子表中最小样本数，小于该值，不再拆分
)

案例：分析共享单车的需求，从而判断如何进行共享单车的投放。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo02_bike.py 共享单车分析
"""
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp

data = np.loadtxt('../ml_data/bike_day.csv',
                  unpack=False, dtype='U20',
                  delimiter=',')
# 获取表头字符序列
day_headers = data[0, 2:13]
# 整理输入输出集
x = np.array(data[1:, 2:13], dtype=float)
y = np.array(data[1:, -1], dtype=float)

x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, train_y, test_x, test_y = \
    x[:train_size], y[:train_size],  \
    x[train_size:], y[train_size:]
# 随机森林回归器
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
fi = model.feature_importances_
indices = fi.argsort()[::-1]
print(day_headers[indices])

# 以小时为单位
data = np.loadtxt('../ml_data/bike_hour.csv',
                  unpack=False, dtype='U20',
                  delimiter=',')
# 获取表头字符序列
hour_headers = data[0, 2:14]
# 整理输入输出集
x = np.array(data[1:, 2:14], dtype=float)
y = np.array(data[1:, -1], dtype=float)

x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, train_y, test_x, test_y = \
    x[:train_size], y[:train_size],  \
    x[train_size:], y[train_size:]
# 随机森林回归器
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
fi = model.feature_importances_
indices = fi.argsort()[::-1]
print(hour_headers[indices])

人工分类(简单分类)

特征1	特征2	输出
3	1	0
2	5	1
1	8	1
6	4	0
5	2	0
3	5	1
4	7	1
4	-1	0
…	…	…
6	8	?

案例：

import numpy as np
import matplotlib.pyplot as mp

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# 绘制类别边界线
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
n = 500
grid_x, grid_y = np.meshgrid(
    np.linspace(l, r, n),
    np.linspace(b, t, n))
grid_z = np.piecewise(
    grid_x,
    [grid_x < grid_y, grid_x >= grid_y],
    [1, 0])

# 使用matplotlib绘制这些点，以及点的类别
mp.figure('Simple Classification', facecolor='lightgray')
mp.title('Simple Classification', fontsize=16)
mp.xlabel('Feture1', fontsize=12)
mp.ylabel('Feture2', fontsize=12)
mp.tick_params(labelsize=10)
# 为网格点坐标矩阵填充颜色
mp.pcolormesh(grid_x, grid_y, grid_z,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg',
           s=80)
mp.show()

逻辑分类

通过输入的样本数据，基于多元线性回归模型求出线性预测方程。

y = w₀ + w₁x₁ + w₂x₂

x1    x2     y
7     13     0.823423423
12    4      0.224234234  
......

但通过线性回归方程返回的是连续值，不可以直接用于分类业务模型，所以急需一种方式使得把连续的预测值->离散的预测值。[-∞, +∞] -> (0, 1)
$$
逻辑函数：y=\frac{1}{1+e^{-x}}
$$
该逻辑函数当x>0，y>0.5; 当x<0, y<0.5；可以把样本数据经过线性预测模型求得的值代入逻辑函数的x，即将预测函数的输出看做被划分成为1类别的概率，则概率大的类别作为预测结果，就可以根据函数值确定两个分类。这也是线性函数非线性化的一种方式。

逻辑分类相关API：

import sklearn.linear_model as lm
# 创建逻辑分类模型
# solver：liblinear指的是逻辑分类底层使用线性回归模型
m = lm.LogisticRegression(solver='liblinear', C=正则强度)
m.fit(训练输入，训练输出)
m.predict(测试输入)

案例：

import numpy as np
import matplotlib.pyplot as mp
import sklearn.linear_model as lm

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# 绘制类别边界线
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
n = 500
grid_x, grid_y = np.meshgrid(
    np.linspace(l, r, n),
    np.linspace(b, t, n))
# 构建逻辑分类器
model = lm.LogisticRegression(
    solver='liblinear', C=1)
model.fit(x, y)
samples = np.column_stack((
    grid_x.ravel(), grid_y.ravel()))
grid_z = model.predict(samples)
grid_z = grid_z.reshape(grid_x.shape)

# 使用matplotlib绘制这些点，以及点的类别
mp.figure('Simple Classification', facecolor='lightgray')
mp.title('Simple Classification', fontsize=16)
mp.xlabel('Feture1', fontsize=12)
mp.ylabel('Feture2', fontsize=12)
mp.tick_params(labelsize=10)
# 为网格点坐标矩阵填充颜色
mp.pcolormesh(grid_x, grid_y, grid_z,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg',
           s=80)
mp.show()

多元分类

通过多个二元分类器解决多元分类的问题。

若拿到一组新的样本，可以基于二元逻辑分类器，训练出一个模型，判断属于A类别的概率。在使用同样的方法训练出两个模型，分别判断属于B、C类别的概率，最终选择概率最高的类别作为这组样本的分类结果。

案例：基于逻辑分类模型的多元分类。

import numpy as np
import matplotlib.pyplot as mp
import sklearn.linear_model as lm

x = np.array([
    [4, 7],
    [3.5, 8],
    [3.1, 6.2],
    [0.5, 1],
    [1, 2],
    [1.2, 1.9],
    [6, 2],
    [5.7, 1.5],
    [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# 绘制类别边界线
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
n = 500
grid_x, grid_y = np.meshgrid(
    np.linspace(l, r, n),
    np.linspace(b, t, n))
# 构建逻辑分类器
model = lm.LogisticRegression(
    solver='liblinear', C=1000)
model.fit(x, y)
samples = np.column_stack((
    grid_x.ravel(), grid_y.ravel()))
grid_z = model.predict(samples)
grid_z = grid_z.reshape(grid_x.shape)

# 使用matplotlib绘制这些点，以及点的类别
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=16)
mp.xlabel('Feture1', fontsize=12)
mp.ylabel('Feture2', fontsize=12)
mp.tick_params(labelsize=10)
# 为网格点坐标矩阵填充颜色
mp.pcolormesh(grid_x, grid_y, grid_z,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg',
           s=80)
mp.show()

朴素贝叶斯分类

朴素贝叶斯分类是一种依据统计概率理论而实现的一种分类方式。

天气情况	穿衣风格	约女朋友	==>	心情
0（晴天）	0（休闲）	0（约了）	==>	0（高兴）
0	1（风骚）	1（没约）	==>	0
1（多云）	1	0	==>	0
0	2（破旧）	1	==>	1（郁闷）
2（下雨）	2	0	==>	0
…	…	…	==>	…
0	1	0	==>	？

贝叶斯定理：

P(A|B) = P(A) P(B|A) / P(B) <= P(A, B) = P(A)P(B|A) = P(B) P(A|B)

例如：假设一个学校里有60%男生和40%女生，女生穿裤子的人数和穿裙子的人数相等，所有男生都穿裤子，一个人在远处随机看到了一个穿裤子的学生，那么这个学生是女生的概率是多少？

P(女) = 0.4
P(裤子|女) = 0.5
P(裤子) = 0.6+0.2 = 0.8
P(女生|裤子)= 0.4*0.5/0.8 = 0.25

由此可得，统计总样本空间中晴天、休闲、没约并且高兴的概率，然后在求总样本空间中晴天、休闲、没约并且郁闷的概率，取其大者作为分类结果。

P(晴天，休闲，没约，高兴)
=P(晴天|休闲，没约，高兴) P(休闲，没约，高兴)
=P(晴天|休闲，没约，高兴) P(休闲|没约，高兴) P(没约，高兴)
=P(晴天|休闲，没约，高兴) P(休闲|没约，高兴) P(没约|高兴)P(高兴)
（朴素:条件独立，特征值中间没有因果关系）
=P(晴天|高兴) P(休闲|高兴) P(没约|高兴)P(高兴)

朴素贝叶斯分类器相关API：

import sklearn.naive_bayes as nb
# 创建符合高斯分布的朴素贝叶斯分类器
model = nb.GaussianNB()
model.fit(x, y)
result = model.predict(samples)

案例：multiple1.txt

import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp

data = np.loadtxt('../ml_data/multiple1.txt',
                  unpack=False, dtype='U20',
                  delimiter=',')

x = np.array(data[:, :-1], dtype=float)
y = np.array(data[:, -1], dtype=float)

# 构建高斯朴素贝叶斯分类器
model = nb.GaussianNB()
model.fit(x, y)
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
n = 500
grid_x, grid_y = np.meshgrid(
    np.linspace(l, r, n),
    np.linspace(b, t, n))
samples = np.column_stack(
    (grid_x.ravel(), grid_y.ravel()))
grid_z = model.predict(samples)
grid_z = grid_z.reshape(grid_x.shape)

mp.figure('Naive Bayes Classification')
mp.title('Naive Bayes Classification', fontsize=16)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x, grid_y, grid_z,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], s=80,
           c=y, cmap='brg')
mp.show()

数据集的划分

对于分类问题，训练集和测试集的划分不应该用整个样本空间的特定百分比作为训练数据，而应该在其每一个类别的样本中抽取特定百分比作为训练数据。

sklearn提供了数据集划分相关方法，可以方便的划分训练集与测试数据，相关API如下：

import sklearn.model_selection as ms
# train_test_split自动划分测试集与训练街
训练输入，测试输入，训练输出，测试输出 = ms.train_test_split(
    输入集，输出集，
    test_size=0.2,     # 测试集占比
    random_state=7    
)

案例：

# 划分训练集与测试集
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=7)

# 让模型预测测试集结果
pred_test_y = model.predict(test_x)
# 输出精准度：预测正确的个数/总预测个数
print((test_y == pred_test_y).sum() / pred_test_y.size)

# 绘制测试集图形
mp.pcolormesh(grid_x, grid_y, grid_z, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], s=80,
           c=test_y, cmap='brg')

交叉验证

由于数据集的划分有不确定性，若随机划分的样本正好处于某类特殊样本，那么得到的训练模型所预测的结果可信度将会受到质疑。所以需要进行多次交叉验证。把样本空间中所有的样本均分成n分，使用不同的训练集训练模型，对不同的测试集进行测试，输出指标得分。

sklearn提供了交叉验证相关API:

import sklearn.model_selection as ms
# 交叉验证
指标值数组 = ms.cross_val_score(
    模型，            # 原始模型
    输入集，输出集， 
    cv=5,           # 折叠数 （做5次交叉验证）
    scoring=指标名 
)

交叉验证指标

精确度（accuracy）：分类正确的样本数/总样本数
查准率（precision_weighted）：针对每一个类别，预测正确的样本数/预测出来的样本数。
召回率（recall_weighted）：针对每一个类别，预测正确的样本数/实际存在的样本数。
f1得分（f1_weighted）: 2 x 查准率 x 召回率/(查准率+召回率)

案例：

# 构建高斯朴素贝叶斯分类器
model = nb.GaussianNB()
# 先做交叉验证，看一下f1得分
ac = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='accuracy')
pw = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='precision_weighted')
rw = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='recall_weighted')
f1 = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='f1_weighted')
print(ac.mean(), pw.mean(), rw.mean(), f1.mean())

混淆矩阵

sklearn提供了混淆矩阵清晰的显示出每一类别的查准率、召回率等验证结果。每一行和每一列分别对应样本输出中的每一个类别，行表示实际类别，列表示预测类别。

	A类别	B类别	C类别
A类别	5	0	0
B类别	0	6	0
C类别	0	0	7

上述矩阵为理想的混淆矩阵。不理想的如下：

	A类别	B类别	C类别
A类别	3	1	1
B类别	0	4	2
C类别	0	0	7

获取模型分类结果的混淆矩阵相关API：

import sklearn.metrics as sm
m = sm.confusion_matrix(实际输出, 预测输出)

案例：

# 输出预测结果的混淆矩阵
m = sm.confusion_matrix(test_y, pred_test_y)
print(m)

mp.figure('M')
mp.imshow(m, cmap='gray')

机器学习DAY04

朴素贝叶斯分类

分类报告

sklearn.metrics提供了分类报告相关API，不仅可以得到混淆矩阵，还可以得到交叉验证的查准率、召回率、f1得分的结果。这样可以方便的分析出哪些样本是异常样本。

import sklearn.metrics as sm
# 获取分类报告
cr = sm.classification_report(实际输出, 预测输出)
print(cr)

决策树分类

决策树分类模型会找到与样本特征匹配的叶子节点，然后以投票的方式进行分类。

案例：基于决策树分类算法，训练模型，预测小汽车（car.txt）等级。需要注意的是每一个特征都需要使用标签编码做预处理。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
汽车评估 案例
"""
import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms

# 读取文本数据，对特征进行标签编码，
# 基于随机森林进行模型训练，完成交叉验证。
data = np.loadtxt('../ml_data/car.txt',
                  delimiter=',', dtype='U10')
data = data.T
encoders = []
train_x, train_y = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        train_x.append(
            encoder.fit_transform(data[row]))
    else:
        train_y = encoder.fit_transform(data[row])
    encoders.append(encoder)
train_x = np.array(train_x).T
# 基于随机森林分类器完成模型训练
model = se.RandomForestClassifier(
    max_depth=6, n_estimators=200,
    random_state=7)
print(ms.cross_val_score(
    model, train_x, train_y, cv=4,
    scoring='f1_weighted').mean())
model.fit(train_x, train_y)
# 造一些测试数据，进行测试。
data = [
    ['high', 'med', '5more', '4',
     'big', 'low', 'unacc'],
    ['high', 'high', '4', '4',
     'med', 'med', 'acc'],
    ['low', 'low', '2', '4',
     'small', 'high', 'good'],
    ['low', 'med', '3', '4',
     'med', 'high', 'vgood']]
# 对测试数据进行相同的标签编码，
# 才可以使用训练好的模型
data = np.array(data).T
test_x, test_y = [], []
for row in range(len(data)):
    encoder = encoders[row]
    if row < len(data) - 1:
        test_x.append(
            encoder.transform(data[row]))
    else:
        test_y = encoder.transform(
            data[row])
test_x = np.array(test_x).T
# 使用模型开始预测
pred_test_y = model.predict(test_x)
e = encoders[-1]  # 针对最后一列的标签编码器
print(e.inverse_transform(test_y))
print(e.inverse_transform(pred_test_y))

验证曲线

验证曲线 : 模型性能 = f(超参数)

验证曲线是超参数优选的一种解决方案，通过验证曲线相关API，可以得到相同模型在不同超参数下的模型性能得分，从而得知最优超参数。

import sklearn.model_selection as ms
train_scores, test_scores = ms.validation_curve(
    model,             # 初始模型
    输入集, 输出集,
    'n_estimators',  # 超参数名
    np.array([50, 60, 80, 100]), # 超参数取值列表
    cv=5    # 折叠数
)

train_scores的结构：

超参数	CV1	CV2	CV3	CV4	CV5
50	0.973	0.894	0.944	0.924	0.914
100	0.914	0.924	0.934	0.954	0.934
…	…	…	…	…	…

test_scores的结构与上述结构一致。

案例：小汽车案例中使用验证曲线选择超参数。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
汽车评估 案例  验证曲线选择超参数
"""
import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as mp


# 读取文本数据，对特征进行标签编码，
# 基于随机森林进行模型训练，完成交叉验证。
data = np.loadtxt('../ml_data/car.txt',
                  delimiter=',', dtype='U10')
data = data.T
encoders = []
train_x, train_y = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        train_x.append(
            encoder.fit_transform(data[row]))
    else:
        train_y = encoder.fit_transform(data[row])
    encoders.append(encoder)
train_x = np.array(train_x).T
'''
# 基于随机森林分类器完成模型训练
model = se.RandomForestClassifier(
    max_depth=6, random_state=7)
# 获取关于n_estimators的验证曲线
n_estimators = np.arange(50, 550, 50)
train_scores, test_scores = \
    ms.validation_curve(
        model, train_x, train_y,
        'n_estimators', n_estimators, cv=5)
# 每个超参数取值对应的得分
train_means = train_scores.mean(axis=1)
for p, s in zip(n_estimators, train_means):
    print(p, '->', s)

mp.figure('N_estimators', facecolor='lightgray')
mp.title('N_estimators', fontsize=16)
mp.xlabel('Estimators', fontsize=14)
mp.ylabel('Scores', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=":")
mp.plot(n_estimators, train_means,
        c='dodgerblue', label='Valid Curve')
mp.legend()
mp.show()


# 基于随机森林分类器完成模型训练
model = se.RandomForestClassifier(
    n_estimators=200, random_state=7)
# 获取关于max_depth的验证曲线
max_depths = np.arange(1, 11)
train_scores, test_scores = \
    ms.validation_curve(
        model, train_x, train_y,
        'max_depth', max_depths, cv=5)
# 每个超参数取值对应的得分
train_means = train_scores.mean(axis=1)
for p, s in zip(max_depths, train_means):
    print(p, '->', s)

mp.figure('Max_depth', facecolor='lightgray')
mp.title('Max_depth', fontsize=16)
mp.xlabel('Max_depths', fontsize=14)
mp.ylabel('Scores', fontsize=14)
mp.tick_params(labelsize=12)
mp.grid(linestyle=":")
mp.plot(max_depths, train_means,
        c='dodgerblue', label='Valid Curve')
mp.legend()
mp.show()
'''

# 基于随机森林分类器完成模型训练
model = se.RandomForestClassifier(
    max_depth=8,
    n_estimators=200, random_state=7)
print(ms.cross_val_score(
    model, train_x, train_y, cv=4,
    scoring='f1_weighted').mean())
model.fit(train_x, train_y)
# 造一些测试数据，进行测试。
data = [
    ['high', 'med', '5more', '4',
     'big', 'low', 'unacc'],
    ['high', 'high', '4', '4',
     'med', 'med', 'acc'],
    ['low', 'low', '2', '4',
     'small', 'high', 'good'],
    ['low', 'med', '3', '4',
     'med', 'high', 'vgood']]
# 对测试数据进行相同的标签编码，
# 才可以使用训练好的模型
data = np.array(data).T
test_x, test_y = [], []
for row in range(len(data)):
    encoder = encoders[row]
    if row < len(data) - 1:
        test_x.append(
            encoder.transform(data[row]))
    else:
        test_y = encoder.transform(
            data[row])
test_x = np.array(test_x).T
# 使用模型开始预测
pred_test_y = model.predict(test_x)
e = encoders[-1]  # 针对最后一列的标签编码器
print(e.inverse_transform(test_y))
print(e.inverse_transform(pred_test_y))

学习曲线

学习曲线：模型性能=f(训练集大小)

学习曲线相关API:

import sklearn.model_selection as ms
# 获取学习曲线
_, train_scores, test_scores = ms.learning_curve(
    model, # 初始模型
    输入集, 输出集,
    [0.9, 0.8, 0.7], # 训练集的大小
    cv=5    # 折叠数
)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
汽车评估 案例  学习曲线选择最优训练集大小
"""
import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms
import matplotlib.pyplot as mp


# 读取文本数据，对特征进行标签编码，
# 基于随机森林进行模型训练，完成交叉验证。
data = np.loadtxt('../ml_data/car.txt',
                  delimiter=',', dtype='U10')
data = data.T
encoders = []
train_x, train_y = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        train_x.append(
            encoder.fit_transform(data[row]))
    else:
        train_y = encoder.fit_transform(data[row])
    encoders.append(encoder)
train_x = np.array(train_x).T

# 基于随机森林分类器完成模型训练
model = se.RandomForestClassifier(
    max_depth=8,
    n_estimators=200, random_state=7)
# 使用学习曲线获取最优训练集大小
train_sizes = np.linspace(0.1, 1, 10)
_, train_scores, test_scores = \
    ms.learning_curve(model, train_x, train_y,
        train_sizes=train_sizes, cv=5)
train_means = train_scores.mean(axis=1)
for size, score in zip(train_sizes, train_means):
    print(size, '->', score)

mp.figure('Learning Curve')
mp.title('Learning Curve', fontsize=16)
mp.xlabel('Train Size', fontsize=12)
mp.ylabel('Curve Score', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(train_sizes, train_means, 'o-',
        c='dodgerblue', label='Curve')
mp.legend()
mp.show()

支持向量机(SVM)

支持向量机原理

核心理念：寻求最优分类边界

正确：对大部分样本可以正确的划分类别。

泛化：最大化支持向量间距。

公平：类别与支持向量等距。

简单：线性，直线、平面做类别分割。

基于核函数的升维变换

对于难以分类的一些样本，执行基于核函数的升维变换。增加新的特征，使得低维度空间中的线性不可分问题，变成高维度空间中的线性可分问题。

SVM提供的常用核函数

线性核函数 linear
多项式核函数 poly
径向基核函数 rbf

线性核函数

不通过核函数进行维度提升，仅在原始维度空间中寻求线性分类边界。

基于线性核函数的SVM分类器：

import sklearn.svm as svm
model = svm.SVC(kernel='linear')
mpdel.fit(train_x, train_y)

案例：使用线性核函数的SVM训练simple2.txt。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo05_svm_linear.py 线性核函数的SVM
"""
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

x, y = [], []
data = np.loadtxt('../ml_data/multiple2.txt',
                  delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]

train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=5)

# 基于线性核函数的支持向量机模型
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)

# 预测分类边界
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x, grid_y = np.meshgrid(
    np.linspace(l, r, n),
    np.linspace(l, r, n))
samples = np.column_stack(
    (grid_x.ravel(), grid_y.ravel()))
grid_z = model.predict(samples)
grid_z = grid_z.reshape(grid_x.shape)
# 使用模型预测测试集数据，输出分类报告
pred_test_y = model.predict(test_x)
cr = sm.classification_report(
    test_y, pred_test_y)
print(cr)

# 绘制分类边界
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=16)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=12)
mp.pcolormesh(grid_x, grid_y, grid_z,
              cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1],
           c=test_y, cmap='brg', s=80)
mp.show()

多项式核函数

通过多项式函数增加原始样本特征的高次方幂作为新的特征，在新的思考维度中对样本进行分类。

# 基于多项式函数扩展新的特征
model = svm.SVC(kernel='poly', degree=3)
model.fit(train_x, train_y)

径向基核函数

通过高斯分布函数增加原始样本特征的分布概率。

# 基于径向基核函数的支持向量机分类器
# C: 正则强度
# gamma: 正态分布曲线的标准差
model = svm.SVC(kernel='rbf', C=600, gamma=0.01)
model.fit(train_x, train_y)

样本类别均衡化

通过类别均衡化，使所占比例较小的样本权重较高，而所占比例较大的样本权重较低。以此平均化不同类别样本对分类模型的贡献，提高模型精度。

# 添加 class_weight='balanced' 实现样本类别均衡化
model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)

案例：基于线性核函数对imbalance.txt进行训练

# 基于线性核函数的支持向量机模型
model = svm.SVC(kernel='rbf', C=1000, gamma=0.01,
                class_weight='balanced')
model.fit(train_x, train_y)

置信概率

根据样本与分类边界的距离远近，对其预测类别的可信程度进行量化，离边界越近的样本，置信概率越低，反之，离边界越远的样本，置信概率越高。

获取每个样本的置信概率API：

# 在获取支持向量机模型时，给出超参数 probability=True
m = svm.SVC(kernel='', C=1, gamma=0.01, probability=True)
# 获取每个样本的置信概率
置信概率矩阵 = model.predict_proba(test_x)

置信概率矩阵格式：

	类别1	类别2
样本1	0.8	0.2
样本2	0.6	0.4

案例：修改径向基核函数的SVM案例，新增测试样本，输出置信概率。

# 新增测试点   输出置信概率， 绘制图像
prob_x = np.array([
    [2, 1.5],
    [8, 9],
    [4.8, 5.2],
    [4, 4],
    [2.5, 7],
    [7.6, 2],
    [5.4, 5.9]])
pred_prob_y = model.predict(prob_x)
probs = model.predict_proba(prob_x)
print(probs)

mp.scatter(prob_x[:, 0], prob_x[:, 1],
           c=pred_prob_y, cmap='jet_r',
           s=80, marker='D')
# 为每个点都加上备注: 1类别概率, 2类别概率
for i in range(len(probs)):
    mp.annotate(
        '{}% {}%'.format(
            round(probs[i, 0] * 100, 2),
            round(probs[i, 1] * 100, 2)),
        xy=(prob_x[i, 0], prob_x[i, 1]),
        xytext=(12, -12),
        textcoords='offset points',
        fontsize=9,
        bbox={'boxstyle': 'round,pad=0.6',
              'fc': 'orange', 'alpha': 0.8})

网格搜索

获取一个最优超参数的方式可以绘制验证曲线，但是验证曲线只能每次获取一个最优超参数。如果多个超参数有很多排列组合的话，就可以使用网格搜索寻求最优超参数组合。

在网格搜索过程中，针对每一个超参数组合，实例化给定的模型，做cv次交叉验证，将其中平均f1得分最高的超参数组合作为最佳选择，实例化模型对象。

网格搜索相关API：

import sklearn.model_selection as ms
# 返回的是已经使用了最优超参数组合的model对象
model = ms.GridSearchCV(
    model,             # 原始模型
    超参数组合列表,    # 使用列表的方式把所有组合列出
    cv=5             # 交叉验证折叠数
)
# 获取网格搜索过程中的每个参数组合
model.cv_results_['params']
# 获取王国搜索过程每个参数组合对应的平均测试分
model.cv_results_['mean_test_score']
# 获取最好的参数
model.best_params_        # 最优参数
model.best_scores_        # 最优得分
model.best_estimator_    # 最优模型

案例：修改置信概率案例，基于网格搜索寻的最优超参数。

# 基于径向基核函数的支持向量机模型
model = svm.SVC()
# 整理超参数列表，做网格搜索，寻找最优
params = [
    {'kernel': ['linear'],
     'C':[1, 10, 100, 1000]},
    {'kernel': ['poly'], 'C':[1],
     'degree':[2, 3]},
    {'kernel': ['rbf'], 'C':[1, 10, 100, 1000],
     'gamma':[1, 0.1, 0.01, 0.001]}]
model = ms.GridSearchCV(model, params, cv=5)
model.fit(train_x, train_y)

print(model.best_params_)
print(model.best_score_)
print(model.best_estimator_)

# 查看网格搜索过程的细节
for p, s in zip(
    model.cv_results_['params'],
        model.cv_results_['mean_test_score']):
    print(p, '->', s)

案例：事件预测

加载event.txt, 预测某个时间段是否会出现特殊事件。

机器学习DAY05

案例：事件预测

加载event.txt, 预测某个时间段是否会出现特殊事件。

案例：交通流量预测

加载traffic.txt，预测某个时间段某个路口的车流量。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo01_traffic.py 车流量预测
"""

import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm


class DigitEncoder():
    """自定义数字编码器"""

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)

data = []
data = np.loadtxt('../ml_data/traffic.txt',
                  delimiter=',', dtype='U20')
data = data.T
encoders, x = [], []
for row in range(len(data)):
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)

x = np.array(x).T
# 划分测试集训练集
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=7)
# 使用SVM回归器训练模型
model = svm.SVR(kernel='rbf', C=10)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))

# 执行业务预测
data = [['Tuesday', '13:35', 'San Francisco', 'yes']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    print(encoder)
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(pred_y)

聚类

分类（classification）与聚类（cluster）不同，分类是有监督学习模型，聚类属于无监督学习模型。聚类讲究实用一些算法把样本划分为n个群落。一般情况下，这种算法都需要计算欧式距离。

欧几里得距离
$$
P(x_1) - Q(x_2): |x_1-x_2| = \sqrt{(x_1 - x_2)^2} \
P(x_1, y_1) - Q(x_2, y_2): \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \
P(x_1, y_1, z_1) - Q(x_2, y_2, z_2): \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \
$$
两个样本对应特征值之差的平方和的平方根，即欧式距离，用来表示两个样本的相似性。

K均值算法

第一步：随机选择K个样本作为K个聚类中心，计算每个样本到各个聚类中的欧氏距离，将该样本分配到与之最近的聚类中心所在类别中。

第二步：根据第一步得到的聚类划分，分别计算每个聚类的几何中心，将几何中心作为新的聚类中心，重复第一步，直到计算所得几何中心与聚类中心重合或接近重合为止。

注意：

聚类数必须事先已知，借助某些评估指标，优选最好的聚类数量。
聚类中心的初始选择会影响到最终聚类划分的结果，初始中心尽量选择距离较远的样本。

K均值相关API：

import sklearn.cluster as sc
# n_clusters：聚类的类别数量。
model = sc.KMeans(n_clusters=4)
# 给出训练样本
model.fit(x)
# 预测样本的类别
c = model.predict(x)
# 获取训练结果的聚类中心
centers = model.cluster_centers_

案例：加载multiple3.txt, 完成KMeans聚类。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo02_kmeans.py  聚类案例
"""
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp

x = np.loadtxt('../ml_data/multiple3.txt',
               delimiter=',')
# KMeans聚类模型
model = sc.KMeans(n_clusters=4)
model.fit(x)
pred_c = model.predict(x)
centers = model.cluster_centers_
print(pred_c)

# 预测分类边界
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x, grid_y = np.meshgrid(
    np.linspace(l, r, n),
    np.linspace(b, t, n))
samples = np.column_stack(
    (grid_x.ravel(), grid_y.ravel()))
grid_z = model.predict(samples)
grid_z = grid_z.reshape(grid_x.shape)

# 画图
mp.figure('K-Means Cluster', facecolor='lightgray')
mp.title('K-Means Cluster', fontsize=14)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
# 绘制分类边界
mp.pcolormesh(grid_x, grid_y, grid_z, cmap='gray')
# 绘制所有样本
mp.scatter(x[:, 0], x[:, 1], c=pred_c, s=60,
           cmap='brg', label='samples')
mp.scatter(centers[:, 0], centers[:, 1],
           c='orangered', marker='+', s=300,
           label='Centers')
mp.legend()
mp.show()

图像量化

KMeans聚类算法可以应用于图像量化领域。通过KMeans算法把一张图像所包含的颜色值进行聚类划分，求每一类别的平均值后，再重新生成新的图像。可以达到图像降维的目的，更有利于图像识别。这个过程称为图像量化。

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo03_image.py 图像量化
"""
import numpy as np
import scipy.misc as sm
import scipy.ndimage as sn
import sklearn.cluster as sc
import matplotlib.pyplot as mp

img = sm.imread('../ml_data/lily.jpg', True)
x = img.reshape(-1, 1)  # n行1列
model = sc.KMeans(n_clusters=4)
model.fit(x)
y = model.labels_   # 返回聚类标签
centers = model.cluster_centers_.ravel()
# 把y数组中相应元素的值改为centers中的值
# 生成新图片，处理后，图片中只包含centers范围内的值
print(centers.shape,  centers)
print(y.shape, y)
img2 = centers[y].reshape(img.shape)

mp.figure('Image')
mp.subplot(121)
mp.axis('off')
mp.imshow(img, cmap='gray')
mp.subplot(122)
mp.axis('off')
mp.imshow(img2, cmap='gray')
mp.tight_layout()
mp.show()

均值漂移算法

首先假定样本空间中的每个聚类均服从某种已知的概率分布规则，然后用不同的概率密度函数拟合样本中的统计直方图，不断移动密度函数的中心位置，直到获得最佳的拟合效果为止。这些概率密度函数曲线的峰值点就是聚类的中心，再根据每个样本与各个中心的距离，选择最近的聚类中心所属类别作为该样本的类别。

均值漂移算法的特点：

聚类数不必事先已知，算法会自动识别出统计直方图的中心数量。
聚类中心不依据于最初假定，聚类划分的结果相对稳定。
样本空间应该服从某种概率分布规则，否则算法的准确性将会大大折扣。

均值漂移算法相关API：

# 获取量化带宽对象
bw = sc.estimate_bandwidth(
    x,     # 输入集
    n_samples=len(x),    # 样本数量
    quantile=0.1    # 量化带宽（统计直方图的宽度）
)
model = sc.MeanShift(bandwidth=bw, bin_seeding=True)
model.fit(x)
model.predict(x)

案例： multiple3.txt，使用均值漂移做聚类划分。

x = np.loadtxt('../ml_data/multiple3.txt',
               delimiter=',')
# KMeans聚类模型
model = sc.KMeans(n_clusters=4)
model.fit(x)
pred_c = model.predict(x)
centers = model.cluster_centers_
print(pred_c)
....

凝聚层次算法

首先假定每个样本都是一个独立的聚类，如果统计出来的聚类数大于期望的聚类数，则从每个样本出发，寻找离自己最近的另一个样本，与之聚集，形成更大的聚类。同时令总聚类数减少，不断重复以上过程，直到统计出来的聚类数达到期望值为止。

凝聚层次算法的特点：

聚类数k必须事先已知。借助某些评估指标，优选最好的聚类数。
没有聚类中心概念，因此只能在训练集中划分聚类，不能对训练集以外的未知样本确定其归属。
在确定被凝聚的样本时，除了以距离作为条件外，还可以根据连续性来确定被聚集的样本。

凝聚层次算法相关API：

# 构建凝聚层次聚类模型
model = sc.AgglomerativeClustering(n_clusters=4)
# 训练+预测
pred_y = model.fit_predict(x)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo05_agglo.py  凝聚层次聚类
"""
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp

x = np.loadtxt('../ml_data/multiple3.txt',
               delimiter=',')
# 凝聚层次聚类模型
model = sc.AgglomerativeClustering(n_clusters=4)
pred_c = model.fit_predict(x)
print(pred_c)

# 画图
mp.figure('Agglo Cluster', facecolor='lightgray')
mp.title('Agglo Cluster', fontsize=14)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
# 绘制所有样本
mp.scatter(x[:, 0], x[:, 1], c=pred_c, s=60,
           cmap='brg', label='samples')
mp.legend()
mp.show()

使用凝聚层次算法，根据连续性来确定被聚集的样本类别划分相关API：

# linkage='average'： 无连续性的凝聚层次聚类器
model = sc.AgglomerativeClustering(
            linkage='average', n_clusters=4)

# 基于连续性的凝聚层次聚类器
# 近邻筛选器
conn = nb.kneighbors_graph(x, 10, include_self=False)
model = sc.AgglomerativeClustering(
            linkage='average', n_clusters=4,
            conntivity=conn)

轮廓系数

轮廓系数用于评估聚类器的好坏。好的聚类：内密外疏，同一个聚类内部的样本要足够密集，不同的聚类之间样本要足够疏远。

轮廓系数的计算规则：针对样本空间中的一个特定样本，计算它与所在聚类其他样本的平均距离a，以及该样本与距离最近的另一个聚类中所有样本的平均距离b，该样本的轮廓系数为(b-a)/max(a, b)。将整个样本空间中所有样本的轮廓系数取算数平均值，作为聚类划分的性能指标。

轮廓系数的取值区间：[-1, 1]。 -1代表分类效果差，1代表分类效果好。0代表聚类重叠,没有很好的划分。

轮廓系数相关API:

import sklearn.metrics as sm
# v:返回样本空间所有样本的平均轮廓系数
# metric:euclidean 使用欧几里得距离算法
v = sm.silhouette_score(
    输入集，输出集，
    sample_size=样本数,
    metric=距离算法
)

案例：

# 输出轮廓系数指标
v = sm.silhouette_score(
    x, pred_c, sample_size=len(x), metric='euclidean')
print(v)

DBSCAN算法

从样本空间中任意选择一个样本，以事先给定的半径做圆，凡被该圆圈中的样本都与该样本处于相同的聚类，以这些被圈中的样本为圆心继续画圆，重复以上过程，不断扩大被圈中样本的规模，直到再也没有新的样本加入为止，至此即得到一个聚类。在剩余样本中，重复以上过程，直到耗尽样本空间中的所有样本为止。

DBSCAN算法的特点：

事先给定的半径会影响最后的聚类效果，可以借助轮廓系数选择较优方案
根据聚类的形成过程，把样本细分为三类：
1. 外周样本：被其他样本聚集到某个聚类中，但无法再引入新样本的样本。
2. 孤立样本：据类中的样本数低于所设定的下限，则不称其为聚类，称其为孤立样本。
3. 核心样本：除了外周样本和孤立样本以外的样本。

DBSCAN算法相关API：

# 构建DBSCAN聚类模型
# eps：圆的半径
# min_samples：聚类样本数量的下限，低于该数值则为孤立样本。
model = sc.DBSCAN(eps=0.5, min_samples=3)
model.fit(x)
# 获取所有核心样本的index
model.core_sample_indices_

案例：读取perf.txt，实现聚类划分。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo06_dbscan.py  dbscan算法实现聚类划分
"""
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp
import sklearn.metrics as sm

x = np.loadtxt('../ml_data/perf.txt',
               delimiter=',')
# 基于轮廓系数优先eps超参数
eps, scores, models =     \
    np.linspace(0.3, 1.2, 10), [], []
for epsilon in eps:
    model = sc.DBSCAN(eps=epsilon, min_samples=5)
    model.fit(x)
    # 输出轮廓系数指标
    score = sm.silhouette_score(
        x, model.labels_, sample_size=len(x),
        metric='euclidean')
    scores.append(score)
    models.append(model)
scores = np.array(scores)
# 最高分的下标
best_index = scores.argmax()
print(eps[best_index])
print(scores[best_index])

# 获取最优模型
model = models[best_index]
pred_y = model.fit_predict(x)
print(pred_y)
# 构建核心样本的掩码
core_mask = np.zeros(len(x), dtype=bool)
core_mask[model.core_sample_indices_] = True
# 获取孤立样本的掩码
offset_mask = model.labels_ == -1
# 外周样本的掩码
p_mask = ~(core_mask | offset_mask)

# 画图
mp.figure('DBSCAN Cluster', facecolor='lightgray')
mp.title('DBSCAN Cluster', fontsize=14)
mp.xlabel('x', fontsize=12)
mp.ylabel('y', fontsize=12)
mp.tick_params(labelsize=10)
labels = model.labels_
mp.scatter(x[core_mask][:, 0],
           x[core_mask][:, 1],
           c=labels[core_mask], cmap='brg',
           s=80, label='core')
mp.scatter(x[p_mask][:, 0], x[p_mask][:, 1],
           c=labels[p_mask], cmap='brg',
           alpha=0.8, label='p', s=80)
mp.scatter(x[offset_mask][:, 0],
           x[offset_mask][:, 1], s=80,
           c='black', label='offset')

mp.legend()
mp.show()

机器学习DAY06

自然语言处理(NLP)

Siri：1. 听 2. 懂 3. 思考 4. 组织语言 5. 回答

语音识别
自然语言处理 - 语义分析
逻辑分析 - 综合业务场景上下文
自然语言处理 - 分析结果生成自然语言文本
语音合成

自然语言处理的常用处理过程：

先针对训练文本进行分词处理（词干提取、原型提取），统计词频，通过词频-逆文档频率算法获得该词对某种语义的贡献，根据每个词的贡献力度，构建有监督分类学习模型，把测试样本，交给模型处理，得到语义类别。

自然语言工具包 - NLTK

文本分词

分词处理相关API：

import nltk.tokenize as tk
# 把样本按照句子进行拆分  sent_list：句子列表
sent_list = tk.sent_tokenize(text)
# # 把样本字符串按照单词进行拆分  word_list：单词列表
word_list = tk.word_tokenize(text)
# WordPunctTokenizer:分词器对象
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(text)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo02_tk.py  文本分词
"""
import nltk.tokenize as tk
doc = "Are you curious about tokenization? "\
    "Let's see how it works! " \
    "We need to analyze a couple of " \
    "sentences with punctuations to see "\
    "it in action."
# 拆出句子
sent_list = tk.sent_tokenize(doc)
for i, sent in enumerate(sent_list):
    print('%2d' % (i + 1), sent)

print('-' * 45)
# 拆分单词
word_list = tk.word_tokenize(doc)
for i, word in enumerate(word_list):
    print('%2d' % (i + 1), word)
print('-' * 45)
# 拆分单词  WordPunctTokenizer
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(doc)
for i, word in enumerate(word_list):
    print('%2d' % (i + 1), word)

词干提取

分词后的单词词性与时态对语义分析没有影响，所以需要对单词进行词干提取。

词干提取相关API:

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

stemmer = pt.PorterStemmer() # 波特词干提取器（宽松）
stemmer = lc.LancasterStemmer() # 朗卡斯特词干提取器（严格）
stemmer = sb.SnowballStemmer('english')# 思诺博词干提取器（中庸）

r = stemmer.stem('playing') # 提取词干，返回结果

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo03_stem.py 词干提取
"""
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves',
         'playing', 'is', 'dog', 'the',
         'beaches', 'grounded', 'dreamt',
         'envision']
pt_stemmer = pt.PorterStemmer()  # 波特词干提取器（宽松）
lc_stemmer = lc.LancasterStemmer()  # 朗卡斯特词干提取器（严格）
sb_stemmer = sb.SnowballStemmer('english')  # 思诺博词干提取器（中庸）
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' %
          (word, pt_stem, lc_stem, sb_stem))

词性还原

与词干提取作用类似，词性还原更利于人工二次处理。因为有些词干并非正确的单词，人工阅读更麻烦。此行还原可以把名词的复数形式改为单数，动词特殊形式恢复为原型。

import nltk.stem as ns
# 词性还原器
lemmatizer = ns.WordNetLemmatizer()
# 还原名词
n_lemma = lemmatizer.lemmatize(word, pos='n')
# 还原动词
v_lemma = lemmatizer.lemmatize(word, pos='v')

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo03_lemm.py 词性还原
"""
import nltk.stem as ns

words = ['table', 'probably', 'wolves',
         'playing', 'is', 'dog', 'the',
         'beaches', 'grounded', 'dreamt',
         'envision']
# 词性还原器
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n = lemmatizer.lemmatize(word, pos='n')
    v = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' %
          (word, n, v))

词袋模型

一句话的语义很大程度取决于某个单词出现的次数，所以可以把句子中所有可能出现的单词作为特征名，每一个句子为一个样本，单词在句子中出现的次数为特征值构建数学模型。该模型即称为词袋模型。

The brown dog is running. The black dog is in the black room. Running in the room is forbidden.

The brown dog is running.
The black dog is in the black room.
Running in the room is forbidden.

the	brown	dog	is	running	black	in	room	forbidden
1	1	1	1	1	0	0	0	0
2	0	1	1	0	2	1	1	0
1	0	0	1	1	0	1	1	1

构建词袋模型相关API：

import sklearn.feature_extraction.text as ft
# 词袋模型对象
cv = ft.CountVectorizer()
bow = cv.fit_transform(sent)
# 获取所有特征名（词袋的表头，所有的单词）
words = cv.get_feature_names()

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo05_bow.py 测试词袋模型
"""
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. '    \
    'The black dog is in the black room. '\
    'Running in the room is forbidden.'
# 分解所有句子
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
print(bow.toarray())
words = cv.get_feature_names()
print(words)

词频（TF）

单词在句子中出现的次数除以句子的总词数称为词频。即一个单词在一个句子中出现的频率。词频相比单词出现的次数更可以客观的评估单词对一句话语义的贡献度。对词袋矩阵归一化处理即可得到单词的词频。

案例：

tf = sp.normalize(bow, norm='l1')
print(tf.toarray())

文档频率（DF）

含有某个单词的文档样本数量 / 总文档样本数量

文档频率越低，代表某个单词对语义的贡献越高。

逆文档频率（IDF）

总文档样本数量 / 含有某个单词的文档样本数量

词频-逆文档频率（TF-IDF）

词频矩阵中的每一个元素，乘以相应单词的逆文档频率，其值越大说明该词对样本语义的贡献度越大，根据每个词的贡献力度，构建学习模型。

获取TF-IDF矩阵的API：

# 首先获取词袋模型矩阵
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toArray()
# 获取TF-IDF模型训练器
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)

案例：

tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(tfidf.toarray())

文本分类（主题识别）

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo08_news.py 主题识别
"""
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

train = sd.load_files(
    '../ml_data/20news', encoding='latin1',
    shuffle=True, random_state=7)
# load_file将会在文件夹下所有文件
train_data = train.data
train_y = train.target
# 获取所有类别标签
categories = train.target_names
# 构建词袋模型
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
# 构建TFIDF矩阵
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(train_bow)
# 基于多项分布的朴素贝叶斯分类器进行模型训练
model = nb.MultinomialNB()
model.fit(train_x, train_y)

# 测试模型：
test_data = [
    'The curveballs of right handed '
    'pitchers tend to curve to the left.',
    'Caesar cipher is an ancient form '
    'of encryption.',
    'This two-wheeler is really good '
    'on slippery roads']
test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pred_test_y = model.predict(test_x)

for sent, i in zip(test_data, pred_test_y):
    print(sent, '->', categories[i])

nltk分类器

nltk提供了朴素贝叶斯分类器方便的处理自然语言相关的分类问题。并且可以自动处理词袋、TFIDF矩阵的整理，完成模型训练，最终实现类别预测。

情感分析

分析语料库中movie_reviews文档，通过正面即负面的评价进行自然语言训练，实现情感分析。

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo09_movie.py  情感分析
"""
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

pdata = []
# 获取pos目录下的所有文件id
fileids = nc.movie_reviews.fileids('pos')
# 整理所有正面评论的单词，存入pdata列表
for fileid in fileids:
    sample = {}
    # 对fileid该文件执行分词操作
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))

ndata = []
# 获取neg目录下的所有文件id
fileids = nc.movie_reviews.fileids('neg')
# 整理所有正面评论的单词，存入ndata列表
for fileid in fileids:
    sample = {}
    # 对fileid该文件执行分词操作
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))

# 拆分训练集与测试集 (80%作为训练)
pnumb = int(0.8 * len(pdata))
nnumb = int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# 基于朴素贝叶斯分类器开始训练
model =\
    cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)

# 模拟业务场景：
reviews = [
    'It is an amazing movie.',
    'This is a dull movie, I would never '
    'recommend it to anyone.',
    'The cinematography is pretty great '
    'in this movie.',
    'The direction was terrible and the '
    'story was all over the place.']
for review in reviews:
    sample = {}
    words = review.split()
    for word in words:
        sample[word] = True
    pcls = model.classify(sample)
    print(review, '->', pcls)

语音识别

通过傅里叶变换，将时间域的声音函数分解为一系列不同频率的正弦函数的叠加，通过频率谱线的特殊分布，建立音频内容和文本的对应关系，以此作为模型训练的基础。

案例：freq.wav

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo10_freq.py 音频分解
"""
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp

sample_rate, sigs = \
    wf.read('../ml_data/freq.wav')
print(sample_rate)
print(sigs.shape, sigs.dtype)
# x坐标
times = np.arange(len(sigs)) / sample_rate
# 傅里叶变换，获取拆解出的正弦波频率与能量信息
freqs = nf.fftfreq(sigs.size, 1 / sample_rate)
ffts = nf.fft(sigs)
pows = np.abs(ffts)
# 绘制两个图像
mp.figure('Audio', facecolor='lightgray')
mp.subplot(121)
mp.title('Time Domain')
mp.xlabel('Time', fontsize=12)
mp.ylabel('Signal', fontsize=12)
mp.grid(linestyle=':')
mp.plot(times, sigs, c='dodgerblue')
# 绘制频域图像
mp.subplot(122)
mp.title('Frequency Domain')
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pow', fontsize=12)
mp.grid(linestyle=':')
mp.plot(freqs[freqs > 0], pows[freqs > 0],
        c='orangered')

mp.tight_layout()
mp.show()

语音识别

梅尔频率倒谱系数（MFCC）：对声音做傅里叶变换后，发现与声音内容密切相关的13个特殊频率所对应的能量分布。可以使用MFCC矩阵作为语音识别的特征。基于隐马尔科夫模型进行模式识别，找到测试样本最匹配的声音模型，从而识别语音内容。

MFCC相关API：

import scipy.io.wavfile as wf
import python_speech_features as sf

# 读取音频文件获取采样率及每个采样点的值
sample_rate, sigs = wf.read('../data/freq.wav')
# 交给语音特征提取器获取该拼音的梅尔频率倒普矩阵
mfcc = sf.mfcc(sigs, sample_rate)

案例：比较不同音频的mfcc矩阵。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo11_mfcc.py
"""
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import python_speech_features as sf
import matplotlib.pyplot as mp

sample_rate, sigs = \
    wf.read('../ml_data/speeches/training'
            '/banana/banana01.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)

mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.xlabel('Sample', fontsize=12)
mp.ylabel('Feature', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()

隐马尔可夫模型相关API：

import hmmlearn.hmm as hl
# 构建隐马模型
# n_components：用几个高斯分布函数拟合样本数据
# convariance_type： 使用相关矩阵的辅对角线进行相关性比较
# n_iter：最大迭代上限
model = hl.GaussianHMM(
    n_components=4,     
    convariance_type='diag',
    n_iter=1000
)
model.fit(mfccs)
score = model.score(test_mfccs)

机器学习DAY07

语音识别

案例：freq.wav

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo10_freq.py 音频分解
"""
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp

sample_rate, sigs = \
    wf.read('../ml_data/freq.wav')
print(sample_rate)
print(sigs.shape, sigs.dtype)
# x坐标
times = np.arange(len(sigs)) / sample_rate
# 傅里叶变换，获取拆解出的正弦波频率与能量信息
freqs = nf.fftfreq(sigs.size, 1 / sample_rate)
ffts = nf.fft(sigs)
pows = np.abs(ffts)
# 绘制两个图像
mp.figure('Audio', facecolor='lightgray')
mp.subplot(121)
mp.title('Time Domain')
mp.xlabel('Time', fontsize=12)
mp.ylabel('Signal', fontsize=12)
mp.grid(linestyle=':')
mp.plot(times, sigs, c='dodgerblue')
# 绘制频域图像
mp.subplot(122)
mp.title('Frequency Domain')
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pow', fontsize=12)
mp.grid(linestyle=':')
mp.plot(freqs[freqs > 0], pows[freqs > 0],
        c='orangered')

mp.tight_layout()
mp.show()

语音识别

MFCC相关API：

import scipy.io.wavfile as wf
import python_speech_features as sf

# 读取音频文件获取采样率及每个采样点的值
sample_rate, sigs = wf.read('../data/freq.wav')
# 交给语音特征提取器获取该拼音的梅尔频率倒普矩阵
mfcc = sf.mfcc(sigs, sample_rate)

案例：比较不同音频的mfcc矩阵。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo11_mfcc.py
"""
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import python_speech_features as sf
import matplotlib.pyplot as mp

sample_rate, sigs = \
    wf.read('../ml_data/speeches/training'
            '/banana/banana01.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)

mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.xlabel('Sample', fontsize=12)
mp.ylabel('Feature', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()

隐马尔可夫模型相关API：

import hmmlearn.hmm as hl
# 构建隐马模型
# n_components：用几个高斯分布函数拟合样本数据
# convariance_type： 使用相关矩阵的辅对角线进行相关性比较
# n_iter：最大迭代上限
model = hl.GaussianHMM(
    n_components=4,     
    convariance_type='diag',
    n_iter=1000
)
model.fit(mfccs)
score = model.score(test_mfccs)

案例：遍历文件夹，使用多个隐马模型训练所有音频的mfcc矩阵。然后使用测试集验证训练结果。

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo01_hmm.py 语音识别
"""
import os
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl


def search_files(directory):
    """
    搜索directory目录下的所有wav文件，返回字典
    """
    directory = os.path.normpath(directory)
    objects = {}
    for curdir, subdirs, files in \
            os.walk(directory):
        for file in files:
            if file.endswith('.wav'):
                label = curdir.split(
                    os.path.sep)[-1]
                if label not in objects:
                    objects[label] = []
                path = os.path.join(curdir, file)
                objects[label].append(path)
    return objects

train_samples = \
    search_files('../ml_data/speeches/training')
print(train_samples)

# 整理训练集数据，把每一个类别中音频的mfcc
# 整理入训练集中.
train_x, train_y = [], []
for label, filenames in train_samples.items():
    mfccs = np.array([])
    for filename in filenames:
        sample_rate, sigs = wf.read(filename)
        mfcc = sf.mfcc(sigs, sample_rate)
        if len(mfccs) == 0:
            mfccs = mfcc
        else:
            mfccs = np.append(
                mfccs, mfcc, axis=0)
    train_x.append(mfccs)
    train_y.append(label)

# 基于隐马模型，进行训练，把所有训练结果model
# 都存起来，供以后使用
models = {}
for mfccs, label in zip(train_x, train_y):
    model = hl.GaussianHMM(
        n_components=3, covariance_type='diag',
        n_iter=1000)
    models[label] = model.fit(mfccs)

# 读取测试集中的文件，输出模式匹配的结果
test_samples = \
    search_files('../ml_data/speeches/testing')
test_x, test_y = [], []
for label, filenames in test_samples.items():
    mfccs = np.array([])
    for filename in filenames:
        sample_rate, sigs = wf.read(filename)
        mfcc = sf.mfcc(sigs, sample_rate)
        if len(mfccs) == 0:
            mfccs = mfcc
        else:
            mfccs = np.append(
                mfccs, mfcc, axis=0)
    test_x.append(mfccs)
    test_y.append(label)

# 让每一个隐马模型验证每一个测试样本的匹配度
pred_test_y = []
for mfccs in test_x:
    best_score, best_label = None, None
    for label, model in models.items():
        score = model.score(mfccs)
        if (best_score is None) or \
                (best_score < score):
            best_score, best_label = score, label
    pred_test_y.append(best_label)
print(test_y)
print(pred_test_y)

图像识别

OpenCV基础

openCV是一个开源的计算机视觉库。提供了很多图像处理的常用工具。

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo02_cv.py  opencv基础
"""
import numpy as np
import cv2 as cv

img = cv.imread('../ml_data/forest.jpg')
cv.imshow('Img', img)
# 显示图像每个颜色通道的信息
blue = np.zeros_like(img)
# 仅保留原数组的蓝色通道
blue[:, :, 0] = img[:, :, 0]
cv.imshow('blue', blue)
green = np.zeros_like(img)
green[:, :, 1] = img[:, :, 1]
cv.imshow('green', green)
red = np.zeros_like(img)
red[:, :, 2] = img[:, :, 2]
cv.imshow('red', red)
# 图像裁剪
h, w = img.shape[:2]
l, t = int(w / 4), int(h / 4)
r, b = int(w * 3 / 4), int(h * 3 / 4)
cropped = img[t:b, l:r]
cv.imshow('Cropped', cropped)
# 图像缩小  给出目标图像的w与h即可
print(w, h)
sc1 = cv.resize(
    img, (int(w / 4), int(h / 4)),
    interpolation=cv.INTER_LINEAR)
cv.imshow('Scale1', sc1)
# 图像放大 若不给出w与h，可以给出x、y缩放比例
sc2 = cv.resize(
    sc1, None, fx=4, fy=4,
    interpolation=cv.INTER_LINEAR)
cv.imshow('Scale2', sc2)
cv.waitKey()

# 图像保存
cv.imwrite('../ml_data/blue.jpg', blue)
cv.imwrite('../ml_data/green.jpg', green)
cv.imwrite('../ml_data/red.jpg', red)
cv.imwrite('../ml_data/sc2.jpg', sc2)

边缘检测

物体的边缘检测是物体识别常用的手段，边缘检测常用亮度梯度的方法。通过识别亮度梯度变化最大的像素点从而检测出物体的边缘。

边缘识别相关API：

# 索贝尔边缘识别
# cv.CV_64F:图片像素为整型，转为浮点型避免精度损失
# 1:水平方向做索贝尔偏微分
# 0:垂直方向不做索贝尔偏微分
# ksize: 索贝尔卷积核为5*5
cv.Sobel(img, cv.CV_64F, 1, 0, ksize=5)
# 拉普拉斯边缘识别
cv.Laplacian(img, cv.CV_64F)
# Canny边缘识别
# 50: 水平方向的阈值  240: 垂直方向的阈值
cv.Canny(img, 50, 240)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
sobel.py 边缘识别
"""
import cv2 as cv
img = cv.imread('../ml_data/chair.jpg',
                cv.IMREAD_GRAYSCALE)
cv.imshow('image', img)
# 索贝尔边缘识别器
s1 = cv.Sobel(img, cv.CV_64F, 1, 0, ksize=5)
cv.imshow('H-Sobel', s1)
s2 = cv.Sobel(img, cv.CV_64F, 0, 1, ksize=5)
cv.imshow('V-Sobel', s2)
s3 = cv.Sobel(img, cv.CV_64F, 1, 1, ksize=5)
cv.imshow('Sobel', s3)
# 拉普拉斯边缘识别
laplacian = cv.Laplacian(img, cv.CV_64F)
cv.imshow('laplacian', laplacian)
# canny边缘识别
canny = cv.Canny(img, 50, 200)
cv.imshow('Canny', canny)


cv.waitKey()

亮度提升

OpenCV提供了直方图均衡化的方式实现亮度提升，亮度提升后更有利于边缘识别模型的训练。

# 针对灰度图像 做直方图均衡化处理
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
equalized_gray = cv.equalizeHist(gray)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo04_equalhist.py  直方图均衡化
"""
import cv2 as cv
img = cv.imread('../ml_data/sunrise.jpg')
# 直方图均衡化
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
cv.imshow('gray', gray)
equalized_gray = cv.equalizeHist(gray)
cv.imshow('equalized_gray', equalized_gray)
# 彩色图提亮度  YUV：亮度，色度，饱和度
yuv = cv.cvtColor(img, cv.COLOR_BGR2YUV)
yuv[:, :, 0] = cv.equalizeHist(yuv[:, :, 0])
img = cv.cvtColor(yuv, cv.COLOR_YUV2BGR)
cv.imshow('equalized_color', img)

cv.waitKey()

角点检测

角点 - 平直棱线的交汇点。（颜色梯度方向改变的像素点的位置）

OpenCV提供的角点检测相关API：

gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
# Harris角点检测器
# 边缘的水平方向、垂直方向颜色值改变超过阈值7,5时即为边缘
# 边缘线的方向改变超过阈值0.04即为一个角点
corners = cv.cornerHarris(gray, 7, 5, 0.04)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo05_harris.py  角点检测
"""
import cv2 as cv

img = cv.imread('../ml_data/box.png')
cv.imshow('Image', img)
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
cv.imshow('Image', gray)
# 返回所有的角点坐标
corners = cv.cornerHarris(gray, 7, 5, 0.1)
mixture = img.copy()
print(corners)
print(corners.shape)
print(corners.max())
# 对彩色图像做掩码，找到角点把颜色改为红色
mixture[corners > corners.max() * 0.01] =\
    [0, 0, 255]
cv.imshow('Mixture', mixture)

cv.waitKey()

特征点检测

特征点检测集合了边缘检测与角点检测，从而可以识别出图像的特征点。

常用的特征点检测：STAR特征点检测、SIFT特征点检测

star特征值检测相关API：

gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
# star特征点检测  创建start特征点检测器
star = cv.xfeatures2d.StarDetector_create()
# 获取检测到的所有特征点
keypoints = star.detect(gray)
# 绘制特征点
mixture = img.copy()
f = cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS
# (原图，特征点，目标图像，flag)
cv.drawKeypoints(img, keypoints, mixture, flags=f)
cv.imshow('Mixture', mixture)

sift特征值检测相关API：

# sift特征点检测  创建sift特征点检测器
sift = cv.xfeatrures2d.SIFT_create()
# 获取检测到的所有特征点
keypoints = sift.detect(gray)

特征值矩阵

图像的特征值矩阵（描述矩阵）记录了图像的特征点以及每个特征点的梯度信息。相似图像的特征值矩阵也相似。这样只要有足够多的样本，就可以基于隐马尔科夫模型进行图像内容的识别。

特征值矩阵相关API：

# sift特征点检测  创建sift特征点检测器
sift = cv.xfeatrures2d.SIFT_create()
# 获取检测到的所有特征点
keypoints = sift.detect(gray)
# 获取图像的特征值矩阵
_, desc = sift.compute(gray, keypoints)

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo08_feat.py  特征值矩阵
"""
import cv2 as cv
import matplotlib.pyplot as mp

img = cv.imread('../ml_data/green.jpg')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
# sift特征点检测  创建sift特征点检测器
sift = cv.xfeatures2d.SIFT_create()
# 获取检测到的所有特征点
keypoints = sift.detect(gray)
# 获取特征值矩阵
_, desc = sift.compute(gray, keypoints)
mp.matshow(desc.T, cmap='jet')
mp.title('Description', fontsize=18)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.tight_layout()
mp.show()

物体识别

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo09_jpg.py   物体识别
"""
import os
import numpy as np
import cv2 as cv
import hmmlearn.hmm as hl


def search_files(directory):
    """
    搜索directory目录下的所有jpg文件，返回字典
    """
    directory = os.path.normpath(directory)
    objects = {}
    for curdir, subdirs, files in \
            os.walk(directory):
        for file in files:
            if file.endswith('.jpg'):
                label = curdir.split(
                    os.path.sep)[-1]
                if label not in objects:
                    objects[label] = []
                path = os.path.join(curdir, file)
                objects[label].append(path)
    return objects

train_samples = search_files(
    '../ml_data/objects/training')

# 加载训练集样本数据，训练模型，存储模型
train_x, train_y = [], []
for label, filenames in train_samples.items():
    descs = np.array([])
    for filename in filenames:
        image = cv.imread(filename)
        gray = cv.cvtColor(
            image, cv.COLOR_BGR2GRAY)
        # 整理图片的大小 若宽度较小，则把宽度
        # 缩放到200，高度自适应。若高度较小，
        # 则把高度缩放到200，宽度自适应。
        h, w = gray.shape[:2]
        f = 200 / min(h, w)
        gray = cv.resize(gray, None, fx=f, fy=f)
        sift = cv.xfeatures2d.SIFT_create()
        keypoints = sift.detect(gray)
        _, desc = sift.compute(gray, keypoints)
        if len(descs) == 0:
            descs = desc
        else:
            descs = np.append(
                descs, desc, axis=0)
    train_x.append(descs)
    train_y.append(label)

# 训练隐马模型
models = {}
for descs, label in zip(train_x, train_y):
    model = hl.GaussianHMM(
        n_components=4, covariance_type='diag',
        n_iter=1000)
    models[label] = model.fit(descs)

# 模型测试

test_samples = search_files(
    '../ml_data/objects/testing')
test_x, test_y = [], []
for label, filenames in test_samples.items():
    descs = np.array([])
    for filename in filenames:
        image = cv.imread(filename)
        gray = cv.cvtColor(
            image, cv.COLOR_BGR2GRAY)
        # 整理图片的大小 若宽度较小，则把宽度
        # 缩放到200，高度自适应。若高度较小，
        # 则把高度缩放到200，宽度自适应。
        h, w = gray.shape[:2]
        f = 200 / min(h, w)
        gray = cv.resize(gray, None, fx=f, fy=f)
        sift = cv.xfeatures2d.SIFT_create()
        keypoints = sift.detect(gray)
        _, desc = sift.compute(gray, keypoints)
        if len(descs) == 0:
            descs = desc
        else:
            descs = np.append(
                descs, desc, axis=0)
    test_x.append(descs)
    test_y.append(label)

for descs, test_label in zip(test_x, test_y):
    for pred_label, model in models.items():
        score = model.score(descs)
        print(test_label, '->', pred_label, score)

airplane -> airplane -120454.03702592754
airplane -> car -121061.63978531306
airplane -> motorbike -119660.8862679648
car -> airplane -162306.35159566175
car -> car -161333.75710071798
car -> motorbike -162209.1739355846
motorbike -> airplane -292652.6592317896
motorbike -> car -294093.2031186345
motorbike -> motorbike -290136.95963470405

人脸识别

人脸识别与图像识别的区别在于人脸识别需要识别出两个人的不同点。

视频捕捉

使用opencv访问视频设备，获取图像帧。

import cv2 as cv
# 获取视频设备
video_capture = cv.videoCapture(0)
# 调用方法读取视频设备捕获的一帧图片
frame = video_capture.read()[1]
cv.imshow('', frame)
video_capture.release()
cv.destroyAllWindows()

案例：

import cv2 as cv
# 0：下标为0的设备
vc = cv.VideoCapture(0)
while True:
    frame = vc.read()[1]
    cv.imshow('Video Capture', frame)
    if cv.waitKey(33) == 27:
        break

vc.release()
cv.destroyAllWindows()

人脸定位

哈尔级联人脸定位

fd = cv.CascadeClassifier('../ml_data/haar/face.xml')
ed = cv.CascadeClassifier('../ml_data/haar/eye.xml')
nd = cv.CascadeClassifier('../ml_data/haar/nose.xml')
# 构建级联人脸定位器
# 1.3: 物体的最小尺寸
# 5:   查找物体的最大个数
faces = fd.detectMultiScale(frame, 1.3, 5)
# faces: [(),(),(),() ...]
for l, t, w, h in faces:
    a, b = int(w / 2), int(h / 2)
    cv.ellipse(frame,    #  原图片
               (l + a, t + b),    # 椭圆心
               (a, b),    #半径
               0, # 椭圆旋转角度
               0, 360, # 起始角，终止角
               (255, 0, 0), 2 # 椭圆颜色   线宽
    )

案例：

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
"""
demo11_face.py 人脸定位
"""
import cv2 as cv

fd = cv.CascadeClassifier('../ml_data/haar/face.xml')
ed = cv.CascadeClassifier('../ml_data/haar/eye.xml')
nd = cv.CascadeClassifier('../ml_data/haar/nose.xml')

vc = cv.VideoCapture(0)
while True:
    frame = vc.read()[1]
    # 识别脸
    faces = fd.detectMultiScale(frame, 1.3, 5)
    for l, t, w, h in faces:
        a, b = int(w / 2), int(h / 2)
        cv.ellipse(frame,
                   (l + a, t + b),
                   (a, b),
                   0, 0, 360, (255, 0, 0), 2)
        # 识别鼻子和眼睛
        face = frame[t:t + h, l:l + w]
        eyes = ed.detectMultiScale(face, 1.3, 2)
        for l, t, w, h in eyes:
            a, b = int(w / 2), int(h / 2)
            cv.ellipse(
                face, (l + a, t + h), (a, b),
                0, 0, 360, (0, 255, 255), 2)
        nose = nd.detectMultiScale(face, 1.3, 1)
        for l, t, w, h in nose:
            a, b = int(w / 2), int(h / 2)
            cv.ellipse(
                face, (l + a, t + h), (a, b),
                0, 0, 360, (0, 0, 255), 2)

    cv.imshow('Video', frame)
    if cv.waitKey(33) == 27:
        break
vc.release()
cv.destroyAllWindows()