1. kNN: the k-Nearest Neighbors algorithm
1.1 k-Nearest Neighbors
Consider a scatter plot of time against tumor size, where each point represents a patient: red points are benign tumors and blue points are malignant tumors. Suppose a new patient arrives (the green point). How do we decide whether this patient's tumor is benign or malignant?
According to the k-nearest neighbors algorithm, suppose k = 3. We find the three points closest to the green point and let them vote. The vote turns out to be malignant : benign = 3 : 0, so we conclude the new point is very likely blue (malignant).
Put simply, kNN says that if two samples are sufficiently similar, they very likely belong to the same class; of course, comparing against a single sample is not enough, which is why we vote among k neighbors.
The most direct use of kNN is classification, a supervised-learning problem.
1.2 The kNN process
1.2.1 First, create the feature matrix and the data set
import numpy as np
import matplotlib.pyplot as plt
raw_data_X = [[3.393533211,2.331273381],
[3.110073483,1.781539638],
[1.343808831,3.368360954],
[3.582294042,4.679179110],
[2.280362439,2.866990263],
[7.423436942,4.696522875],
[5.745051997,3.533989803],
[9.172168622,2.511101045],
[7.792783481,3.424088941],
[7.939820817,0.791637231]]
raw_data_y = [0,0,0,0,0,1,1,1,1,1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
plt.scatter(X_train[y_train==0,0],X_train[y_train==0,1],color = 'g')
plt.scatter(X_train[y_train==1,0],X_train[y_train==1,1],color = 'r')
plt.show()
x = np.array([8.093607318,3.365731514])
plt.scatter(X_train[y_train==0,0],X_train[y_train==0,1],color = 'g')
plt.scatter(X_train[y_train==1,0],X_train[y_train==1,1],color = 'r')
plt.scatter(x[0],x[1],color='b')
plt.show()
1.2.2 The kNN computation
from math import sqrt
distances = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - x)**2))
    distances.append(d)
distances
[4.812566907609877,
5.229270827235305,
6.749798999160064,
4.6986266144110695,
5.83460014556857,
1.4900114024329525,
2.354574897431513,
1.3761132675144652,
0.3064319992975,
2.5786840957478887]
distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]  # distance from every training point to the new point (same as the for loop above)
nearest = np.argsort(distances)  # indices of the training points, sorted by distance (the original omitted the `nearest =` assignment)
nearest
Output:
array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2], dtype=int64)
k = 6  # set k = 6
topK_y = [y_train[i] for i in nearest[:k]]  # labels of the k nearest points
topK_y
[1, 1, 1, 1, 1, 0]
from collections import Counter
votes = Counter(topK_y)  # tally topK_y by class
votes
Counter({1: 5, 0: 1})
votes.most_common(1)  # we only need the single element with the most votes, so the argument is 1
[(1, 5)]
predict_y = votes.most_common(1)[0][0]  # what we ultimately want is the class of the new point, so we take the label of the top-voted pair, hence [0][0]
predict_y
1
2. scikit-learn: encapsulating machine learning algorithms
In scikit-learn, every machine learning algorithm is wrapped as an object (a class with fit and predict methods).
2.1 The kNN algorithm in scikit-learn
This is the standard workflow: construct the classifier, fit it on the training data, then predict.
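As a sketch of that workflow, using the same ten-point toy data set from section 1.2 (coordinates rounded here for brevity):

```python
# A minimal sketch of the standard scikit-learn workflow:
# construct -> fit -> predict.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[3.39, 2.33], [3.11, 1.78], [1.34, 3.37], [3.58, 4.68],
                    [2.28, 2.87], [7.42, 4.70], [5.75, 3.53], [9.17, 2.51],
                    [7.79, 3.42], [7.94, 0.79]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

knn_clf = KNeighborsClassifier(n_neighbors=6)  # k is passed as n_neighbors
knn_clf.fit(X_train, y_train)                  # "training" = storing the data
x_new = np.array([[8.09, 3.37]])               # predict expects a 2-D array
print(knn_clf.predict(x_new))                  # -> [1], matching the manual vote above
```

Note that `predict` takes a matrix of samples, not a single vector, which is why `x_new` is wrapped in an extra pair of brackets.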
We can also write our own class that encapsulates the algorithm:
import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier:
    def __init__(self, k):
        """Initialize the kNN classifier"""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Train the kNN classifier with the training sets X_train and y_train"""
        assert X_train.shape[0] == y_train.shape[0]
        assert self.k <= X_train.shape[0]
        self._X_train = X_train  # the original assigned self.X_train here, which predict() never reads
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None and self._y_train is not None
        assert X_predict.shape[1] == self._X_train.shape[1]
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        assert x.shape[0] == self._X_train.shape[1]
        distances = [sqrt(np.sum((x_train - x)**2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]
3. Evaluating the performance of a machine learning algorithm
We can use the helper functions already provided in sklearn, such as train_test_split to hold out a test set and score to measure accuracy.
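As a hedged sketch of the usual evaluation loop, here on the iris data set (any labeled data set works the same way):

```python
# Hold out a test set with train_test_split, then measure accuracy with score().
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)       # fit on the training portion only
acc = knn_clf.score(X_test, y_test) # fraction of correct predictions on held-out data
print(acc)
```

Scoring on data the model never saw during fitting is what makes the accuracy an honest estimate of generalization.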
4. Hyperparameters
The k in the kNN algorithm is a typical hyperparameter.
4.1 Searching for the best k
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
print("best_k = ", best_k)
print("best_score = ", best_score)
The result is best_k = 4 with an accuracy of 99.16%. Note: if the best k we find lies on the boundary of the search range, an even better k may well exist outside that range, so in that case we should extend the search slightly.
4.2 Another hyperparameter of kNN: distance weights
We can use the inverse of the distance as each neighbor's voting weight; taking distance into account also resolves the case where the three neighbors around the green point each have a different color.
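A minimal plain-Python sketch (with made-up toy distances) of how inverse-distance weights break a three-way tie among k = 3 neighbors:

```python
# With plain voting, three neighbors of three different classes tie 1:1:1.
# Weighting each vote by 1/distance lets the closest neighbor win.
distances = [0.5, 1.0, 2.0]           # distance of each neighbor to the new point
labels = ['red', 'blue', 'green']     # each neighbor has a different label

weights = {}
for d, label in zip(distances, labels):
    weights[label] = weights.get(label, 0.0) + 1.0 / d  # weight = 1/distance

winner = max(weights, key=weights.get)
print(winner)  # -> red: the closest neighbor carries the largest weight (2.0 vs 1.0 vs 0.5)
```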
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.neighbors import KNeighborsClassifier
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        # note: the original snippet omitted weights=method here,
        # so the loop never actually tried distance weighting
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_method = method
print("best_method = ", best_method)
print("best_k = ", best_k)
print("best_score = ", best_score)
Output: best_method = uniform, best_k = 4, best_score = 0.9916666666666667. Here "uniform" means every neighbor's vote counts equally (distance is ignored), while "distance" means votes are weighted by distance.
5. Data normalization
Except for data with obvious, well-defined boundary values, such as student exam scores (where min-max scaling works), we generally use mean-variance (standard) normalization.
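Mean-variance normalization maps each feature to zero mean and unit variance via x_scaled = (x - mean) / std. A small sketch with made-up numbers:

```python
# Per-feature standardization by hand: subtract the column mean, divide by
# the column standard deviation. Toy values chosen so the second feature
# dwarfs the first before scaling.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # per-feature mean and std
print(X_scaled.mean(axis=0))  # -> [0. 0.]  each feature now has zero mean
print(X_scaled.std(axis=0))   # -> [1. 1.]  and unit standard deviation
```

Without this step, the second feature would dominate every Euclidean distance, and kNN would effectively ignore the first feature.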
To normalize the test set as well, scikit-learn provides a Scaler class that encapsulates this normalization.
The Scaler workflow
We use the StandardScaler class in scikit-learn to normalize the training data.
mean_ is the per-feature mean and scale_ is the per-feature standard deviation.
When scoring a kNN prediction, always use the normalized data, i.e. X_test_standard; otherwise the resulting accuracy will be very low.
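Putting the pieces together, a sketch of the full Scaler workflow: fit the scaler on the training set only, then transform both sets with the training set's statistics.

```python
# Fit StandardScaler on the TRAINING data, then apply the same transform to
# both training and test sets before fitting and scoring the kNN classifier.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

scaler = StandardScaler()
scaler.fit(X_train)                          # learns mean_ and scale_ from the training set
X_train_standard = scaler.transform(X_train)
X_test_standard = scaler.transform(X_test)   # test data scaled with TRAINING statistics

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_standard, y_train)
score = knn_clf.score(X_test_standard, y_test)  # accuracy on normalized test data
print(score)
```

Transforming the test set with its own statistics would leak information and, worse, make the two sets incomparable; the scaler fitted on the training set must be reused.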
6. Drawbacks of the kNN algorithm
1. Low efficiency
If the training set has m samples and n features, then every prediction of a new sample has time complexity O(m*n).
2. Highly data-dependent: a few mislabeled points near the query can flip a prediction.
3. The predictions are not interpretable.
4. The curse of dimensionality
As the dimension grows, the distance between two "seemingly close" points becomes larger and larger.
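A quick numerical sketch of this effect: the distance from the origin to the "adjacent" unit-cube corner (1, 1, ..., 1) grows as sqrt(d).

```python
# In 1-D the corner is at distance 1; in 10000-D the same "adjacent"
# corner is 100 away, so nearest-neighbor distances lose their meaning.
import numpy as np

for d in [1, 10, 100, 10000]:
    corner = np.ones(d)
    dist = np.linalg.norm(corner)  # Euclidean distance from the origin
    print(d, dist)                 # sqrt(d): 1.0, ~3.16, 10.0, 100.0
```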