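Procedure
The basic KNN procedure is as follows:
- Prepare the dataset: labeled training samples (samples whose classes are known), already split off from the test data.
- Choose K: pick a suitable value of K, the number of nearest neighbors to consider.
- Compute distances: for each query sample, compute its distance to every training sample. Common distance metrics include the Euclidean distance and the Manhattan distance.
- Select the K nearest neighbors: using the computed distances, take the K training samples closest to the query sample.
- Determine the prediction: for classification, the nearest neighbors vote (with or without weights) and the query sample is assigned the class with the most votes. For regression, the mean or weighted mean of the neighbors' target values is used as the predicted value.
- Output the result: return the predicted class or value for the query sample.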
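Steps 3 to 5 above can be sketched as follows. This is a minimal illustration rather than the implementation used later in this post: the function names `distances_to` and `knn_predict`, the `metric`, `weighted`, and `task` parameters, and the toy data are all assumptions made for the example. It shows the two distance metrics named above, inverse-distance weighted voting for classification, and a (weighted) mean for regression.

```python
import numpy as np


def distances_to(x, X_train, metric="euclidean"):
    """Distance from one query point x to every row of X_train."""
    diff = X_train - x
    if metric == "euclidean":
        return np.sqrt(np.sum(diff ** 2, axis=1))
    if metric == "manhattan":
        return np.sum(np.abs(diff), axis=1)
    raise ValueError(f"unknown metric: {metric}")


def knn_predict(x, X_train, y_train, k=3, metric="euclidean",
                weighted=False, task="classification"):
    dist = distances_to(x, X_train, metric)
    idx = np.argsort(dist)[:k]  # indices of the k nearest neighbors
    if weighted:
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = 1.0 / (dist[idx] + 1e-9)
    else:
        weights = np.ones(len(idx))
    if task == "regression":
        # (Weighted) mean of the neighbors' target values.
        return float(np.average(y_train[idx], weights=weights))
    # (Weighted) vote: sum the weights per class, return the heaviest class.
    votes = {}
    for label, w in zip(y_train[idx], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)


# Toy data (hypothetical): two well-separated clusters.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_cls = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 4.5])
query = np.array([0.5, 0.5])

print(knn_predict(query, X_train, y_cls, k=3))
print(knn_predict(query, X_train, y_cls, k=3, metric="manhattan", weighted=True))
print(knn_predict(query, X_train, y_reg, k=3, task="regression"))
```

With `weighted=False` every neighbor counts equally, which matches plain majority voting. With `weighted=True` closer neighbors count more, which can help when K is large enough to pull in distant points.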
Code
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))


class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        # Compute the distance from the current sample to every training sample
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Sort by distance and take the indices of the k nearest samples
        k_indices = np.argsort(distances)[:self.k]
        # Look up the labels of the k nearest samples
        k_labels = [self.y_train[i] for i in k_indices]
        # Count the labels among the k nearest samples and return the most common one
        most_common = Counter(k_labels).most_common(1)
        return most_common[0][0]


if __name__ == '__main__':
    path = r"./diabetes.csv"
    df = pd.read_csv(path)
    print(df.head(2))
    print(df.shape)
    df.info()  # info() prints its summary itself and returns None

    # Every column except 'Outcome' is a feature; 'Outcome' is the label
    features = [c for c in df.columns if c != 'Outcome']
    X = df.loc[:, features]
    y = df.loc[:, 'Outcome']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42, stratify=y)

    n_neighbors = 7

    # KNN implemented from scratch
    knn = KNN(k=n_neighbors)
    knn.fit(X_train.values, y_train.values)
    y_pred = knn.predict(X_test.values)
    print("%40 test data scratch knn accuracy score:",
          accuracy_score(y_test, y_pred))

    # scikit-learn's KNN with the same K, for comparison
    sk_knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    sk_knn.fit(X_train, y_train)
    y_pred_sk = sk_knn.predict(X_test)
    print("%40 test data sklearn knn accuracy score:",
          accuracy_score(y_test, y_pred_sk))
The accuracy of the two implementations on the 40% test split is as follows:
%40 test data scratch knn accuracy score: 0.730519480519480
%40 test data sklearn knn accuracy score: 0.730519480519480
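The script above fixes `n_neighbors = 7`, while the procedure only says to choose a suitable K. One common way to choose it is to compare several candidate values on held-out validation data. The sketch below is a hedged illustration under stated assumptions: it uses synthetic two-class data rather than `diabetes.csv`, a simple hold-out split, and hypothetical helper names (`knn_predict_batch`, `accuracy_by_k`). It also computes all distances at once with NumPy broadcasting instead of looping sample by sample.

```python
import numpy as np


def knn_predict_batch(X_train, y_train, X_query, k):
    """Majority-vote KNN for all query rows at once (non-negative integer labels)."""
    # Pairwise Euclidean distances, shape (n_query, n_train), via broadcasting.
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    # Indices of the k nearest training samples for each query row.
    nearest = np.argsort(dist, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # bincount tallies votes per class; argmax returns the most frequent class
    # (ties go to the smallest label).
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])


def accuracy_by_k(X_train, y_train, X_val, y_val, k_values):
    """Validation accuracy for each candidate K."""
    return {
        k: float(np.mean(knn_predict_batch(X_train, y_train, X_val, k) == y_val))
        for k in k_values
    }


# Synthetic data (an assumption for this example): two Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 40% of the samples for validation.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
X_train, y_train = X[:120], y[:120]
X_val, y_val = X[120:], y[120:]

scores = accuracy_by_k(X_train, y_train, X_val, y_val, k_values=[1, 3, 5, 7, 9])
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Odd values of K are used here so that a two-class vote cannot end in a tie. The K selected this way depends on the particular split, so it is best treated as a starting point and not as a definitive choice.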