Workflow
The basic steps of the KNN algorithm are:
- Prepare the dataset: a labelled (known-class) set of training samples.
- Choose K: pick a suitable value of K, the number of nearest neighbours to consider.
- Compute distances: for the sample to be classified, compute its distance to every training sample. Common distance metrics include Euclidean distance and Manhattan distance.
- Select the K nearest neighbours: based on the computed distances, take the K training samples closest to the query sample.
- Determine the class: for classification, take a (possibly weighted) majority vote among the neighbours' labels and assign the query sample to the winning class. For regression, use the (weighted) average of the neighbours' values as the prediction.
- Output the result: return the predicted class or value for the query sample.
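The two distance metrics mentioned above can be sketched in a few lines of NumPy (a minimal illustration; these helper names are mine, not part of the original code):

```python
import numpy as np

def euclidean(x1, x2):
    # Straight-line (L2) distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def manhattan(x1, x2):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return np.sum(np.abs(np.asarray(x1) - np.asarray(x2)))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```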
Code
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))


class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        # Compute the distance between the query sample and every training sample
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Sort by distance and take the indices of the k nearest samples
        k_indices = np.argsort(distances)[:self.k]
        # Collect the labels of the k nearest samples
        k_labels = [self.y_train[i] for i in k_indices]
        # Count label occurrences among the k neighbours and return the majority
        most_common = Counter(k_labels).most_common(1)
        return most_common[0][0]


if __name__ == '__main__':
    path = r"./diabetes.csv"
    df = pd.read_csv(path)
    print(df.head(2))
    print(df.shape)
    print(df.info())
    features = [c for c in df.columns if c != 'Outcome']
    X = df.loc[:, features]
    y = df.loc[:, 'Outcome']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42, stratify=y)
    n_neighbors = 7
    knn = KNN(k=n_neighbors)
    knn.fit(X_train.values, y_train.values)
    y_pred = knn.predict(X_test.values)
    print("40% test data scratch knn accuracy score:",
          accuracy_score(y_test, y_pred))
    sk_knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    sk_knn.fit(X_train, y_train)
    y_pred_sk = sk_knn.predict(X_test)
    print("40% test data sklearn knn accuracy score:",
          accuracy_score(y_test, y_pred_sk))
The test accuracy of the two models is as follows:
40% test data scratch knn accuracy score: 0.730519480519480
40% test data sklearn knn accuracy score: 0.730519480519480