Building KNN from Scratch

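Workflow

The basic KNN workflow is:

  1. Prepare the dataset: training samples that have already been split and labeled (their classes are known).
  2. Choose K: pick a suitable value of K, the number of nearest neighbors to consider.
  3. Compute distances: for each sample to classify, compute its distance to every training sample. Common distance metrics include Euclidean distance and Manhattan distance.
  4. Select the K nearest neighbors: using the computed distances, take the K training samples closest to the sample being classified.
  5. Determine the class: for classification, hold a majority vote (or a weighted vote) over the neighbors' classes and assign the sample to the class with the most votes. For regression, use the mean (or weighted mean) of the neighbors' target values as the prediction.
  6. Output the result: return the predicted class or value.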
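
Steps 3 through 5 can be sketched on a toy dataset as follows. This is only an illustration: the helper names (`manhattan_distance`, `k_nearest`, `weighted_vote`, `knn_regress`), the inverse-distance weighting scheme, and the toy data are all assumptions introduced here, not part of the implementation in the Code section.

```python
from collections import defaultdict

import numpy as np


def euclidean_distance(x1, x2):
    # Straight-line (L2) distance between two points.
    return np.sqrt(np.sum((x1 - x2) ** 2))


def manhattan_distance(x1, x2):
    # Sum of absolute coordinate differences (L1 distance).
    return np.sum(np.abs(x1 - x2))


def k_nearest(x, X_train, k, metric=euclidean_distance):
    # Return (indices, distances) of the k training samples closest to x.
    distances = np.array([metric(x, row) for row in X_train])
    idx = np.argsort(distances)[:k]
    return idx, distances[idx]


def weighted_vote(labels, distances, eps=1e-9):
    # Assumed weighting scheme: each neighbor votes with weight 1 / distance.
    scores = defaultdict(float)
    for label, d in zip(labels, distances):
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)


def knn_regress(x, X_train, y_train, k, metric=euclidean_distance):
    # Regression: predict the plain mean of the k nearest targets.
    idx, _ = k_nearest(x, X_train, k, metric)
    return float(np.mean(y_train[idx]))


# Toy data, made up for illustration.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_class = np.array([0, 0, 0, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
query = np.array([0.5, 0.5])

idx, dists = k_nearest(query, X_train, k=3)
predicted_class = weighted_vote(y_class[idx], dists)
predicted_value = knn_regress(query, X_train, y_value, k=3)
print(predicted_class, predicted_value)
```

Passing `metric=manhattan_distance` to `k_nearest` switches the distance measure without touching the voting logic, which is one reason to keep the metric as a separate function.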

 

Code

from collections import Counter  # used for the majority vote in KNN._predict

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

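    def _predict(self, x):
        # Compute the distance from the current sample to every training sample
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Sort by distance and take the indices of the k nearest samples
        k_indices = np.argsort(distances)[:self.k]
        # Look up the labels of the k nearest samples
        k_labels = [self.y_train[i] for i in k_indices]
        # Count label occurrences among the k neighbors and return the most common one
        most_common = Counter(k_labels).most_common(1)
        return most_common[0][0]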

if __name__ == '__main__':
    path = r"./diabetes.csv"
    df = pd.read_csv(path)
    print(df.head(2))
    print(df.shape)
    print(df.info())

    features = [c for c in df.columns if c != 'Outcome']
    X = df.loc[:, features]
    y = df.loc[:, 'Outcome']
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.4, random_state=42, stratify=y)
    n_neighbors = 7
    knn = KNN(k=n_neighbors)
    knn.fit(X_train.values, y_train.values)
    y_pred = knn.predict(X_test.values)
    print("40% test data scratch knn accuracy score:",
          accuracy_score(y_test, y_pred))

    sk_knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    sk_knn.fit(X_train, y_train)
    y_pred_sk = sk_knn.predict(X_test)
    print("40% test data sklearn knn accuracy score:",
          accuracy_score(y_test, y_pred_sk))

The accuracies on the 40% test split are:

40% test data scratch knn accuracy score: 0.730519480519480
40% test data sklearn knn accuracy score: 0.730519480519480
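
The script fixes `n_neighbors = 7`. As a rough sketch of how step 2 of the workflow (choosing K) might be explored, the block below computes all pairwise distances at once with NumPy broadcasting and compares several K values on a held-out validation split. The synthetic two-cluster data, the function names `knn_predict` and `accuracy`, and the candidate K values are assumptions made for illustration; none of them come from the diabetes experiment.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    # Pairwise Euclidean distances via broadcasting: shape (n_query, n_train).
    diff = X_query[:, None, :] - X_train[None, :, :]
    distances = np.sqrt(np.sum(diff ** 2, axis=2))
    # Indices of the k closest training samples for every query row.
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Majority vote over the neighbors' labels, one query at a time.
    return np.array([Counter(y_train[row].tolist()).most_common(1)[0][0]
                     for row in nearest])


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return float(np.mean(y_true == y_pred))


# Synthetic data (assumed): two Gaussian clusters, 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 40% of the samples for validation.
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.6 * len(X))
X_tr, y_tr, X_val, y_val = X[:split], y[:split], X[split:], y[split:]

# Evaluate a few odd K values (an odd K avoids tied votes with two classes).
scores = {k: accuracy(y_val, knn_predict(X_tr, y_tr, X_val, k))
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

The broadcasting step builds an array of shape (n_query, n_train, n_features), so it trades memory for speed; for large datasets the per-sample loop used in the `KNN` class, or a batched variant, may be the safer choice.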

 

 
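Copyright notice: original article from this site, published by admin on 2023-11-26, 1733 characters in total.
Reposting: unless otherwise stated, articles on this site are released under the CC-4.0 license; to repost, please contact tensortimes@gmail.com.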