Chapter 9 k-nearest neighbors

Optional Reading: Chapter 3 in Machine Learning with R by Brett Lantz

9.1 Introduction

Setting: We have data \(\{(x_i, y_i) : i=1,\ldots,n \}\), where \(x_i\) and \(y_i\) are the vector of features and the class label for the \(i\)th observation, respectively. For example, in breast cancer diagnosis, \(x_i\) could be some summary statistics of the radius, texture, perimeter, and area of the cell nuclei from a digitized image of a fine needle aspirate of a breast mass; \(y_i\) is the diagnosis (malignant or benign).

We now have a new data point \(x^*\) (of course we do not have the class label \(y^*\)) and we want to assign a class label to it. For example, after a subject has a breast fine needle aspiration, can we use existing data to predict if the subject has breast cancer?

In this chapter, we will study the \(k\)-nearest neighbors (\(k\)-NN) algorithm for classification. The \(k\)-NN algorithm uses information about an example’s \(k\) nearest neighbors to classify unlabeled examples. To measure how close two examples are, we need a distance measure. Traditionally, the \(k\)-NN algorithm uses the Euclidean distance. Given two points \(u = (u_1,\ldots,u_p)\) and \(w = (w_1,\ldots,w_p)\), the Euclidean distance between them is \[\begin{equation*} d(u, w) = \sqrt{ \sum^p_{j=1}(u_j - w_j)^2}. \end{equation*}\]
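
For example, the Euclidean distance between \(u = (1, 2, 3)\) and \(w = (2, 0, 3)\) is \(\sqrt{(1-2)^2 + (2-0)^2 + (3-3)^2} = \sqrt{5} \approx 2.236\). A quick check in R (the vectors u and w here are just toy values):

u <- c(1, 2, 3)
w <- c(2, 0, 3)
sqrt(sum((u - w)^2)) # Euclidean distance, about 2.236; same as dist(rbind(u, w))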

Algorithm:

  1. Compute the Euclidean distance \(d(x_i, x^*)\) for \(i=1,\ldots,n\)
  2. Find the \(k\) training data with the smallest \(d(x_i, x^*)\)
  3. The predicted class for \(x^*\) is determined by the majority vote of the \(k\) training data in Step 2.
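
To make these three steps concrete, here is a minimal from-scratch sketch in R for classifying a single new point (the function name knn_predict_one and the toy data are made up for illustration; later in this chapter we will use knn() from the class package instead):

# classify one new point x_star using the k nearest training points
knn_predict_one <- function(X_train, y_train, x_star, k = 5) {
  # Step 1: Euclidean distance from x_star to every training point
  d <- sqrt(rowSums(sweep(X_train, 2, x_star)^2))
  # Step 2: indices of the k training points with the smallest distances
  nn <- order(d)[1:k]
  # Step 3: majority vote among the labels of these k neighbors
  names(which.max(table(y_train[nn])))
}

# toy example with two groups
X <- matrix(c(1, 1,  1, 2,  5, 5,  6, 5), ncol = 2, byrow = TRUE)
y <- c("A", "A", "B", "B")
knn_predict_one(X, y, x_star = c(1.5, 1.5), k = 3) # "A"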

The idea behind the algorithm is simple: we expect observations with similar features to have the same class label. In the following figure, we have some labeled data. Suppose the black dot is a new data point without a label; which group would you assign it to? A natural choice is to assign the black dot to group A because its \(5\) nearest neighbors are all in group A. This is the idea of the \(k\)-nearest neighbors algorithm.

Remark:

  1. We can use the \(k\)-NN algorithm for regression. In that case, \(y_i\) is a numeric response, and the algorithm replaces the majority vote by the mean of the responses of the \(k\) nearest training data (a short sketch is given after these remarks).

  2. Other distance measures could be used.

  3. There are different methods to break ties.

  4. There is no explicit learning (training) phase; all computation is deferred until a prediction is needed, which is why \(k\)-NN is often called a lazy learner.
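
As mentioned in Remark 1, only the voting step changes for regression. A minimal sketch (again, the function name knn_regress_one is just illustrative):

# k-NN regression: average the responses of the k nearest training points
knn_regress_one <- function(X_train, y_train, x_star, k = 5) {
  d <- sqrt(rowSums(sweep(X_train, 2, x_star)^2)) # Euclidean distances
  mean(y_train[order(d)[1:k]])                    # mean response of the k neighbors
}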

Applications:

  • Computer vision applications, including optical character recognition and facial recognition in both still images and video
  • Recommendation systems that predict whether a person will enjoy a movie or song
  • Identifying patterns in genetic data to detect specific proteins or diseases

Strengths of \(k\)-NN:

  • Simple and effective
  • Makes no assumptions about the underlying data distribution

Weaknesses of \(k\)-NN:

  • Does not produce a model, limiting the ability to understand how the features are related to the class
  • Requires selection of an appropriate \(k\)
  • Slow classification phase
  • Nominal features and missing data require additional processing

Choosing an appropriate \(k\):

  • Large \(k\): reduces the impact or variance caused by noisy data, but may underfit the training data
  • Small \(k\): may overfit the training data

Extreme case: when \(k = n\), the most common class in the training data will always be the prediction.

Some people suggest using \(\sqrt{n}\) as \(k\). One may also use a validation set or cross-validation to choose the appropriate \(k\) (will discuss these later).
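
For instance, the training set used later in this chapter has \(469\) observations, and \(\sqrt{469} \approx 21.7\), which suggests trying an odd value such as \(k = 21\) (an odd \(k\) also avoids voting ties between two classes):

sqrt(469) # about 21.7, so k = 21 is a reasonable starting value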

9.2 Feature Scaling

Clearly, if the features are on different scales, the computed distance will be dominated by some of the features and will fail to account for the importance of the others. Suppose we have \(n\) observations and \(p\) features.

Two common methods for feature scaling:

  1. min-max normalization

    For each \(i=1,\ldots,n\), \(j=1,\ldots,p\), \[\begin{equation*} x^*_{ij} = \frac{x_{ij} - \min_{1 \le l \le n} x_{lj}}{\max_{1 \le l \le n} x_{lj} - \min_{1 \le l \le n} x_{lj}}. \end{equation*}\] The normalized features will have values between \(0\) and \(1\).

  2. \(z\)-score standardization For each \(i=1,\ldots,n\), \(j=1,\ldots,p\), \[\begin{equation*} x^*_{ij} = \frac{x_{ij} - \overline{x}_j}{s_j}, \end{equation*}\] where \(\overline{x}_j\) and \(s_j\) are the sample mean and standard deviation of \(\{x_{1j},\ldots,x_{nj}\}\).
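
Both scalings are easy to write as small helper functions in R. Below is a sketch; the function names min_max and z_score are just illustrative, and the built-in iris data serve as an example:

# min-max normalization of a numeric vector
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# z-score standardization of a numeric vector
z_score <- function(x) (x - mean(x)) / sd(x)

# apply a scaling column by column; lapply() works over the columns of a data frame
iris_n <- as.data.frame(lapply(iris[, 1:4], min_max))
summary(iris_n$Sepal.Length) # values now range from 0 to 1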

For nominal features, we can convert them into numeric features using dummy coding (also known as one-hot encoding). For example, suppose a nominal feature called temperature takes three values: hot, medium and cold. We can define two additional binary indicator variables: \[\begin{equation*} \text{hot} = \left\{ \begin{array}{ll} 1 & \text{if temperature = hot} \\ 0 & \text{otherwise} \end{array} \right. \end{equation*}\] and \[\begin{equation*} \text{medium} = \left\{ \begin{array}{ll} 1 & \text{if temperature = medium} \\ 0 & \text{otherwise.} \end{array} \right. \end{equation*}\] When both hot and medium are \(0\), we know the temperature is cold. In general, if a nominal feature has \(m\) categories, we only need to create \(m-1\) binary indicators.
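
A minimal sketch of this dummy coding in R, using a made-up temperature vector:

temperature <- c("hot", "cold", "medium", "hot", "cold")

# two binary indicators; "cold" is the baseline (both indicators equal 0)
hot    <- ifelse(temperature == "hot", 1, 0)
medium <- ifelse(temperature == "medium", 1, 0)

data.frame(temperature, hot, medium)

Alternatively, if temperature is stored as a factor, model.matrix(~ temperature)[, -1] creates the same indicators automatically.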

9.3 Example: Classifying Breast Cancers

We will use the breast cancer dataset from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Download the dataset from onQ. Read the data (change the path to where you save your data):

wbcd <- read.csv(".../wisc_bc_data.csv")

Variables:

names(wbcd)
##  [1] "id"                      "diagnosis"               "radius_mean"             "texture_mean"            "perimeter_mean"         
##  [6] "area_mean"               "smoothness_mean"         "compactness_mean"        "concavity_mean"          "concave.points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean"  "radius_se"               "texture_se"              "perimeter_se"           
## [16] "area_se"                 "smoothness_se"           "compactness_se"          "concavity_se"            "concave.points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"    "radius_worst"            "texture_worst"           "perimeter_worst"        
## [26] "area_worst"              "smoothness_worst"        "compactness_worst"       "concavity_worst"         "concave.points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"

Creating Training and Testing Datasets

The first column is id, which should not be included in the classification. The second column is diagnosis: B means benign and M means malignant. In short, malignant means cancerous and benign means non-cancerous. We will separate the labels from the features. To evaluate the model performance, we always split the full dataset into a training dataset and a testing dataset.

set.seed(6) # reproduce the result
# create the random numbers for selecting the rows in the dataset
random_index <- sample(nrow(wbcd), 469) 

# our "x"
wbcd_train <- wbcd[random_index, -(1:2)]
wbcd_test <- wbcd[-random_index, -(1:2)]

# our "y"
wbcd_train_labels <- wbcd[random_index, ]$diagnosis
wbcd_test_labels <- wbcd[-random_index, ]$diagnosis

Note: In forming a training dataset and a testing dataset, you should not select the first \(469\) rows (unless you know the data have been randomly organized).

Normalizing the data

  1. We have to normalize the features in both the training dataset and the testing dataset
  2. We have to use the same normalization method for both datasets
  3. Compute the min and max (or mean and sd) using only the training dataset

wbcd_train_n <- wbcd_train
wbcd_test_n <- wbcd_test

train_min <- apply(wbcd_train, 2, min)
train_max <- apply(wbcd_train, 2, max)

for (i in 1:ncol(wbcd_train)) {
  wbcd_train_n[, i] <- (wbcd_train[, i] - train_min[i]) / (train_max[i] - train_min[i]) 
  # use the min and max from training data to normalize the testing data
  wbcd_test_n[, i] <- (wbcd_test[, i] - train_min[i]) / (train_max[i] - train_min[i]) 
}
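
As an aside, the same normalization can be done without an explicit loop using base R's scale(), which subtracts the center vector from each column and then divides by the scale vector. This is just an equivalent sketch; note that scale() returns a matrix rather than a data frame, which knn() also accepts:

# equivalent min-max normalization using the training min and range
wbcd_train_n <- scale(wbcd_train, center = train_min, scale = train_max - train_min)
wbcd_test_n  <- scale(wbcd_test,  center = train_min, scale = train_max - train_min)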

We will use knn() from the package class to perform \(k\)-NN classification. knn() returns a factor containing the predicted classes for the testing dataset.

library(class) # install it if you haven't done so
knn_predicted <- knn(train = wbcd_train_n, test = wbcd_test_n, 
                     cl = wbcd_train_labels, k = 21)
  • train : training dataset
  • test: testing dataset
  • cl: training labels
  • k: the number of nearest neighbors to use

9.3.1 Evaluating Model Performance

table(wbcd_test_labels, knn_predicted)
##                 knn_predicted
## wbcd_test_labels  B  M
##                B 57  2
##                M  4 37

False positive: an error where the test result incorrectly indicates the presence of a condition, such as a disease, when it is not present. In this example, treating malignant as the positive condition, we have \(2\) false positives (benign cases predicted to be malignant).

False negative: an error where the test result incorrectly fails to indicate the presence of a condition when it is present. Here we have \(4\) false negatives (malignant cases predicted to be benign).

Here the “test result” means our prediction.

The accuracy is \[\begin{equation*} \text{accuracy} = \frac{57+37}{57+2+4+37} = 0.94. \end{equation*}\]

The error rate is \[\begin{equation*} \text{error rate} = \frac{2 + 4}{57+2+4+37} = 0.06 = 1- \text{accuracy}. \end{equation*}\]
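
Both quantities can be computed directly from the confusion matrix in R (a short sketch using the objects created above):

confusion <- table(wbcd_test_labels, knn_predicted)
accuracy <- sum(diag(confusion)) / sum(confusion) # 0.94 for the table above
error_rate <- 1 - accuracy                        # 0.06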

9.3.2 Using \(z\)-score standardization

Let’s try the \(z\)-score standardization.

wbcd_train_s <- wbcd_train
wbcd_test_s <- wbcd_test

train_mean <- apply(wbcd_train, 2, mean)
train_sd <- apply(wbcd_train, 2, sd)

for (i in 1:ncol(wbcd_train)) {
  wbcd_train_s[, i] <- (wbcd_train[, i] - train_mean[i]) / train_sd[i]
  # use the mean and sd from training data to normalize the testing data
  wbcd_test_s[, i] <- (wbcd_test[, i] - train_mean[i]) / train_sd[i]
}

Perform \(k\)-NN:

knn_predicted <- knn(train = wbcd_train_s, test = wbcd_test_s, 
                     cl = wbcd_train_labels, k = 21)

Evaluate the performance:

table(wbcd_test_labels, knn_predicted)
##                 knn_predicted
## wbcd_test_labels  B  M
##                B 58  1
##                M  5 36

The performance using \(z\)-score standardization and min-max normalization is similar.

9.3.3 Testing alternative values of \(k\)

k <- c(1, 5, 11, 15, 21, 27)
result <- matrix(0, nrow = length(k), ncol = 4)
result[, 1] <- k
colnames(result) = c("k", "False Neg", "False Pos", 
                     "% Classified Correctly")

for (i in 1:length(k)) {
  knn_predicted <- knn(train = wbcd_train_n, test = wbcd_test_n,
                       cl = wbcd_train_labels, k = k[i])
  confusion_matrix <- table(wbcd_test_labels, knn_predicted)
  result[i, 2] <- confusion_matrix[2, 1]
  result[i, 3] <- confusion_matrix[1, 2]
  result[i, 4] <- sum(diag(confusion_matrix)) / length(wbcd_test_labels) * 100
}

result
##       k False Neg False Pos % Classified Correctly
## [1,]  1         4         3                     93
## [2,]  5         4         1                     95
## [3,] 11         3         1                     96
## [4,] 15         3         2                     95
## [5,] 21         4         2                     94
## [6,] 27         3         2                     95

Important remark: while the accuracy is highest when \(k = 11\), this does not mean that \(k = 11\) gives the best model for this data. Note that

  1. We split the dataset into a training dataset and a testing dataset using one particular random partition. With another random partition, the results would be different. A more appropriate way to assess the accuracy is to use repeated \(k\)-fold cross-validation (the \(k\) here is different from the \(k\) in the \(k\)-NN algorithm). We shall discuss this later.

  2. False negatives could be a more severe problem than false positives in this example, since a missed cancer is more serious than an unnecessary follow-up. Here, \(k = 11\) also ties for the lowest number of false negatives.