ISyE6501 HW2

Question 3.1

First, I clear the workspace, load the kknn library, and read in the data file.

rm(list = ls())
library(kknn)
set.seed(42)
CCdata <- read.table("credit_card_data.txt", stringsAsFactors = FALSE, header = FALSE)

3.1 (a)

For part (a), I create a KNN model cross-validated with leave-one-out (train.kknn) in R, because leave-one-out cross-validation is computationally efficient. Since we are testing the model on the entire dataset for part (a), I write a for loop that tests each value of k, comparing each model's fitted values to the response values, to find which k in 1-50 gives the highest prediction accuracy; this identifies the best k and best classifier. Because the response is binary, I round each fitted value to 0 or 1 before computing accuracy, and I choose the best k by the highest accuracy rather than whichever k the model itself reports. I will split the data in part (b).

accuracy <- c()
for (k in 1:50) {
  model <- train.kknn(V11 ~ ., CCdata, kmax = k, scale = TRUE)
  pred_k <- as.integer(fitted(model)[[k]][1:nrow(CCdata)] + 0.5)
  accuracy[k] <- sum(pred_k == CCdata[,11]) / nrow(CCdata)
}
kval <- c(1:50)
plot(kval, accuracy)
max(accuracy)

## [1] 0.853211

which.max(accuracy)

## [1] 12

From the results of the leave-one-out cross-validation model, the prediction is most accurate at k = 12, with 85.32% accuracy. So our best classifier uses k = 12.

Second, we can also use the cv.kknn function to perform a specific type of cross-validation on the dataset. I use 10-fold cross-validation in my model because the course videos mention that 10 folds are a common choice, and I then compare the different models in a for loop to find the best k.
set.seed(42)
accuracy <- c()
for (k in 1:50) {
  model_cv <- cv.kknn(V11 ~ ., CCdata, kcv = 10, k = k, scale = TRUE)  # 10-fold cross-validation
  pred_k <- as.integer(model_cv[[1]][,2] + 0.5)
  accuracy[k] <- sum(pred_k == CCdata[,11]) / nrow(CCdata)
}
kval <- c(1:50)
plot(kval, accuracy)
max(accuracy)

## [1] 0.8608563

which.max(accuracy)

## [1] 12

Using the cv.kknn function with 10-fold cross-validation, our best classifier is again k = 12, which gives the highest accuracy of 86.09%.

3.1 (b)

First I use R's built-in sample function to split the dataset into three groups. I want the training, validation, and testing sets to be 60%, 20%, and 20% of the data, chosen at random. I separate the training data first, which is 60% of the credit card data, and then divide the rest of CCdata equally into the validation and testing datasets.
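The 60/20/20 split described above can be sketched as follows. This is a minimal illustration, assuming CCdata has been loaded as in part (a); the index variable names (train_idx, val_idx, test_idx) are my own, not from the assignment.

set.seed(42)
n <- nrow(CCdata)

# Draw 60% of the row indices at random for training
train_idx <- sample(1:n, size = round(0.6 * n))

# Split the remaining 40% of indices evenly into validation and testing
remaining <- setdiff(1:n, train_idx)
val_idx   <- sample(remaining, size = round(0.5 * length(remaining)))
test_idx  <- setdiff(remaining, val_idx)

train_data <- CCdata[train_idx, ]
val_data   <- CCdata[val_idx, ]
test_data  <- CCdata[test_idx, ]

Because each index appears in exactly one of the three groups, every row of CCdata is used once and only once across the training, validation, and testing sets.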