In my previous article, where I built several classification models on the iris data set, I used accuracy, the proportion of correct predictions out of the total number of predictions, to measure the classification performance of different kinds of models.
There are also many other metrics, such as precision and recall, that are often used to evaluate a learner.
These metrics are all computed from the confusion matrix: if two models have the same confusion matrix, they will have the same metrics.
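As a minimal sketch (with made-up counts, not taken from any model below), accuracy can be computed directly from the four cells of a two-by-two confusion matrix:
# hypothetical confusion matrix counts: TP, FN, FP, TN
TP <- 40; FN <- 10; FP <- 20; TN <- 130
# accuracy = correct predictions / all predictions
(TP + TN) / (TP + FN + FP + TN)
## [1] 0.85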
Classification often involves a balance between being overly conservative and overly aggressive in decision making.
For example, an e-mail filter could guarantee to eliminate every spam message by aggressively eliminating nearly every ham message at the same time.
On the other hand, a guarantee that no ham messages will be inadvertently filtered might allow an unacceptable amount of spam to pass through the filter.
This trade-off is captured by two metrics: sensitivity and specificity.
The sensitivity of a model (also called the true positive rate) measures the proportion of positive examples that were correctly classified.
\[sensitivity\ =\ \frac{TP}{TP\ +\ FN}\]
The specificity of a model (also called the true negative rate) measures the proportion of negative examples that were correctly classified.
\[specificity\ =\ \frac{TN}{TN\ +\ FP}\]
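Using the same made-up counts as before, sensitivity and specificity follow directly from their definitions:
# hypothetical counts again: TP, FN, FP, TN
TP <- 40; FN <- 10; FP <- 20; TN <- 130
TP / (TP + FN) # sensitivity (true positive rate)
## [1] 0.8
TN / (TN + FP) # specificity (true negative rate)
## [1] 0.8666667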
Closely related to sensitivity and specificity are two other performance metrics: precision and recall.
Precision (also known as the positive predictive value) is defined as the proportion of predicted positives that are truly positive; in other words, when the model predicts the positive class, how often is it correct?
A precise model predicts the positive class only in cases that are very likely to be positive, so its positive predictions are very trustworthy.
\[precision\ =\ \frac{TP}{TP\ +\ FP}\]
Recall, on the other hand, is a measure of how complete the results are.
It is defined as the number of true positives divided by the total number of actual positives.
You may recognize that this is the same as sensitivity, only the interpretation differs.
A model with high recall captures a large portion of the positive examples, meaning that it has wide breadth.
\[recall\ =\ \frac{TP}{TP\ +\ FN}\]
A metric of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or the F-score).
\[F\text{-}measure\ =\ \frac{2\ \cdot\ precision\ \cdot\ recall}{precision\ +\ recall}=\frac{2\cdot TP}{2\cdot TP\ +\ FP\ +\ FN}\]
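With the same made-up counts, precision, recall and the F-measure can be computed in a few lines:
# hypothetical counts again: TP, FN, FP, TN
TP <- 40; FN <- 10; FP <- 20; TN <- 130
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
precision
## [1] 0.6666667
recall
## [1] 0.8
2 * precision * recall / (precision + recall) # F-measure
## [1] 0.7272727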
The ROC curve (Receiver Operating Characteristic) is commonly used to examine the trade-off between detecting true positives and avoiding false positives.
The closer the curve is to the perfect classifier, the better it is at identifying positive values.
This can be measured using a metric known as the area under the ROC curve (abbreviated AUC).
AUC ranges from 0.5 (for a classifier with no predictive value), to 1.0 (for a perfect classifier).
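As a rough sketch of what the AUC measures, it can be approximated by the trapezoidal rule over the (FPR, TPR) points obtained by sweeping a threshold across some made-up scores and labels:
# made-up scores and labels (1 = positive), purely for illustration
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)
cuts <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(cuts, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(cuts, function(t) mean(scores[labels == 0] >= t))
tpr <- c(0, tpr, 1); fpr <- c(0, fpr, 1)
# area under the ROC curve by the trapezoidal rule
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
## [1] 0.8125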
Here, we will use the field.goals dataset from the nutshell package as an example, fit several classifiers, and compute these performance metrics for each of them.
#######################################################
### use field.goals dataset to test classifier performance
#######################################################
# load the data set
library(nutshell)
data(field.goals)
# recode play.type as a binary factor: good vs. bad
data <- transform(field.goals,
play.type = factor(ifelse(play.type == "FG good", "good", "bad")))
head(data)
## home.team week qtr away.team offense defense play.type player yards stadium.type
## 1 ARI 14 3 WAS ARI WAS good 1-N.Rackers 20 Out
## 2 ARI 2 4 STL ARI STL good 1-N.Rackers 35 Out
## 3 ARI 7 3 TEN ARI TEN good 1-N.Rackers 24 Out
## 4 ARI 12 2 JAC JAC ARI good 10-J.Scobee 30 Out
## 5 ARI 2 3 STL ARI STL good 1-N.Rackers 48 Out
## 6 ARI 7 4 TEN TEN ARI bad 15-C.Hentrich 33 Out
summary(data)
## home.team week qtr away.team offense defense play.type
## NYG : 38 Min. : 1.000 Min. :1.000 NO : 37 ARI : 45 SEA : 42 bad :195
## HOU : 37 1st Qu.: 5.000 1st Qu.:2.000 OAK : 37 NYG : 42 NO : 40 good:787
## ARI : 36 Median : 9.000 Median :2.000 STL : 37 BAL : 36 OAK : 38
## MIN : 36 Mean : 9.203 Mean :2.579 CLE : 35 BUF : 35 SF : 37
## NYJ : 36 3rd Qu.:13.750 3rd Qu.:4.000 NYG : 35 CAR : 35 CLE : 36
## BAL : 35 Max. :17.000 Max. :5.000 SEA : 35 HOU : 35 NYJ : 35
## (Other):764 (Other):766 (Other):754 (Other):754
## player yards stadium.type
## 1-N.Rackers: 42 Min. :18.00 Both: 69
## 2-J.Feely : 42 1st Qu.:28.00 In :184
## 4-J.Kasay : 35 Median :37.00 Out :729
## 9-R.Lindell: 35 Mean :36.28
## 1-P.Edinger: 34 3rd Qu.:44.00
## 3-K.Brown : 34 Max. :62.00
## (Other) :760
We split the field.goals data set into a training set and a test set.
#########################################
## split the data set to train and test
#########################################
n <- length(data[,1])
index1 <- 1 : n
# divide the data into 5 parts
index2 <- rep(1 : 5, ceiling(n / 5))[1 : n]
set.seed(100)
# shuffle the part assignments
index2 <- sample(index2, n)
# take one part as the test-set indices
m <- index1[index2 == 1]
trainset <- data[-m, ]
testset <- data[m, ]
summary(trainset)
## home.team week qtr away.team offense defense play.type
## NYG : 31 Min. : 1.000 Min. :1.000 CLE : 31 ARI : 37 SEA : 33 bad :155
## ARI : 30 1st Qu.: 5.000 1st Qu.:2.000 NO : 31 NYG : 33 NYJ : 31 good:630
## CIN : 30 Median : 9.000 Median :2.000 STL : 30 BUF : 32 ATL : 30
## NYJ : 30 Mean : 9.155 Mean :2.595 BUF : 29 BAL : 31 NO : 30
## ATL : 28 3rd Qu.:14.000 3rd Qu.:4.000 ARI : 28 DEN : 29 GB : 29
## GB : 28 Max. :17.000 Max. :5.000 BAL : 28 STL : 29 MIN : 29
## (Other):608 (Other):608 (Other):594 (Other):603
## player yards stadium.type
## 1-N.Rackers : 35 Min. :18.00 Both: 55
## 2-J.Feely : 33 1st Qu.:28.00 In :147
## 9-R.Lindell : 32 Median :37.00 Out :583
## 1-J.Elam : 29 Mean :36.26
## 14-J.Wilkins: 29 3rd Qu.:44.00
## 3-M.Stover : 29 Max. :61.00
## (Other) :598
summary(testset)
## home.team week qtr away.team offense defense play.type
## BAL : 11 Min. : 1.000 Min. :1.000 NYG : 10 HOU : 11 OAK : 11 bad : 40
## CHI : 10 1st Qu.: 5.000 1st Qu.:2.000 SF : 10 KC : 11 STL : 11 good:157
## CLE : 10 Median :10.000 Median :2.000 MIA : 9 MIN : 11 BAL : 10
## HOU : 9 Mean : 9.391 Mean :2.518 OAK : 9 CAR : 9 NO : 10
## KC : 9 3rd Qu.:13.000 3rd Qu.:4.000 MIN : 8 NYG : 9 SF : 10
## NO : 9 Max. :17.000 Max. :4.000 NE : 8 SF : 9 CHI : 9
## (Other):139 (Other):143 (Other):137 (Other):136
## player yards stadium.type
## 1-L.Tynes : 11 Min. :19.00 Both: 14
## 1-P.Edinger: 11 1st Qu.:29.00 In : 37
## 3-K.Brown : 11 Median :36.00 Out :146
## 2-J.Feely : 9 Mean :36.38
## 4-J.Kasay : 9 3rd Qu.:44.00
## 6-J.Nedney : 9 Max. :62.00
## (Other) :137
Here, our aim is to detect the “bad” play type.
Because the number of “bad” plays is small, a model that predicts every play as “good” already reaches an accuracy of 630/(630 + 155) ≈ 0.80 on the training set, i.e. about 80% of its predictions are correct, yet it is completely useless for detecting “bad” plays.
So relying on accuracy alone to compare classifiers on an imbalanced data set (here bad : good is 155 : 630) is misleading and largely meaningless.
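For instance, the majority-class baseline on the training set is easy to compute:
# accuracy of the naive "always predict good" baseline on the training set
table(trainset$play.type)
##
## bad good
## 155 630
max(table(trainset$play.type)) / nrow(trainset)
## [1] 0.8025478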
So we fit several classification models, compute their confusion matrices on the training set and the test set, and then derive the metrics above from those confusion matrices.
Logistic regression model in train set.
## logistic regression
logic_model<- glm(play.type ~ ., data = trainset, family = binomial)
#summary(logic_model)
logic_model2<-step(logic_model)
## Start: AIC=781.26
## play.type ~ home.team + week + qtr + away.team + offense + defense +
## player + yards + stadium.type
##
##
## Step: AIC=781.26
## play.type ~ home.team + week + qtr + away.team + offense + defense +
## player + yards
##
## Df Deviance AIC
## - away.team 1 534.4 776.4
## - qtr 1 534.9 776.9
## <none> 537.3 781.3
## - player 25 589.6 783.6
## - home.team 2 1868.9 2108.9
## - defense 6 9515.5 9747.5
## - week 7 9876.0 10106.0
## - yards 7 12326.9 12556.9
## - offense 1 20229.5 20471.5
##
## Step: AIC=776.38
## play.type ~ home.team + week + qtr + offense + defense + player +
## yards
##
## Df Deviance AIC
## - defense 36 584.51 754.51
## - offense 7 535.54 763.54
## - home.team 31 589.31 769.31
## - qtr 2 535.27 773.27
## <none> 534.38 776.38
## - player 24 589.56 783.56
## - yards 1 629.25 869.25
## - week 2 898.40 1136.40
##
## Step: AIC=754.51
## play.type ~ home.team + week + qtr + offense + player + yards
##
## Df Deviance AIC
## - home.team 31 614.63 722.63
## - week 1 584.51 752.51
## - qtr 1 584.54 752.54
## - offense 2 587.22 753.22
## <none> 584.51 754.51
## - player 19 630.20 762.20
## - yards -4 682.14 860.14
##
## Step: AIC=722.63
## play.type ~ week + qtr + offense + player + yards
##
## Df Deviance AIC
## - week 1 614.63 720.63
## - qtr 1 614.69 720.69
## - offense 2 618.54 722.54
## <none> 614.63 722.63
## - player 19 659.17 729.17
## - yards 1 701.87 807.87
##
## Step: AIC=720.63
## play.type ~ qtr + offense + player + yards
##
## Df Deviance AIC
## - qtr 1 614.69 718.69
## - offense 2 618.55 720.55
## <none> 614.63 720.63
## - player 19 659.18 727.18
## - yards -4 708.81 822.81
##
## Step: AIC=718.69
## play.type ~ offense + player + yards
##
## Df Deviance AIC
## - offense 2 618.67 718.67
## <none> 614.69 718.69
## - player 19 659.58 725.58
## - yards -4 800.13 912.13
##
## Step: AIC=718.67
## play.type ~ player + yards
##
## Df Deviance AIC
## - player 48 692.43 696.43
## <none> 618.67 718.67
## - yards 1 707.00 805.00
##
## Step: AIC=696.43
## play.type ~ yards
##
## Df Deviance AIC
## <none> 692.43 696.43
## - yards 1 780.05 782.05
#summary(logic_model2)
# confusion matrix in trainset
logic_model2_pred_train <- predict(logic_model2, trainset, type = "response")
table(ifelse(logic_model2_pred_train > 0.5, "good", "bad"), trainset$play.type)
##
## bad good
## bad 11 6
## good 144 624
library(caret)
confusionMatrix(factor(ifelse(logic_model2_pred_train > 0.5, "good", "bad")), trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 11 6
## good 144 624
##
## Accuracy : 0.8089
## 95% CI : (0.7796, 0.8359)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 0.3461
##
## Kappa : 0.0925
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.07097
## Specificity : 0.99048
## Pos Pred Value : 0.64706
## Neg Pred Value : 0.81250
## Prevalence : 0.19745
## Detection Rate : 0.01401
## Detection Prevalence : 0.02166
## Balanced Accuracy : 0.53072
##
## 'Positive' Class : bad
##
Logistic regression model in test set.
# confusion matrix in testset
logic_model2_pred_test <- predict(logic_model2, testset, type = "response")
table(ifelse(logic_model2_pred_test > 0.5, "good", "bad"), testset$play.type)
##
## bad good
## bad 1 1
## good 39 156
confusionMatrix(factor(ifelse(logic_model2_pred_test > 0.5, "good", "bad")), testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 1 1
## good 39 156
##
## Accuracy : 0.797
## 95% CI : (0.7339, 0.8508)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.5422
##
## Kappa : 0.0288
##
## Mcnemar's Test P-Value : 4.909e-09
##
## Sensitivity : 0.025000
## Specificity : 0.993631
## Pos Pred Value : 0.500000
## Neg Pred Value : 0.800000
## Prevalence : 0.203046
## Detection Rate : 0.005076
## Detection Prevalence : 0.010152
## Balanced Accuracy : 0.509315
##
## 'Positive' Class : bad
##
kNN in train set.
## k-nearest neighbor
library(class)
train_x <- as.matrix(trainset[, c(2, 3, 9)])
test_x <- as.matrix(testset[, c(2, 3, 9)])
train_y <- trainset[, 7]
set.seed(1)
knn_pred_train <- knn(train_x, train_x, train_y, k = 4)
table(knn_pred_train, trainset$play.type)
##
## knn_pred_train bad good
## bad 52 31
## good 103 599
confusionMatrix(knn_pred_train, trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 52 31
## good 103 599
##
## Accuracy : 0.8293
## 95% CI : (0.8011, 0.855)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 0.03136
##
## Kappa : 0.3471
##
## Mcnemar's Test P-Value : 8.598e-10
##
## Sensitivity : 0.33548
## Specificity : 0.95079
## Pos Pred Value : 0.62651
## Neg Pred Value : 0.85328
## Prevalence : 0.19745
## Detection Rate : 0.06624
## Detection Prevalence : 0.10573
## Balanced Accuracy : 0.64314
##
## 'Positive' Class : bad
##
kNN in test set.
knn_pred_test <- knn(train_x, test_x, train_y, k = 4)
table(knn_pred_test, testset$play.type)
##
## knn_pred_test bad good
## bad 7 17
## good 33 140
confusionMatrix(knn_pred_test, testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 7 17
## good 33 140
##
## Accuracy : 0.7462
## 95% CI : (0.6794, 0.8054)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.96578
##
## Kappa : 0.0784
##
## Mcnemar's Test P-Value : 0.03389
##
## Sensitivity : 0.17500
## Specificity : 0.89172
## Pos Pred Value : 0.29167
## Neg Pred Value : 0.80925
## Prevalence : 0.20305
## Detection Rate : 0.03553
## Detection Prevalence : 0.12183
## Balanced Accuracy : 0.53336
##
## 'Positive' Class : bad
##
Decision tree in train set.
## decision tree
library(tree)
tree_model <- tree(play.type ~ week + qtr + yards + stadium.type, trainset)
summary(tree_model)
##
## Classification tree:
## tree(formula = play.type ~ week + qtr + yards + stadium.type,
## data = trainset)
## Variables actually used in tree construction:
## [1] "yards"
## Number of terminal nodes: 4
## Residual mean deviance: 0.868 = 677.9 / 781
## Misclassification error rate: 0.1898 = 149 / 785
plot(tree_model)
text(tree_model, pretty = 0)
tree_pred_train <- predict(tree_model, trainset, type = "class")
table(tree_pred_train, trainset$play.type)
##
## tree_pred_train bad good
## bad 33 27
## good 122 603
confusionMatrix(tree_pred_train, trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 33 27
## good 122 603
##
## Accuracy : 0.8102
## 95% CI : (0.781, 0.8371)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 0.3134
##
## Kappa : 0.2211
##
## Mcnemar's Test P-Value : 1.352e-14
##
## Sensitivity : 0.21290
## Specificity : 0.95714
## Pos Pred Value : 0.55000
## Neg Pred Value : 0.83172
## Prevalence : 0.19745
## Detection Rate : 0.04204
## Detection Prevalence : 0.07643
## Balanced Accuracy : 0.58502
##
## 'Positive' Class : bad
##
Decision tree in test set.
tree_pred_test <- predict(tree_model, testset, type = "class")
table(tree_pred_test, testset$play.type)
##
## tree_pred_test bad good
## bad 8 11
## good 32 146
confusionMatrix(tree_pred_test, testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 8 11
## good 32 146
##
## Accuracy : 0.7817
## 95% CI : (0.7175, 0.8373)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.735843
##
## Kappa : 0.1615
##
## Mcnemar's Test P-Value : 0.002289
##
## Sensitivity : 0.20000
## Specificity : 0.92994
## Pos Pred Value : 0.42105
## Neg Pred Value : 0.82022
## Prevalence : 0.20305
## Detection Rate : 0.04061
## Detection Prevalence : 0.09645
## Balanced Accuracy : 0.56497
##
## 'Positive' Class : bad
##
Bagging in train set.
## bagging
library(randomForest)
bag_model <- randomForest(play.type ~ week + qtr + yards + stadium.type, data = trainset,
mtry = 4,
importance = TRUE)
bag_model
##
## Call:
## randomForest(formula = play.type ~ week + qtr + yards + stadium.type, data = trainset, mtry = 4, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 25.35%
## Confusion matrix:
## bad good class.error
## bad 39 116 0.7483871
## good 83 547 0.1317460
bag_pred_train <- predict(bag_model, newdata = trainset)
table(bag_pred_train, trainset$play.type)
##
## bag_pred_train bad good
## bad 144 4
## good 11 626
confusionMatrix(bag_pred_train, trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 144 4
## good 11 626
##
## Accuracy : 0.9809
## 95% CI : (0.9687, 0.9893)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9387
##
## Mcnemar's Test P-Value : 0.1213
##
## Sensitivity : 0.9290
## Specificity : 0.9937
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.9827
## Prevalence : 0.1975
## Detection Rate : 0.1834
## Detection Prevalence : 0.1885
## Balanced Accuracy : 0.9613
##
## 'Positive' Class : bad
##
Bagging in test set.
bag_pred_test <- predict(bag_model, newdata = testset)
table(bag_pred_test, testset$play.type)
##
## bag_pred_test bad good
## bad 8 24
## good 32 133
confusionMatrix(bag_pred_test, testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 8 24
## good 32 133
##
## Accuracy : 0.7157
## 95% CI : (0.6473, 0.7776)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.9975
##
## Kappa : 0.0509
##
## Mcnemar's Test P-Value : 0.3496
##
## Sensitivity : 0.20000
## Specificity : 0.84713
## Pos Pred Value : 0.25000
## Neg Pred Value : 0.80606
## Prevalence : 0.20305
## Detection Rate : 0.04061
## Detection Prevalence : 0.16244
## Balanced Accuracy : 0.52357
##
## 'Positive' Class : bad
##
Random forest in train set.
## random forest
rf_model <- randomForest(play.type ~ week + qtr + yards + stadium.type, data = trainset,
importance = TRUE)
rf_model
##
## Call:
## randomForest(formula = play.type ~ week + qtr + yards + stadium.type, data = trainset, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.69%
## Confusion matrix:
## bad good class.error
## bad 29 126 0.8129032
## good 60 570 0.0952381
rf_pred_train <- predict(rf_model, newdata = trainset)
table(rf_pred_train, trainset$play.type)
##
## rf_pred_train bad good
## bad 125 2
## good 30 628
confusionMatrix(rf_pred_train, trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 125 2
## good 30 628
##
## Accuracy : 0.9592
## 95% CI : (0.9429, 0.972)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.862
##
## Mcnemar's Test P-Value : 1.815e-06
##
## Sensitivity : 0.8065
## Specificity : 0.9968
## Pos Pred Value : 0.9843
## Neg Pred Value : 0.9544
## Prevalence : 0.1975
## Detection Rate : 0.1592
## Detection Prevalence : 0.1618
## Balanced Accuracy : 0.9016
##
## 'Positive' Class : bad
##
Random forest in test set.
rf_pred_test <- predict(rf_model, newdata = testset)
table(rf_pred_test, testset$play.type)
##
## rf_pred_test bad good
## bad 8 17
## good 32 140
confusionMatrix(rf_pred_test, testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 8 17
## good 32 140
##
## Accuracy : 0.7513
## 95% CI : (0.6848, 0.81)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.9510
##
## Kappa : 0.1066
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.20000
## Specificity : 0.89172
## Pos Pred Value : 0.32000
## Neg Pred Value : 0.81395
## Prevalence : 0.20305
## Detection Rate : 0.04061
## Detection Prevalence : 0.12690
## Balanced Accuracy : 0.54586
##
## 'Positive' Class : bad
##
Boosting in train set.
## boosting
library(gbm)
boost_model <- gbm(as.numeric(play.type) - 1 ~ week + qtr + yards + stadium.type, data = trainset)
## Distribution not specified, assuming bernoulli ...
summary(boost_model)
## var rel.inf
## yards yards 79.739582
## week week 12.703111
## qtr qtr 3.970923
## stadium.type stadium.type 3.586384
boost_pred_train <- predict(boost_model, newdata = trainset, n.trees = 100,
type = "response")
table(ifelse(boost_pred_train > 0.8, "good", "bad"), trainset$play.type)
##
## bad good
## bad 97 157
## good 58 473
confusionMatrix(factor(ifelse(boost_pred_train > 0.8, "good", "bad")), trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 97 157
## good 58 473
##
## Accuracy : 0.7261
## 95% CI : (0.6935, 0.757)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3035
##
## Mcnemar's Test P-Value : 2.332e-11
##
## Sensitivity : 0.6258
## Specificity : 0.7508
## Pos Pred Value : 0.3819
## Neg Pred Value : 0.8908
## Prevalence : 0.1975
## Detection Rate : 0.1236
## Detection Prevalence : 0.3236
## Balanced Accuracy : 0.6883
##
## 'Positive' Class : bad
##
Boosting in test set.
boost_pred_test <- predict(boost_model, newdata = testset, n.trees = 100,
type = "response")
table(ifelse(boost_pred_test > 0.8, "good", "bad"), testset$play.type)
##
## bad good
## bad 25 36
## good 15 121
confusionMatrix(factor(ifelse(boost_pred_test > 0.8, "good", "bad")), testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 25 36
## good 15 121
##
## Accuracy : 0.7411
## 95% CI : (0.6741, 0.8008)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.976652
##
## Kappa : 0.331
##
## Mcnemar's Test P-Value : 0.005101
##
## Sensitivity : 0.6250
## Specificity : 0.7707
## Pos Pred Value : 0.4098
## Neg Pred Value : 0.8897
## Prevalence : 0.2030
## Detection Rate : 0.1269
## Detection Prevalence : 0.3096
## Balanced Accuracy : 0.6979
##
## 'Positive' Class : bad
##
Support vector machine in train set.
## support vector machine
library(e1071)
svm_model <- svm(play.type ~ week + qtr + yards + stadium.type, data = trainset,
kernel = "linear", scale = FALSE)
summary(svm_model)
##
## Call:
## svm(formula = play.type ~ week + qtr + yards + stadium.type, data = trainset, kernel = "linear",
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 348
##
## ( 193 155 )
##
##
## Number of Classes: 2
##
## Levels:
## bad good
svm_pred_train <- predict(svm_model, trainset)
table(svm_pred_train, trainset$play.type)
##
## svm_pred_train bad good
## bad 0 0
## good 155 630
confusionMatrix(svm_pred_train, trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 0 0
## good 155 630
##
## Accuracy : 0.8025
## 95% CI : (0.7729, 0.8299)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 0.5215
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8025
## Prevalence : 0.1975
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : bad
##
Support vector machine in test set.
svm_pred_test <- predict(svm_model, testset)
table(svm_pred_test, testset$play.type)
##
## svm_pred_test bad good
## bad 0 0
## good 40 157
confusionMatrix(svm_pred_test, testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 0 0
## good 40 157
##
## Accuracy : 0.797
## 95% CI : (0.7339, 0.8508)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.5422
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 6.984e-10
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.797
## Prevalence : 0.203
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : bad
##
Naive Bayes in train set.
## naive bayes
library(e1071)
bayes_model <- naiveBayes(trainset$yards, trainset$play.type)
summary(bayes_model)
## Length Class Mode
## apriori 2 table numeric
## tables 1 -none- list
## levels 2 -none- character
## isnumeric 1 -none- logical
## call 3 -none- call
bayes_pred_train <- predict(bayes_model, trainset$yards, type = "class")
table(bayes_pred_train, trainset$play.type)
##
## bayes_pred_train bad good
## bad 0 0
## good 155 630
confusionMatrix(bayes_pred_train, trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 0 0
## good 155 630
##
## Accuracy : 0.8025
## 95% CI : (0.7729, 0.8299)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 0.5215
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8025
## Prevalence : 0.1975
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : bad
##
Naive Bayes in test set.
bayes_pred_test <- predict(bayes_model, testset$yards, type = "class")
table(bayes_pred_test, testset$play.type)
##
## bayes_pred_test bad good
## bad 0 0
## good 40 157
confusionMatrix(bayes_pred_test, testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 0 0
## good 40 157
##
## Accuracy : 0.797
## 95% CI : (0.7339, 0.8508)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.5422
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 6.984e-10
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.797
## Prevalence : 0.203
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : bad
##
Artificial neural networks in train set.
## artificial neural networks
library(nnet)
ann_model <- nnet(play.type ~ week + qtr + yards + stadium.type, data = trainset, size = 3)
## # weights: 22
## initial value 526.864211
## iter 10 value 390.021519
## iter 20 value 382.073496
## iter 30 value 349.819340
## iter 40 value 346.730901
## iter 50 value 345.183199
## iter 60 value 342.878101
## iter 70 value 334.338800
## iter 80 value 331.712558
## iter 90 value 330.983791
## iter 100 value 330.705134
## final value 330.705134
## stopped after 100 iterations
ann_pred_train <- predict(ann_model, newdata = trainset, type = "class")
table(ann_pred_train, trainset$play.type)
##
## ann_pred_train bad good
## bad 22 19
## good 133 611
confusionMatrix(factor(ann_pred_train), trainset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 22 19
## good 133 611
##
## Accuracy : 0.8064
## 95% CI : (0.777, 0.8335)
## No Information Rate : 0.8025
## P-Value [Acc > NIR] : 0.4147
##
## Kappa : 0.1547
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.14194
## Specificity : 0.96984
## Pos Pred Value : 0.53659
## Neg Pred Value : 0.82124
## Prevalence : 0.19745
## Detection Rate : 0.02803
## Detection Prevalence : 0.05223
## Balanced Accuracy : 0.55589
##
## 'Positive' Class : bad
##
Artificial neural networks in test set.
ann_pred_test <- predict(ann_model, newdata = testset, type = "class")
table(ann_pred_test, testset$play.type)
##
## ann_pred_test bad good
## bad 5 6
## good 35 151
confusionMatrix(factor(ann_pred_test), testset$play.type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 5 6
## good 35 151
##
## Accuracy : 0.7919
## 95% CI : (0.7284, 0.8463)
## No Information Rate : 0.797
## P-Value [Acc > NIR] : 0.611
##
## Kappa : 0.1189
##
## Mcnemar's Test P-Value : 1.226e-05
##
## Sensitivity : 0.12500
## Specificity : 0.96178
## Pos Pred Value : 0.45455
## Neg Pred Value : 0.81183
## Prevalence : 0.20305
## Detection Rate : 0.02538
## Detection Prevalence : 0.05584
## Balanced Accuracy : 0.54339
##
## 'Positive' Class : bad
##
Summarizing the confusion matrices above, we find that the support vector machine and naive Bayes produce identical confusion matrices on the test set: both predict every play type as “good”.
If we collect the metrics of all the models above into a table and compare them between the training set and the test set, we will see that some classifiers do much better on the training set than on the test set, which is why splitting the data into a training set and a test set is important.
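A sketch of how such a comparison table could be assembled with caret, reusing test-set predictions already computed above (only three of the models are shown; the others can be added in the same way):
# gather test-set accuracy, recall (sensitivity for "bad") and precision into one table
cm_logic <- confusionMatrix(factor(ifelse(logic_model2_pred_test > 0.5, "good", "bad")), testset$play.type)
cm_knn <- confusionMatrix(knn_pred_test, testset$play.type)
cm_tree <- confusionMatrix(tree_pred_test, testset$play.type)
data.frame(model = c("logistic", "knn", "tree"),
  accuracy = c(cm_logic$overall["Accuracy"], cm_knn$overall["Accuracy"], cm_tree$overall["Accuracy"]),
  recall = c(cm_logic$byClass["Sensitivity"], cm_knn$byClass["Sensitivity"], cm_tree$byClass["Sensitivity"]),
  precision = c(cm_logic$byClass["Pos Pred Value"], cm_knn$byClass["Pos Pred Value"], cm_tree$byClass["Pos Pred Value"]))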
Our aim is to detect as many “bad” play types as possible while at the same time keeping accuracy as high as possible.
First we look for high recall on the test set; second, we take test-set accuracy into account.
On that basis, we choose boosting, kNN, and logistic regression.
In fact, a classifier does not only return a predicted label of “good” or “bad”; it also returns a probability (or score) for “good” versus “bad”.
For logistic regression, in the model above I simply chose 0.5 as the threshold and then computed the confusion matrix, recall, precision, and so on.
If instead of fixing one threshold we evaluate every possible threshold, each one yields its own confusion matrix, recall, and precision; we end up with a whole series of these measures, which can be plotted in a single graph such as the ROC curve.
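For example, two different thresholds applied to the same training-set probabilities from the logistic model already give two different confusion matrices:
# the same predicted probabilities, tabulated at two different thresholds
table(ifelse(logic_model2_pred_train > 0.5, "good", "bad"), trainset$play.type)
table(ifelse(logic_model2_pred_train > 0.8, "good", "bad"), trainset$play.type)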
Logistic regression model in train set.
At a cut-off of 0.5 the TPR (true positive rate, i.e. recall or sensitivity) is only about 0.04; at the best cut-off of about 0.8326 the TPR rises to about 0.58, which is better.
############################################################
### make roc curve
############################################################
# make single roc plot method 1
library(ROCR)
pred <- prediction(logic_model2_pred_train, trainset$play.type)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)
grid(5, 5, lwd = 1)
points(c(0,1), c(0,1), type="l", lty=2, lwd=2, col="grey")
auc.tmp <- performance(pred,"auc")
auc <- as.numeric(auc.tmp@y.values)
auc <- round(auc, 4)
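One way to read a “best” cut-off from the ROCR objects (a sketch; not necessarily how the 0.8326 value above was chosen) is to pick the threshold that maximizes TPR - FPR, i.e. Youden's index:
# cut-offs and the corresponding TPR/FPR stored in the performance object
cutoffs <- perf@alpha.values[[1]]
tpr <- perf@y.values[[1]]
fpr <- perf@x.values[[1]]
best <- which.max(tpr - fpr)
cutoffs[best] # candidate best cut-off
tpr[best] # TPR at that cut-off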
We can compare the ROC curves of several models on the training set in one graph.
The closer an ROC curve is to the top-left corner, the better the model.
Judging from the training-set predictions, bagging and random forest are the best (a small AUC sketch after the plotting code below makes the same comparison numerically).
#############################################
# make multiply roc plot in train set
#############################################
# compute FPR and TPR over all observed score thresholds;
# p should be a score that is larger for "good" plays, and "good" is treated as the positive class
fpr_tpr <- function(p, dataset) {
  TPR <- rep(0, length(p))
  FPR <- rep(0, length(p))
  for(i in 1 : length(p)){
    # use the i-th smallest score as the current threshold
    p0 <- p[order(p)[i]]
    label_true <- ifelse(dataset$play.type == "good", 1, 0)
    label_pred <- 1 * (p > p0)
    TPR[i] <- sum(label_pred * label_true) / sum(label_true)
    FPR[i] <- sum(label_pred * (1 - label_true)) / sum(1 - label_true)
  }
  list(fpr = FPR, tpr = TPR)
}
# plot logistic regression roc
logic_train_measures <- fpr_tpr(p = logic_model2_pred_train, dataset = trainset)
plot(logic_train_measures$fpr, logic_train_measures$tpr, type = "l", col = 2,
ylab = "TPR", xlab = "FPR")
title("ROC curve in train set")
points(c(0, 1), c(0, 1), type = "l", lty = 2)
# plot knn roc
knn_pred_train2 <- knn(train_x, train_x, train_y, k = 4, prob = TRUE)
knn_train_measures <- fpr_tpr(p = attr(knn_pred_train2, "prob"), dataset = trainset)
points(knn_train_measures$fpr, knn_train_measures$tpr, type = "l", col = 3)
# plot tree roc
tree_pred_train2 <- predict(tree_model, trainset, type = "vector")
tree_train_measures <- fpr_tpr(p = tree_pred_train2[, 2], dataset = trainset)
points(tree_train_measures$fpr, tree_train_measures$tpr, type = "l", col = 4)
# plot bagging roc
bag_pred_train2 <- predict(bag_model, newdata = trainset, type = "prob")
bag_train_measures <- fpr_tpr(p = bag_pred_train2[, 2], dataset = trainset)
points(bag_train_measures$fpr, bag_train_measures$tpr, type = "l", col = 5)
# plot random forest roc
rf_pred_train2 <- predict(rf_model, newdata = trainset, type = "prob")
rf_train_measures <- fpr_tpr(p = rf_pred_train2[, 2], dataset = trainset)
points(rf_train_measures$fpr, rf_train_measures$tpr, type = "l", col = 6)
# plot boosting roc
boost_train_measures <- fpr_tpr(p = boost_pred_train, dataset = trainset)
points(boost_train_measures$fpr, boost_train_measures$tpr, type = "l", col = 7)
# plot support vector machine roc
svm_pred_train2 <- predict(svm_model, trainset, decision.values = TRUE)
svm_train_measures <- fpr_tpr(p = attr(svm_pred_train2, "decision.values"), dataset = trainset)
points(svm_train_measures$fpr, svm_train_measures$tpr, type = "l", col = 8)
# plot naive bayes roc
bayes_pred_train2 <- predict(bayes_model, trainset$yards, type = "raw")
bayes_train_measures <- fpr_tpr(p = bayes_pred_train2[, 2], dataset = trainset)
points(bayes_train_measures$fpr, bayes_train_measures$tpr, type = "l", col = 9)
# plot artificial neural networks
ann_pred_train2 <- predict(ann_model, newdata = trainset, type = "raw")
ann_train_measures <- fpr_tpr(p = ann_pred_train2[, 1], dataset = trainset)
points(ann_train_measures$fpr, ann_train_measures$tpr, type = "l", col = 10)
legend("bottomright", c("logis", "knn", "tree", "bag", "rf", "boost",
"svm", "bayes", "ann"),
lwd = c(2, 2, 2),
lty = rep(1, 9), col = 2:10, cex = .8)
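One way to quantify this training-set comparison (a sketch; the resulting values are not shown here) is to compute the training-set AUC of each score vector with ROCR:
# AUC on the training set; larger is better
auc_of <- function(scores, labels) {
  as.numeric(performance(prediction(scores, labels), "auc")@y.values)
}
auc_of(logic_model2_pred_train, trainset$play.type)
auc_of(bag_pred_train2[, 2], trainset$play.type)
auc_of(rf_pred_train2[, 2], trainset$play.type)
auc_of(boost_pred_train, trainset$play.type)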
Now we look at the ROC curves of the various models on the test set.
################################################
# make multiply roc plot in test set
################################################
# plot logistic regression roc
logic_test_measures <- fpr_tpr(p = logic_model2_pred_test, dataset = testset)
plot(logic_test_measures$fpr, logic_test_measures$tpr, type = "l", col = 2,
ylab = "TPR", xlab = "FPR")
title("ROC curve in test set")
points(c(0, 1), c(0, 1), type = "l", lty = 2)
# plot knn roc
knn_pred_test2 <- knn(train_x, test_x, train_y, k = 4, prob = TRUE)
knn_test_measures <- fpr_tpr(p = attr(knn_pred_test2, "prob"), dataset = testset)
points(knn_test_measures$fpr, knn_test_measures$tpr, type = "l", col = 3)
# plot tree roc
tree_pred_test2 <- predict(tree_model, testset, type = "vector")
tree_test_measures <- fpr_tpr(p = tree_pred_test2[, 2], dataset = testset)
points(tree_test_measures$fpr, tree_test_measures$tpr, type = "l", col = 4)
# plot bagging roc
bag_pred_test2 <- predict(bag_model, newdata = testset, type = "prob")
bag_test_measures <- fpr_tpr(p = bag_pred_test2[, 2], dataset = testset)
points(bag_test_measures$fpr, bag_test_measures$tpr, type = "l", col = 5)
# plot random forest roc
rf_pred_test2 <- predict(rf_model, newdata = testset, type = "prob")
rf_test_measures <- fpr_tpr(p = rf_pred_test2[, 2], dataset = testset)
points(rf_test_measures$fpr, rf_test_measures$tpr, type = "l", col = 6)
# plot boosting roc
boost_test_measures <- fpr_tpr(p = boost_pred_test, dataset = testset)
points(boost_test_measures$fpr, boost_test_measures$tpr, type = "l", col = 7)
# plot support vector machine roc
svm_pred_test2 <- predict(svm_model, testset, decision.values = TRUE)
svm_test_measures <- fpr_tpr(p = attr(svm_pred_test2, "decision.values"), dataset = testset)
points(svm_test_measures$fpr, svm_test_measures$tpr, type = "l", col = 8)
# plot naive bayes roc
bayes_pred_test2 <- predict(bayes_model, testset$yards, type = "raw")
bayes_test_measures <- fpr_tpr(p = bayes_pred_test2[, 2], dataset = testset)
points(bayes_test_measures$fpr, bayes_test_measures$tpr, type = "l", col = 9)
# plot artificial neural networks
ann_pred_test2 <- predict(ann_model, newdata = testset, type = "raw")
ann_test_measures <- fpr_tpr(p = ann_pred_test2[, 1], dataset = testset)
points(ann_test_measures$fpr, ann_test_measures$tpr, type = "l", col = 10)
legend("bottomright", c("logis", "knn", "tree", "bag", "rf", "boost",
"svm", "bayes", "ann"),
lwd = c(2, 2, 2),
lty = rep(1, 9), col = 2:10, cex = .8)
# plot the better models in one ROC graph
plot(logic_test_measures$fpr, logic_test_measures$tpr, type = "l", col = 2,
ylab = "TPR", xlab = "FPR")
title("ROC curve in test set")
points(c(0, 1), c(0, 1), type = "l", lty = 2)
points(boost_test_measures$fpr, boost_test_measures$tpr, type = "l", col = 3)
points(ann_test_measures$fpr, ann_test_measures$tpr, type = "l", col = 4)
legend("bottomright", c("logis", "boost", "ann"),
lwd = c(2, 2, 2),
lty = rep(1, 3), col = 2:4, cex = .8)
Here, the better models are logistic regression, artificial neural networks, and boosting.
Referenced books:
Machine Learning with R
R in a nutshell
Applied Predictive Modeling
An Introduction to Statistical Learning with Applications in R
For the record, this article was originally posted on LinkedIn and had 97 views as of November 2021.