Classification(Personal Project)

zionadd 2018. 10. 16. 15:54
Project4

Classification

Install and load the party package, then list the datasets it ships with

# Install party only if it is missing; require() already attaches it when present
if (!require(party)) {
  install.packages("party")
  library(party)
}
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
# Same pattern for caret, which provides confusionMatrix()
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
data(package="party")
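The install-then-load pattern above repeats once per package. A minimal sketch of a reusable helper follows; the name `ensure_packages` is hypothetical (it is not part of party or caret), and it assumes a CRAN mirror is reachable whenever a package is actually missing.

```r
# Hypothetical helper: install any missing packages, then attach all of them.
ensure_packages <- function(pkgs) {
  for (p in pkgs) {
    # requireNamespace() checks availability without attaching the package
    if (!requireNamespace(p, quietly = TRUE)) {
      install.packages(p)
    }
    # character.only = TRUE lets library() accept the name as a string
    library(p, character.only = TRUE)
  }
  invisible(pkgs)
}

# "stats" ships with R, so this call never triggers an install
ensure_packages("stats")
```

For this post the call would be `ensure_packages(c("party", "caret"))`.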

Load the readingSkills dataset from the party package and check its codebook (help page)

data("readingSkills")
help("readingSkills")
## starting httpd help server ... done

Quick look at the readingSkills dataset: first rows, structure, and basic descriptive statistics

head(readingSkills)
##   nativeSpeaker age shoeSize    score
## 1           yes   5 24.83189 32.29385
## 2           yes   6 25.95238 36.63105
## 3            no  11 30.42170 49.60593
## 4           yes   7 28.66450 40.28456
## 5           yes  11 31.88207 55.46085
## 6           yes  10 30.07843 52.83124
str(readingSkills)
## 'data.frame':    200 obs. of  4 variables:
##  $ nativeSpeaker: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 1 2 2 1 ...
##  $ age          : int  5 6 11 7 11 10 7 11 5 7 ...
##  $ shoeSize     : num  24.8 26 30.4 28.7 31.9 ...
##  $ score        : num  32.3 36.6 49.6 40.3 55.5 ...
summary(readingSkills)
##  nativeSpeaker      age            shoeSize         score      
##  no :100       Min.   : 5.000   Min.   :23.17   Min.   :25.26  
##  yes:100       1st Qu.: 6.000   1st Qu.:26.23   1st Qu.:33.94  
##                Median : 8.000   Median :27.85   Median :40.33  
##                Mean   : 7.925   Mean   :27.87   Mean   :40.66  
##                3rd Qu.: 9.250   3rd Qu.:29.49   3rd Qu.:47.57  
##                Max.   :11.000   Max.   :32.33   Max.   :56.71

Reorder the levels of the response variable nativeSpeaker to yes < no

raw <- na.omit(readingSkills)
# Re-level raw's own column so the lengths still match if na.omit() dropped any rows
raw$nativeSpeaker <- factor(raw$nativeSpeaker, levels = c("yes", "no"), ordered = TRUE)
str(raw)
## 'data.frame':    200 obs. of  4 variables:
##  $ nativeSpeaker: Ord.factor w/ 2 levels "yes"<"no": 1 1 2 1 1 1 2 1 1 2 ...
##  $ age          : int  5 6 11 7 11 10 7 11 5 7 ...
##  $ shoeSize     : num  24.8 26 30.4 28.7 31.9 ...
##  $ score        : num  32.3 36.6 49.6 40.3 55.5 ...
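Why the level order matters: by default R sorts factor levels alphabetically, so "no" would come first, and for a two-class problem caret's confusionMatrix() treats the first level as the positive class unless told otherwise. A small base R sketch on a toy vector (illustrative data, not the real column) shows what the re-leveling changes and what it leaves alone.

```r
# Toy vector standing in for readingSkills$nativeSpeaker (illustrative only)
speaker <- factor(c("yes", "yes", "no", "yes"))
levels(speaker)                # default alphabetical order: "no" "yes"

# Re-declare the levels so that "yes" comes first
speaker_reordered <- factor(speaker, levels = c("yes", "no"))
levels(speaker_reordered)      # "yes" "no"

# The labels are unchanged; only the underlying integer codes differ
as.integer(speaker)            # 2 2 1 2
as.integer(speaker_reordered)  # 1 1 2 1
```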

Split the data into training and test sets at a 70:30 ratio

set.seed(1234)
train<-sample(nrow(raw),0.7*nrow(raw))
data.train<-raw[train,]
data.test<-raw[-train,]
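The split above samples row indices without regard to class, so the yes/no balance in each part is left to chance. As an alternative (not what this post uses), a stratified split samples 70% within each class. The sketch below is base R on a made-up data frame; `toy`, `label`, and `x` are placeholder names.

```r
set.seed(1234)

# Made-up data with the same 100/100 class balance as readingSkills
toy <- data.frame(
  label = factor(rep(c("yes", "no"), each = 100), levels = c("yes", "no")),
  x     = rnorm(200)
)

# Group the row indices by class, then draw 70% of the rows inside each group
idx_by_class <- split(seq_len(nrow(toy)), toy$label)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, round(0.7 * length(i)))),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

table(toy_train$label)  # 70 rows per class
table(toy_test$label)   # 30 rows per class
```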

Quick look at the training and test data

str(data.train)
## 'data.frame':    140 obs. of  4 variables:
##  $ nativeSpeaker: Ord.factor w/ 2 levels "yes"<"no": 2 1 2 1 2 2 1 1 1 1 ...
##  $ age          : int  8 8 10 7 5 9 6 8 11 7 ...
##  $ shoeSize     : num  28.7 27.7 29.8 26.7 28 ...
##  $ score        : num  38.1 43.8 45.7 39.5 26.2 ...
str(data.test)
## 'data.frame':    60 obs. of  4 variables:
##  $ nativeSpeaker: Ord.factor w/ 2 levels "yes"<"no": 1 1 2 2 2 2 1 1 1 1 ...
##  $ age          : int  5 11 7 6 6 7 6 8 7 7 ...
##  $ shoeSize     : num  24.8 31.9 26.7 26.9 25.2 ...
##  $ score        : num  32.3 55.5 33.9 30 30.4 ...

Build the classification rules from the training data and plot the tree

ctre<-ctree(nativeSpeaker~.,data = data.train)
print(ctre)
## 
##   Conditional inference tree with 8 terminal nodes
## 
## Response:  nativeSpeaker 
## Inputs:  age, shoeSize, score 
## Number of observations:  140 
## 
## 1) score <= 30.86356; criterion = 1, statistic = 26.067
##   2)*  weights = 21 
## 1) score > 30.86356
##   3) score <= 50.84003; criterion = 0.96, statistic = 6.1
##     4) age <= 6; criterion = 1, statistic = 24.668
##       5)*  weights = 17 
##     4) age > 6
##       6) age <= 9; criterion = 0.98, statistic = 7.344
##         7) score <= 43.34602; criterion = 1, statistic = 23.825
##           8) age <= 7; criterion = 0.999, statistic = 12.697
##             9) score <= 34.72458; criterion = 1, statistic = 18.526
##               10)*  weights = 10 
##             9) score > 34.72458
##               11)*  weights = 10 
##           8) age > 7
##             12)*  weights = 24 
##         7) score > 43.34602
##           13)*  weights = 21 
##       6) age > 9
##         14)*  weights = 16 
##   3) score > 50.84003
##     15)*  weights = 21
par(mfrow=c(1,1))
plot(ctre,type="simple")
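To make the printed tree easier to read, the splits can be transcribed by hand into plain if-statements. The function below is only a reading aid: `node_for` is a hypothetical name, the thresholds are copied from the print(ctre) output above, and it returns the terminal node ID rather than a class, because the printout does not show the class predicted in each node. Note that shoeSize never appears in any split.

```r
# Hand transcription of the printed splits; returns the terminal node ID.
node_for <- function(age, score) {
  if (score <= 30.86356) return(2L)    # node 1, left branch
  if (score >  50.84003) return(15L)   # node 3, right branch
  if (age <= 6)          return(5L)    # node 4, left branch
  if (age >  9)          return(14L)   # node 6, right branch
  if (score > 43.34602)  return(13L)   # node 7, right branch
  if (age >  7)          return(12L)   # node 8, right branch
  if (score <= 34.72458) 10L else 11L  # node 9
}

# First row of head(readingSkills): age 5, score 32.29385
node_for(age = 5, score = 32.29385)   # 5
# Fourth row: age 7, score 40.28456
node_for(age = 7, score = 40.28456)   # 11
```

With the fitted model in hand, party can report node membership directly, so this transcription is only for understanding the printed structure.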

Classify the training data with the fitted rules

cpart.prob.train<-predict(ctre,data.train)

Cross-tabulate the actual training responses against the classes assigned by the rules

cpart.perf.train <- table(cpart.prob.train, data.train$nativeSpeaker,
                        dnn=c( "TrainRule", "TrainActual"))
addmargins(cpart.perf.train)
##          TrainActual
## TrainRule yes  no Sum
##       yes  69   0  69
##       no    2  69  71
##       Sum  71  69 140
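The training accuracy can be read off this table as the share of counts on the diagonal. A short base R sketch, with the counts retyped by hand from the output above:

```r
# Training cross-table retyped from the output (rows = rule, columns = actual)
train_tab <- matrix(
  c(69, 2,    # actual "yes": 69 classified yes, 2 classified no
    0, 69),   # actual "no":   0 classified yes, 69 classified no
  nrow = 2,
  dimnames = list(TrainRule = c("yes", "no"), TrainActual = c("yes", "no"))
)

# Correctly classified rows sit on the diagonal
train_accuracy <- sum(diag(train_tab)) / sum(train_tab)
round(train_accuracy, 4)  # 0.9857, i.e. 138 of 140
```

Accuracy measured on the rows the tree was fitted to tends to be optimistic, which is why the test data is scored separately.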

Classify the test data

cpart.prob.test <- predict(ctre, data.test)

Cross-tabulate the actual test responses against the predicted classes

cpart.perf.test <- table(cpart.prob.test, data.test$nativeSpeaker,
                       dnn=c("TestPredicted", "TestActual"))
addmargins(cpart.perf.test)
##              TestActual
## TestPredicted yes no Sum
##           yes  28  2  30
##           no    1 29  30
##           Sum  29 31  60

Build the confusion matrix

confusionMatrix(cpart.perf.test)
## Confusion Matrix and Statistics
## 
##              TestActual
## TestPredicted yes no
##           yes  28  2
##           no    1 29
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.8608, 0.9896)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 1.837e-13       
##                                           
##                   Kappa : 0.9             
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9655          
##             Specificity : 0.9355          
##          Pos Pred Value : 0.9333          
##          Neg Pred Value : 0.9667          
##              Prevalence : 0.4833          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9505          
##                                           
##        'Positive' Class : yes             
## 
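The headline numbers in this report can be reproduced by hand from the 2x2 table, which helps when checking which cell feeds which statistic. The sketch below uses base R only, with the counts retyped from the output above and "yes" as the positive class; the variable names are illustrative.

```r
# Test cross-table retyped from the output (rows = predicted, columns = actual)
tab <- matrix(
  c(28, 1,    # actual "yes": 28 predicted yes, 1 predicted no
    2, 29),   # actual "no":   2 predicted yes, 29 predicted no
  nrow = 2,
  dimnames = list(TestPredicted = c("yes", "no"), TestActual = c("yes", "no"))
)
n <- sum(tab)

accuracy    <- sum(diag(tab)) / n                     # 57 / 60
sensitivity <- tab["yes", "yes"] / sum(tab[, "yes"])  # 28 / 29
specificity <- tab["no", "no"]   / sum(tab[, "no"])   # 29 / 31
ppv         <- tab["yes", "yes"] / sum(tab["yes", ])  # 28 / 30
npv         <- tab["no", "no"]   / sum(tab["no", ])   # 29 / 30

# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(tab) * colSums(tab)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        kappa = cohen_kappa), 4)
```

Each value agrees with the corresponding line of the report to four decimal places: 0.95, 0.9655, 0.9355, 0.9333, 0.9667, and 0.9.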
