Boruta Feature Importance Analysis by Jim Thompson (With R)

https://www.kaggle.com/jimthompson/house-prices-advanced-regression-techniques/boruta-feature-importance-analysis/comments


library(caret)
library(data.table)
library(Boruta)
library(plyr)
library(dplyr)
library(pROC)

ROOT.DIR <- ".."

ID.VAR <- "Id"
TARGET.VAR <- "SalePrice"


This walkthrough shows how to determine which variables are associated with house sale price. The analysis is based on the Boruta package.


Data Preparation for Boruta Analysis

# Load the data
sample.df <- read.csv(file.path(ROOT.DIR,"input/train.csv"),stringsAsFactors = FALSE)


Let's look at the data type of each candidate explanatory variable.


# Extract the candidate variable names
candidate.features <- setdiff(names(sample.df),c(ID.VAR,TARGET.VAR))
# Data type of each variable
data.type <- sapply(candidate.features,function(x){class(sample.df[[x]])})
table(data.type)
## data.type
## character   integer 
##        43        36
print(data.type)
##    MSSubClass      MSZoning   LotFrontage       LotArea        Street 
##     "integer"   "character"     "integer"     "integer"   "character" 
##         Alley      LotShape   LandContour     Utilities     LotConfig 
##   "character"   "character"   "character"   "character"   "character" 
##     LandSlope  Neighborhood    Condition1    Condition2      BldgType 
##   "character"   "character"   "character"   "character"   "character" 
##    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd 
##   "character"     "integer"     "integer"     "integer"     "integer" 
##     RoofStyle      RoofMatl   Exterior1st   Exterior2nd    MasVnrType 
##   "character"   "character"   "character"   "character"   "character" 
##    MasVnrArea     ExterQual     ExterCond    Foundation      BsmtQual 
##     "integer"   "character"   "character"   "character"   "character" 
##      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2 
##   "character"   "character"   "character"     "integer"   "character" 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC 
##     "integer"     "integer"     "integer"   "character"   "character" 
##    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##   "character"   "character"     "integer"     "integer"     "integer" 
##     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
##     "integer"     "integer"     "integer"     "integer"     "integer" 
##  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional 
##     "integer"     "integer"   "character"     "integer"   "character" 
##    Fireplaces   FireplaceQu    GarageType   GarageYrBlt  GarageFinish 
##     "integer"   "character"   "character"     "integer"   "character" 
##    GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive 
##     "integer"     "integer"   "character"   "character"   "character" 
##    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch 
##     "integer"     "integer"     "integer"     "integer"     "integer" 
##      PoolArea        PoolQC         Fence   MiscFeature       MiscVal 
##     "integer"   "character"   "character"   "character"     "integer" 
##        MoSold        YrSold      SaleType SaleCondition 
##     "integer"     "integer"   "character"   "character"
# Determine the data types.
explanatory.attributes <- setdiff(names(sample.df),c(ID.VAR,TARGET.VAR)) # all variables except ID.VAR and TARGET.VAR (the explanatory variables)
data.classes <- sapply(explanatory.attributes,function(x){class(sample.df[[x]])}) # data type of each explanatory variable

# Distinct data types.
unique.classes <- unique(data.classes)

# For each data type in unique.classes, list the names of the variables of that type
attr.data.types <- lapply(unique.classes,function(x){names(data.classes[data.classes==x])})
names(attr.data.types) <- unique.classes
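As a possible shortcut, base R's `split()` can group the variable names by class in a single call. The sketch below uses a small made-up data frame (`toy.df`, with hypothetical values) instead of the real training data, so it can be run on its own.

```r
# Toy data frame standing in for sample.df (hypothetical values).
toy.df <- data.frame(
  LotArea = c(8450L, 9600L, 11250L),
  Street = c("Pave", "Pave", "Grvl"),
  OverallQual = c(7L, 6L, 7L),
  stringsAsFactors = FALSE
)

# Named character vector: variable name -> class.
toy.classes <- sapply(names(toy.df), function(x) class(toy.df[[x]]))

# split() groups the variable names by their class.
toy.types <- split(names(toy.classes), toy.classes)
print(toy.types)
```

The list elements come back in alphabetical order of the class names, which may differ from the order produced by `unique()`, but lookups by name such as `toy.types$integer` work the same way.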

Prepare the dataset for the Boruta analysis. For this analysis, missing values are handled as follows: missing numeric values are set to -1, and missing character values are set to *MISSING*.


Character variables are treated as categorical variables (factors). Treating character data this way has one limitation: it assumes that the factor levels have a relative ordering.
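To illustrate what an ordering assumption on factor levels can mean, the sketch below encodes a few quality codes as integers. The vector `quality` and the level order `Fa < TA < Gd < Ex` are assumptions made for this illustration; they are not taken from the analysis itself.

```r
# Hypothetical quality codes, similar to those in a column such as ExterQual.
quality <- c("Ex", "Gd", "TA", "Fa", "Gd")

# By default, factor() sorts the levels alphabetically, not by meaning.
quality.factor <- factor(quality)
print(levels(quality.factor))      # "Ex" "Fa" "Gd" "TA"
print(as.integer(quality.factor))  # 1 3 4 2 3

# Stating the intended order explicitly (assumed: Fa < TA < Gd < Ex).
quality.ordered <- factor(quality,
                          levels = c("Fa", "TA", "Gd", "Ex"),
                          ordered = TRUE)
print(as.integer(quality.ordered)) # 4 3 2 1 3
```

With the default alphabetical levels, "Ex" receives the lowest code even though it denotes the best quality, so any method that reads the codes as an order is working with an arbitrary one.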


# Response variable
response <- sample.df$SalePrice

# Keep only the explanatory variables (drop the ID and response variables).
sample.df <- sample.df[candidate.features]

# Set missing numeric values to -1 so the random forest can run.
for (x in attr.data.types$integer){
  sample.df[[x]][is.na(sample.df[[x]])] <- -1
}

# Set missing character values to "*MISSING*" so the random forest can run.
for (x in attr.data.types$character){
  sample.df[[x]][is.na(sample.df[[x]])] <- "*MISSING*"
}
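A quick way to sanity-check this kind of imputation is to count the missing values before and confirm that none remain afterwards. The sketch below does so on a made-up data frame (`toy.df` and its values are hypothetical), applying the same rule of -1 for numeric columns and "*MISSING*" for character columns.

```r
# Made-up data frame with missing values in both column types.
toy.df <- data.frame(
  LotFrontage = c(65L, NA, 80L),
  Alley = c(NA, "Grvl", NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column before imputation.
na.before <- colSums(is.na(toy.df))
print(na.before)

# Same rule as in the analysis: -1 for numeric, "*MISSING*" for character.
for (x in names(toy.df)) {
  fill <- if (is.character(toy.df[[x]])) "*MISSING*" else -1
  toy.df[[x]][is.na(toy.df[[x]])] <- fill
}

# No NA should remain after imputation.
stopifnot(!anyNA(toy.df))
print(toy.df)
```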

Run Boruta Analysis

set.seed(13)
bor.results <- Boruta(sample.df,response,
                   maxRuns=101,
                   doTrace=0)

Boruta results

## 
## Summary of Boruta run:
## Boruta performed 100 iterations in 1.375794 mins.
##  50 attributes confirmed important: BedroomAbvGr, BldgType,
## BsmtCond, BsmtFinSF1, BsmtFinType1 and 45 more.
##  20 attributes confirmed unimportant: BsmtFinSF2, BsmtHalfBath,
## Condition2, ExterCond, Heating and 15 more.
##  9 tentative attributes left: Alley, BsmtExposure, Condition1,
## Electrical, EnclosedPorch and 4 more.

These attributes were deemed relevant to predicting house sale price.

## 
## 
## Relevant Attributes:
##  [1] "MSSubClass"   "MSZoning"     "LotArea"      "LandContour" 
##  [5] "Neighborhood" "BldgType"     "HouseStyle"   "OverallQual" 
##  [9] "OverallCond"  "YearBuilt"    "YearRemodAdd" "Exterior1st" 
## [13] "Exterior2nd"  "MasVnrType"   "MasVnrArea"   "ExterQual"   
## [17] "Foundation"   "BsmtQual"     "BsmtCond"     "BsmtFinType1"
## [21] "BsmtFinSF1"   "BsmtFinType2" "BsmtUnfSF"    "TotalBsmtSF" 
## [25] "HeatingQC"    "CentralAir"   "X1stFlrSF"    "X2ndFlrSF"   
## [29] "GrLivArea"    "BsmtFullBath" "FullBath"     "HalfBath"    
## [33] "BedroomAbvGr" "KitchenAbvGr" "KitchenQual"  "TotRmsAbvGrd"
## [37] "Functional"   "Fireplaces"   "FireplaceQu"  "GarageType"  
## [41] "GarageYrBlt"  "GarageFinish" "GarageCars"   "GarageArea"  
## [45] "GarageQual"   "GarageCond"   "PavedDrive"   "WoodDeckSF"  
## [49] "OpenPorchSF"  "Fence"
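A list like the one above can be pulled from a Boruta result with `getSelectedAttributes()`, and attributes still marked tentative can be forced to a decision with `TentativeRoughFix()`. The sketch below is only an illustration on synthetic data: `toy.df`, `toy.response`, and the column names are hypothetical, and it is not necessarily how the original kernel produced its output.

```r
library(Boruta)

# Synthetic data (hypothetical): one informative column and two noise columns.
set.seed(13)
n <- 200
toy.df <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy.response <- 3 * toy.df$signal + rnorm(n, sd = 0.5)

toy.results <- Boruta(toy.df, toy.response, maxRuns = 50, doTrace = 0)

# Attributes confirmed as important.
# withTentative = TRUE would also include the tentative ones.
confirmed <- getSelectedAttributes(toy.results, withTentative = FALSE)
print(confirmed)

# Force a decision on any remaining tentative attributes.
# If none are tentative, Boruta warns and returns the object unchanged.
toy.fixed <- suppressWarnings(TentativeRoughFix(toy.results))
print(table(toy.fixed$finalDecision))
```

`TentativeRoughFix()` uses a simpler test than the main procedure, so its decisions are weaker than the ones Boruta reaches during the run itself.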

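The following graph shows the relative importance of each candidate explanatory variable. The x-axis shows the name of each candidate variable. Green marks variables that are relevant to the prediction, and red marks variables that are not. Yellow marks variables that may or may not be relevant to predicting the response variable.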
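A plot of this kind can be drawn with the `plot()` method that the Boruta package provides for its result objects. The sketch below runs on synthetic data (`toy.df`, `toy.response`, and the output file name are hypothetical) and writes the figure to a PDF; the graphical settings are one reasonable choice, not the original kernel's.

```r
library(Boruta)

# Synthetic data (hypothetical): one informative column and two noise columns.
set.seed(13)
n <- 200
toy.df <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy.response <- 3 * toy.df$signal + rnorm(n, sd = 0.5)

toy.results <- Boruta(toy.df, toy.response, maxRuns = 50, doTrace = 0)

# Draw the importance boxplots into a PDF file.
# las = 2 rotates the axis labels so long attribute names stay readable.
plot.file <- file.path(tempdir(), "boruta_importance.pdf")
pdf(plot.file, width = 10, height = 6)
plot(toy.results, las = 2, cex.axis = 0.7, xlab = "")
dev.off()
```

By default the plot also shows blue boxes for the shadow attributes that Boruta uses as its reference.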

Detailed results for each explanatory variable.

## 
## 
## Attribute Importance Details:
##             attr     meanImp   medianImp       minImp     maxImp normHits  decision
## 1      GrLivArea 20.80162258 20.78502805 18.535841023 23.5716257     1.00 Confirmed
## 2    OverallQual 17.20747969 17.26901173 14.735546931 18.7973371     1.00 Confirmed
## 3      X2ndFlrSF 13.97016925 14.02202180 10.257385951 16.5308220     1.00 Confirmed
## 4    TotalBsmtSF 14.03116430 13.98628589 10.614392984 16.4311992     1.00 Confirmed
## 5      X1stFlrSF 13.53842250 13.59501091 10.887199720 15.3962367     1.00 Confirmed
## 6     GarageCars 12.99096564 13.10911536 10.005608253 14.8684197     1.00 Confirmed
## 7      YearBuilt 12.77227303 12.96056795  7.156390354 14.3606464     1.00 Confirmed
## 8     GarageArea 12.87362160 12.94534460  9.580747522 14.3714374     1.00 Confirmed
## 9      ExterQual 11.86835387 11.85978444  9.401588413 13.1237173     1.00 Confirmed
## 10  YearRemodAdd 11.09346736 11.22757800  8.721053537 12.7273884     1.00 Confirmed
## 11   FireplaceQu 10.87762424 10.93432931  5.498479708 13.0706797     1.00 Confirmed
## 12   GarageYrBlt 10.67005514 10.82840809  8.501312236 12.1919028     1.00 Confirmed
## 13      FullBath 10.47163205 10.51677018  8.732885348 11.8151891     1.00 Confirmed
## 14    MSSubClass  9.96401126  9.98159856  7.613579283 12.2973335     1.00 Confirmed
## 15       LotArea  9.79296636  9.89830848  6.943826388 12.5682534     1.00 Confirmed
## 16    Fireplaces  9.66905933  9.84059387  5.554397962 11.6734722     1.00 Confirmed
## 17   KitchenQual  9.77537742  9.69474211  7.989157468 11.6201471     1.00 Confirmed
## 18      MSZoning  9.07975279  9.07074946  6.434779296 12.8499436     1.00 Confirmed
## 19    GarageType  8.89128148  8.88041430  6.364256309 11.2319383     1.00 Confirmed
## 20    BsmtFinSF1  8.84836324  8.83661964  5.059794077 12.2808132     1.00 Confirmed
## 21  Neighborhood  8.57710266  8.51393920  7.188309021  9.9319210     1.00 Confirmed
## 22      BsmtQual  8.23958487  8.36102702  5.663597143 10.1943948     1.00 Confirmed
## 23  TotRmsAbvGrd  8.29632972  8.20660262  5.880224404 10.7969525     1.00 Confirmed
## 24      HalfBath  7.60841029  7.66129333  5.608800395  9.4860838     1.00 Confirmed
## 25      BldgType  7.64851612  7.63473450  5.604601563  9.7300625     1.00 Confirmed
## 26  GarageFinish  7.37692819  7.48666734  5.001123540  9.6642313     1.00 Confirmed
## 27    Foundation  7.04380599  7.07941011  3.939492747  8.3260506     1.00 Confirmed
## 28  BedroomAbvGr  6.97855776  6.92797733  5.180428532  9.2605613     1.00 Confirmed
## 29    HouseStyle  6.47562365  6.52982412  3.809114311  8.8377330     1.00 Confirmed
## 30    CentralAir  6.32975959  6.41964096  4.194141479  8.3902280     1.00 Confirmed
## 31   OpenPorchSF  6.30199504  6.28881903  3.942399414  8.8216412     1.00 Confirmed
## 32     HeatingQC  6.15728069  6.11610877  4.608494036  7.7905369     1.00 Confirmed
## 33  BsmtFinType1  5.85144584  5.92653496  4.122724825  8.1796761     1.00 Confirmed
## 34     BsmtUnfSF  5.72337504  5.77767513  2.475702180  7.6975185     0.99 Confirmed
## 35    GarageCond  5.65861401  5.68221518  3.893953120  7.3790515     1.00 Confirmed
## 36    GarageQual  5.32749624  5.47989904  2.204987026  7.5430524     0.98 Confirmed
## 37  KitchenAbvGr  5.20143995  5.20928831  3.115232120  6.4785585     1.00 Confirmed
## 38   OverallCond  5.31829536  5.13141579  3.055097039  8.9185568     1.00 Confirmed
## 39      BsmtCond  4.82229965  4.87416889  2.918196299  7.9059434     1.00 Confirmed
## 40    MasVnrArea  4.93836457  4.85991807  2.408535137  8.1105552     0.98 Confirmed
## 41  BsmtFullBath  4.41994585  4.30670156  2.401730987  7.4791359     0.96 Confirmed
## 42   Exterior1st  4.23179837  4.20688483  2.413359612  6.1084226     0.96 Confirmed
## 43   Exterior2nd  3.97397863  4.07693335  0.914055339  6.2613844     0.89 Confirmed
## 44    PavedDrive  3.97533101  3.96847203  1.957404499  5.6070983     0.96 Confirmed
## 45    WoodDeckSF  3.74508340  3.78245883  1.512694778  6.1396681     0.86 Confirmed
## 46   LandContour  3.65162656  3.64218746  1.018541052  5.1008705     0.87 Confirmed
## 47    MasVnrType  2.95553280  3.05884348  0.346963203  5.0935353     0.70 Confirmed
## 48  BsmtFinType2  3.06100746  3.03632436  0.568929418  4.5369251     0.75 Confirmed
## 49    Functional  2.97692887  3.01410578  1.007001090  4.9277739     0.72 Confirmed
## 50         Fence  2.87437847  2.84348778  1.237296721  3.9823772     0.68 Confirmed
## 51 SaleCondition  2.65025391  2.70646118  0.951359668  4.4601754     0.52 Tentative
## 52         Alley  2.67035733  2.70472574  1.268680305  4.3114284     0.58 Tentative
## 53      LotShape  2.63112778  2.67582920  0.699589380  4.8668368     0.54 Tentative
## 54    Electrical  2.63078879  2.65208364  1.020791484  3.8209051     0.57 Tentative
## 55     RoofStyle  2.47017605  2.61479890 -0.109630991  5.1061453     0.54 Tentative
## 56    Condition1  2.29266882  2.40508855  0.039632740  3.7447582     0.42 Tentative
## 57  BsmtExposure  2.49802250  2.38929143 -0.873824709  4.6534540     0.48 Tentative
## 58 EnclosedPorch  2.24471184  2.35625698 -0.311363698  3.9287300     0.36 Tentative
## 59     LandSlope  2.21914061  2.12040886  0.389757474  4.5015780     0.37 Tentative
## 60   ScreenPorch  1.90948552  1.89448142 -1.010748305  4.3130872     0.24  Rejected
## 61     ExterCond  1.58266622  1.68788921 -0.862448807  2.9008638     0.05  Rejected
## 62    BsmtFinSF2  1.64351802  1.65987702  0.008169336  3.2694426     0.06  Rejected
## 63  BsmtHalfBath  1.19509038  1.10871066  0.321781144  2.3977840     0.00  Rejected
## 64      SaleType  0.91404726  1.10231766 -0.436758166  2.3716369     0.01  Rejected
## 65       Heating  0.68977287  0.89593061 -1.731472554  2.5739110     0.00  Rejected
## 66   LotFrontage  0.90911718  0.85949362 -1.295566198  3.7324623     0.03  Rejected
## 67      RoofMatl  0.93948085  0.76615636 -0.658065658  2.6659225     0.01  Rejected
## 68   MiscFeature  0.45040187  0.68914609 -1.447059177  1.7249220     0.00  Rejected
## 69       MiscVal  0.04688162  0.16037794 -1.239842213  1.3587293     0.00  Rejected
## 70        YrSold  0.21921826  0.15677094 -1.207612208  3.4774269     0.01  Rejected
## 71     Utilities  0.00000000  0.00000000  0.000000000  0.0000000     0.00  Rejected
## 72        Street -0.13766475 -0.02067032 -1.460759956  1.7349818     0.00  Rejected
## 73    X3SsnPorch -0.16913321 -0.13689920 -1.375120096  1.4265937     0.00  Rejected
## 74     LotConfig -0.01457229 -0.48393604 -1.399303940  1.6503859     0.00  Rejected
## 75    Condition2 -0.54443089 -0.79624125 -1.486676281  1.0290424     0.00  Rejected
## 76        MoSold -0.69949282 -0.83162718 -1.964452383  0.5779010     0.00  Rejected
## 77  LowQualFinSF -0.68098320 -0.92709289 -2.256080083  0.6391406     0.00  Rejected
## 78      PoolArea -0.75349519 -1.03299146 -2.815935064  1.2482400     0.00  Rejected
## 79        PoolQC -1.34541507 -1.40561700 -2.565317848  0.1861122     0.00  Rejected
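A table with this layout can be built from `attStats()`, which returns one row of importance statistics per attribute. The sketch below uses synthetic data (`toy.df`, `toy.response`, and `toy.stats` are hypothetical names) and sorts by `medianImp` in descending order, matching the ordering of the table above; the original kernel may have built its table differently.

```r
library(Boruta)

# Synthetic data (hypothetical): one informative column and two noise columns.
set.seed(13)
n <- 200
toy.df <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy.response <- 3 * toy.df$signal + rnorm(n, sd = 0.5)

toy.results <- Boruta(toy.df, toy.response, maxRuns = 50, doTrace = 0)

# attStats() keeps the attribute names as row names; move them into a column.
toy.stats <- attStats(toy.results)
toy.stats <- cbind(attr = rownames(toy.stats), toy.stats)

# Sort by median importance, highest first, and renumber the rows.
toy.stats <- toy.stats[order(toy.stats$medianImp, decreasing = TRUE), ]
rownames(toy.stats) <- NULL
print(toy.stats)
```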


Reference

https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/

https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf

https://www.r-bloggers.com/feature-selection-all-relevant-selection-with-the-boruta-package/