##########################################################################
This post is a translation of one of the top kernels from the Kaggle House Prices competition.
There appears to be no copyright issue, but the post will be taken down immediately if one arises.
##########################################################################
Boruta Feature Importance Analysis
Jim Thompson
2016-09-01
library(caret)
library(data.table)
library(Boruta)
library(plyr)
library(dplyr)
library(pROC)
ROOT.DIR <- ".."
ID.VAR <- "Id"
TARGET.VAR <- "SalePrice"
This kernel walks through the process of determining which variables are associated with house sale price. The analysis is based on the Boruta package.
Data Preparation for Boruta Analysis
# Load the data
sample.df <- read.csv(file.path(ROOT.DIR,"input/train.csv"),stringsAsFactors = FALSE)
Let's look at the data type of each candidate explanatory variable.
# First, extract only the candidate feature names
candidate.features <- setdiff(names(sample.df),c(ID.VAR,TARGET.VAR))
# Data type of each feature
data.type <- sapply(candidate.features,function(x){class(sample.df[[x]])})
table(data.type)
## data.type
## character integer
## 43 36
print(data.type)
## MSSubClass MSZoning LotFrontage LotArea Street
## "integer" "character" "integer" "integer" "character"
## Alley LotShape LandContour Utilities LotConfig
## "character" "character" "character" "character" "character"
## LandSlope Neighborhood Condition1 Condition2 BldgType
## "character" "character" "character" "character" "character"
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## "character" "integer" "integer" "integer" "integer"
## RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## "character" "character" "character" "character" "character"
## MasVnrArea ExterQual ExterCond Foundation BsmtQual
## "integer" "character" "character" "character" "character"
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## "character" "character" "character" "integer" "character"
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## "integer" "integer" "integer" "character" "character"
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## "character" "character" "integer" "integer" "integer"
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## "integer" "integer" "integer" "integer" "integer"
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## "integer" "integer" "character" "integer" "character"
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## "integer" "character" "character" "integer" "character"
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## "integer" "integer" "character" "character" "character"
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## "integer" "integer" "integer" "integer" "integer"
## PoolArea PoolQC Fence MiscFeature MiscVal
## "integer" "character" "character" "character" "integer"
## MoSold YrSold SaleType SaleCondition
## "integer" "integer" "character" "character"
# Inspect the data types.
# Variables other than ID.VAR and TARGET.VAR (the explanatory variables)
explanatory.attributes <- setdiff(names(sample.df),c(ID.VAR,TARGET.VAR))
# Data type of each explanatory variable
data.classes <- sapply(explanatory.attributes,function(x){class(sample.df[[x]])})
# Distinct data types.
unique.classes <- unique(data.classes)
# For each data type in unique.classes, list the names of the variables of that type
attr.data.types <- lapply(unique.classes,function(x){names(data.classes[data.classes==x])})
names(attr.data.types) <- unique.classes
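To make the grouping step concrete, here is a minimal sketch on a three-column toy data frame. The data frame `toy.df` and the `toy.*` names are hypothetical stand-ins for `sample.df` and are not part of the original kernel.

```r
# Toy data frame standing in for the training data (hypothetical values).
toy.df <- data.frame(LotArea = c(8450L, 9600L, 11250L),
                     Street = c("Pave", "Pave", "Grvl"),
                     YrSold = c(2008L, 2007L, 2008L),
                     stringsAsFactors = FALSE)

# Class of each column, as a named character vector.
toy.classes <- sapply(names(toy.df), function(x) class(toy.df[[x]]))

# Group the column names by class, the same way attr.data.types is built.
toy.unique <- unique(toy.classes)
toy.types <- lapply(toy.unique, function(x) names(toy.classes[toy.classes == x]))
names(toy.types) <- toy.unique

print(toy.types)
```

Columns are grouped by their exact class, so a column that R reads as "numeric" rather than "integer" would end up in its own group. In this data set every non-character column is an integer, so the two groups cover all candidate features.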
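Prepare the data set for the Boruta analysis. For this analysis, missing values are handled as follows: missing numeric values are set to -1, and missing character values are set to "*MISSING*".
Character variables are treated as categorical variables (factors). Treating character data as factors this way has one limitation: it implicitly assumes that the factor levels have a relative order.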
# Response variable
response <- sample.df$SalePrice
# Keep only the explanatory variables (everything except the id and response variables).
sample.df <- sample.df[candidate.features]
# Set missing numeric values to -1 so that the random forest can run.
for (x in attr.data.types$integer){
  sample.df[[x]][is.na(sample.df[[x]])] <- -1
}
# Set missing character values to "*MISSING*" so that the random forest can run.
for (x in attr.data.types$character){
  sample.df[[x]][is.na(sample.df[[x]])] <- "*MISSING*"
}
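The effect of the imputation can be checked on a small example. The sketch below uses a hypothetical two-column data frame. It also adds an explicit `factor()` conversion, which is an assumption on my part: the text says character variables are treated as factors, but the code shown never converts them itself.

```r
# Toy data with missing values in both column types (hypothetical values).
toy.df <- data.frame(LotFrontage = c(65L, NA, 80L),
                     Alley = c(NA, "Grvl", NA),
                     stringsAsFactors = FALSE)

# Select columns by predicate so integer and double columns are both covered.
num.cols <- names(toy.df)[sapply(toy.df, is.numeric)]
chr.cols <- names(toy.df)[sapply(toy.df, is.character)]

# Missing numeric values become -1.
for (x in num.cols) {
  toy.df[[x]][is.na(toy.df[[x]])] <- -1
}

# Missing character values become "*MISSING*", then the column becomes a factor.
# The factor() step is an assumption, not part of the code shown above.
for (x in chr.cols) {
  toy.df[[x]][is.na(toy.df[[x]])] <- "*MISSING*"
  toy.df[[x]] <- factor(toy.df[[x]])
}

# No missing values should remain after imputation.
stopifnot(!anyNA(toy.df))
str(toy.df)
```

Using -1 as the numeric placeholder works here because none of these attributes is naturally negative, so the placeholder is distinguishable from real values.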
Run Boruta Analysis
set.seed(13)
bor.results <- Boruta(sample.df,response,
                      maxRuns=101,
                      doTrace=0)
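Boruta judges each attribute by comparing it with shuffled copies of the attributes, called shadow attributes, over repeated random forest runs. The sketch below illustrates only that comparison idea in base R. It is not the package's implementation: absolute correlation is swapped in for random forest importance, and all names (`toy.x`, `toy.y`, `imp`, `hits`) are hypothetical.

```r
set.seed(13)
n <- 200

# Hypothetical toy data: x1 drives the response, x2 is pure noise.
toy.x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy.y <- 3 * toy.x$x1 + rnorm(n)

# Stand-in importance measure: absolute correlation with the response.
# (Boruta itself uses random forest importance, not correlation.)
imp <- function(df, y) sapply(df, function(col) abs(cor(col, y)))

n.runs <- 50
hits <- c(x1 = 0, x2 = 0)
for (i in seq_len(n.runs)) {
  # Shadow attributes: each column shuffled, which breaks any link to y.
  shadows <- as.data.frame(lapply(toy.x, sample))
  names(shadows) <- paste0("shadow.", names(toy.x))

  scores <- imp(cbind(toy.x, shadows), toy.y)
  best.shadow <- max(scores[names(shadows)])

  # A real attribute scores a hit when it beats the best shadow in this run.
  hits <- hits + (scores[names(toy.x)] > best.shadow)
}

# Fraction of runs in which each attribute beat the best shadow.
print(hits / n.runs)
```

An attribute that beats the best shadow in nearly every run is a candidate for confirmation, while one that rarely does is a candidate for rejection. The package turns the hit counts into decisions with a statistical test, which this sketch leaves out.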
Boruta results
##
## Summary of Boruta run:
## Boruta performed 100 iterations in 1.375794 mins.
## 50 attributes confirmed important: BedroomAbvGr, BldgType,
## BsmtCond, BsmtFinSF1, BsmtFinType1 and 45 more.
## 20 attributes confirmed unimportant: BsmtFinSF2, BsmtHalfBath,
## Condition2, ExterCond, Heating and 15 more.
## 9 tentative attributes left: Alley, BsmtExposure, Condition1,
## Electrical, EnclosedPorch and 4 more.
These attributes were deemed relevant to predicting house sale price.
##
##
## Relevant Attributes:
## [1] "MSSubClass" "MSZoning" "LotArea" "LandContour"
## [5] "Neighborhood" "BldgType" "HouseStyle" "OverallQual"
## [9] "OverallCond" "YearBuilt" "YearRemodAdd" "Exterior1st"
## [13] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [17] "Foundation" "BsmtQual" "BsmtCond" "BsmtFinType1"
## [21] "BsmtFinSF1" "BsmtFinType2" "BsmtUnfSF" "TotalBsmtSF"
## [25] "HeatingQC" "CentralAir" "X1stFlrSF" "X2ndFlrSF"
## [29] "GrLivArea" "BsmtFullBath" "FullBath" "HalfBath"
## [33] "BedroomAbvGr" "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd"
## [37] "Functional" "Fireplaces" "FireplaceQu" "GarageType"
## [41] "GarageYrBlt" "GarageFinish" "GarageCars" "GarageArea"
## [45] "GarageQual" "GarageCond" "PavedDrive" "WoodDeckSF"
## [49] "OpenPorchSF" "Fence"
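The post does not show the code that produced this list. One way to obtain such a list is the Boruta package's `getSelectedAttributes()` function. The sketch below runs it on hypothetical toy data (`toy.x`, `toy.y`, `toy.bor`) rather than on the housing data, and it assumes the Boruta package and its random forest backend are installed.

```r
library(Boruta)

set.seed(13)
n <- 300

# Hypothetical toy data: the response depends on `signal` only.
toy.x <- data.frame(signal = rnorm(n),
                    noise1 = rnorm(n),
                    noise2 = rnorm(n))
toy.y <- 5 * toy.x$signal + rnorm(n)

toy.bor <- Boruta(toy.x, toy.y, maxRuns = 101, doTrace = 0)

# Attributes confirmed as important.
print(getSelectedAttributes(toy.bor))

# Confirmed attributes plus any that are still tentative.
print(getSelectedAttributes(toy.bor, withTentative = TRUE))
```

Whether to carry tentative attributes into a model is a judgment call. Keeping them is the more conservative choice when the goal is not to lose any relevant variable.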
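The following plot shows the relative importance of each candidate explanatory variable. The x-axis shows the name of each candidate explanatory variable. Green marks variables that are relevant to the prediction, and red marks variables that are not. Yellow marks variables that may or may not be relevant to predicting the response variable.
Detailed results for each explanatory variable: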
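The plotting call is not included in the post. A plot of this kind can be drawn with the `plot()` method for Boruta results. The sketch below uses hypothetical toy data (`toy.x`, `toy.y`, `toy.bor`) and assumes the Boruta package is installed. By default the method also draws blue boxes for the shadow attributes, next to the green, yellow, and red ones.

```r
library(Boruta)

set.seed(13)
n <- 300

# Hypothetical toy data: one informative attribute and two noise attributes.
toy.x <- data.frame(signal = rnorm(n),
                    noise1 = rnorm(n),
                    noise2 = rnorm(n))
toy.y <- 5 * toy.x$signal + rnorm(n)

toy.bor <- Boruta(toy.x, toy.y, maxRuns = 101, doTrace = 0)

# las = 2 rotates the x-axis labels so long attribute names stay readable;
# cex.axis shrinks them, which helps when there are many attributes.
plot(toy.bor, las = 2, xlab = "", cex.axis = 0.7)
```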
##
##
## Attribute Importance Details:
## attr meanImp medianImp minImp maxImp normHits decision
## 1 GrLivArea 20.80162258 20.78502805 18.535841023 23.5716257 1.00 Confirmed
## 2 OverallQual 17.20747969 17.26901173 14.735546931 18.7973371 1.00 Confirmed
## 3 X2ndFlrSF 13.97016925 14.02202180 10.257385951 16.5308220 1.00 Confirmed
## 4 TotalBsmtSF 14.03116430 13.98628589 10.614392984 16.4311992 1.00 Confirmed
## 5 X1stFlrSF 13.53842250 13.59501091 10.887199720 15.3962367 1.00 Confirmed
## 6 GarageCars 12.99096564 13.10911536 10.005608253 14.8684197 1.00 Confirmed
## 7 YearBuilt 12.77227303 12.96056795 7.156390354 14.3606464 1.00 Confirmed
## 8 GarageArea 12.87362160 12.94534460 9.580747522 14.3714374 1.00 Confirmed
## 9 ExterQual 11.86835387 11.85978444 9.401588413 13.1237173 1.00 Confirmed
## 10 YearRemodAdd 11.09346736 11.22757800 8.721053537 12.7273884 1.00 Confirmed
## 11 FireplaceQu 10.87762424 10.93432931 5.498479708 13.0706797 1.00 Confirmed
## 12 GarageYrBlt 10.67005514 10.82840809 8.501312236 12.1919028 1.00 Confirmed
## 13 FullBath 10.47163205 10.51677018 8.732885348 11.8151891 1.00 Confirmed
## 14 MSSubClass 9.96401126 9.98159856 7.613579283 12.2973335 1.00 Confirmed
## 15 LotArea 9.79296636 9.89830848 6.943826388 12.5682534 1.00 Confirmed
## 16 Fireplaces 9.66905933 9.84059387 5.554397962 11.6734722 1.00 Confirmed
## 17 KitchenQual 9.77537742 9.69474211 7.989157468 11.6201471 1.00 Confirmed
## 18 MSZoning 9.07975279 9.07074946 6.434779296 12.8499436 1.00 Confirmed
## 19 GarageType 8.89128148 8.88041430 6.364256309 11.2319383 1.00 Confirmed
## 20 BsmtFinSF1 8.84836324 8.83661964 5.059794077 12.2808132 1.00 Confirmed
## 21 Neighborhood 8.57710266 8.51393920 7.188309021 9.9319210 1.00 Confirmed
## 22 BsmtQual 8.23958487 8.36102702 5.663597143 10.1943948 1.00 Confirmed
## 23 TotRmsAbvGrd 8.29632972 8.20660262 5.880224404 10.7969525 1.00 Confirmed
## 24 HalfBath 7.60841029 7.66129333 5.608800395 9.4860838 1.00 Confirmed
## 25 BldgType 7.64851612 7.63473450 5.604601563 9.7300625 1.00 Confirmed
## 26 GarageFinish 7.37692819 7.48666734 5.001123540 9.6642313 1.00 Confirmed
## 27 Foundation 7.04380599 7.07941011 3.939492747 8.3260506 1.00 Confirmed
## 28 BedroomAbvGr 6.97855776 6.92797733 5.180428532 9.2605613 1.00 Confirmed
## 29 HouseStyle 6.47562365 6.52982412 3.809114311 8.8377330 1.00 Confirmed
## 30 CentralAir 6.32975959 6.41964096 4.194141479 8.3902280 1.00 Confirmed
## 31 OpenPorchSF 6.30199504 6.28881903 3.942399414 8.8216412 1.00 Confirmed
## 32 HeatingQC 6.15728069 6.11610877 4.608494036 7.7905369 1.00 Confirmed
## 33 BsmtFinType1 5.85144584 5.92653496 4.122724825 8.1796761 1.00 Confirmed
## 34 BsmtUnfSF 5.72337504 5.77767513 2.475702180 7.6975185 0.99 Confirmed
## 35 GarageCond 5.65861401 5.68221518 3.893953120 7.3790515 1.00 Confirmed
## 36 GarageQual 5.32749624 5.47989904 2.204987026 7.5430524 0.98 Confirmed
## 37 KitchenAbvGr 5.20143995 5.20928831 3.115232120 6.4785585 1.00 Confirmed
## 38 OverallCond 5.31829536 5.13141579 3.055097039 8.9185568 1.00 Confirmed
## 39 BsmtCond 4.82229965 4.87416889 2.918196299 7.9059434 1.00 Confirmed
## 40 MasVnrArea 4.93836457 4.85991807 2.408535137 8.1105552 0.98 Confirmed
## 41 BsmtFullBath 4.41994585 4.30670156 2.401730987 7.4791359 0.96 Confirmed
## 42 Exterior1st 4.23179837 4.20688483 2.413359612 6.1084226 0.96 Confirmed
## 43 Exterior2nd 3.97397863 4.07693335 0.914055339 6.2613844 0.89 Confirmed
## 44 PavedDrive 3.97533101 3.96847203 1.957404499 5.6070983 0.96 Confirmed
## 45 WoodDeckSF 3.74508340 3.78245883 1.512694778 6.1396681 0.86 Confirmed
## 46 LandContour 3.65162656 3.64218746 1.018541052 5.1008705 0.87 Confirmed
## 47 MasVnrType 2.95553280 3.05884348 0.346963203 5.0935353 0.70 Confirmed
## 48 BsmtFinType2 3.06100746 3.03632436 0.568929418 4.5369251 0.75 Confirmed
## 49 Functional 2.97692887 3.01410578 1.007001090 4.9277739 0.72 Confirmed
## 50 Fence 2.87437847 2.84348778 1.237296721 3.9823772 0.68 Confirmed
## 51 SaleCondition 2.65025391 2.70646118 0.951359668 4.4601754 0.52 Tentative
## 52 Alley 2.67035733 2.70472574 1.268680305 4.3114284 0.58 Tentative
## 53 LotShape 2.63112778 2.67582920 0.699589380 4.8668368 0.54 Tentative
## 54 Electrical 2.63078879 2.65208364 1.020791484 3.8209051 0.57 Tentative
## 55 RoofStyle 2.47017605 2.61479890 -0.109630991 5.1061453 0.54 Tentative
## 56 Condition1 2.29266882 2.40508855 0.039632740 3.7447582 0.42 Tentative
## 57 BsmtExposure 2.49802250 2.38929143 -0.873824709 4.6534540 0.48 Tentative
## 58 EnclosedPorch 2.24471184 2.35625698 -0.311363698 3.9287300 0.36 Tentative
## 59 LandSlope 2.21914061 2.12040886 0.389757474 4.5015780 0.37 Tentative
## 60 ScreenPorch 1.90948552 1.89448142 -1.010748305 4.3130872 0.24 Rejected
## 61 ExterCond 1.58266622 1.68788921 -0.862448807 2.9008638 0.05 Rejected
## 62 BsmtFinSF2 1.64351802 1.65987702 0.008169336 3.2694426 0.06 Rejected
## 63 BsmtHalfBath 1.19509038 1.10871066 0.321781144 2.3977840 0.00 Rejected
## 64 SaleType 0.91404726 1.10231766 -0.436758166 2.3716369 0.01 Rejected
## 65 Heating 0.68977287 0.89593061 -1.731472554 2.5739110 0.00 Rejected
## 66 LotFrontage 0.90911718 0.85949362 -1.295566198 3.7324623 0.03 Rejected
## 67 RoofMatl 0.93948085 0.76615636 -0.658065658 2.6659225 0.01 Rejected
## 68 MiscFeature 0.45040187 0.68914609 -1.447059177 1.7249220 0.00 Rejected
## 69 MiscVal 0.04688162 0.16037794 -1.239842213 1.3587293 0.00 Rejected
## 70 YrSold 0.21921826 0.15677094 -1.207612208 3.4774269 0.01 Rejected
## 71 Utilities 0.00000000 0.00000000 0.000000000 0.0000000 0.00 Rejected
## 72 Street -0.13766475 -0.02067032 -1.460759956 1.7349818 0.00 Rejected
## 73 X3SsnPorch -0.16913321 -0.13689920 -1.375120096 1.4265937 0.00 Rejected
## 74 LotConfig -0.01457229 -0.48393604 -1.399303940 1.6503859 0.00 Rejected
## 75 Condition2 -0.54443089 -0.79624125 -1.486676281 1.0290424 0.00 Rejected
## 76 MoSold -0.69949282 -0.83162718 -1.964452383 0.5779010 0.00 Rejected
## 77 LowQualFinSF -0.68098320 -0.92709289 -2.256080083 0.6391406 0.00 Rejected
## 78 PoolArea -0.75349519 -1.03299146 -2.815935064 1.2482400 0.00 Rejected
## 79 PoolQC -1.34541507 -1.40561700 -2.565317848 0.1861122 0.00 Rejected
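The columns of this table can be read as per-attribute summaries over the random forest runs. As the Boruta documentation describes it, `normHits` is the number of hits normalised by the number of runs. The base R sketch below rebuilds the same kind of summary from a made-up importance history. The values and the names `imp.history`, `shadow.max`, and `toy.stats` are hypothetical, and in practice the package's `attStats()` function produces this table directly.

```r
# Hypothetical importance history: rows are runs, columns are attributes.
imp.history <- cbind(GrLivArea = c(20.1, 21.4, 19.8, 22.0),
                     PoolArea  = c(-0.5, 1.2, -1.1, 0.3))

# Hypothetical importance of the best shadow attribute in each run.
shadow.max <- c(2.0, 1.0, 2.5, 1.8)

toy.stats <- data.frame(
  attr      = colnames(imp.history),
  meanImp   = colMeans(imp.history),
  medianImp = apply(imp.history, 2, median),
  minImp    = apply(imp.history, 2, min),
  maxImp    = apply(imp.history, 2, max),
  # Fraction of runs in which the attribute beat the best shadow.
  normHits  = colMeans(imp.history > shadow.max),
  row.names = NULL
)

# Order by median importance, highest first, as in the table above.
toy.stats <- toy.stats[order(-toy.stats$medianImp), ]
print(toy.stats)
```

Sorting by the median rather than the mean explains why, for example, X2ndFlrSF is listed ahead of TotalBsmtSF in the table even though its mean importance is slightly lower.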
Reference
https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/
https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf
https://www.r-bloggers.com/feature-selection-all-relevant-selection-with-the-boruta-package/