Kaggle/House Prices

Using XGBoost For Feature Selection by Mei-Cheng Shih (With Python)


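This kernel was inspired by a post by JMT5802. Its aim is to use XGBoost in place of random forest (RF), the core component of the Boruta package. Since XGBoost gives better predictions than RF in this case, the results of this kernel should be more representative. The code also includes the data preprocessing I used.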

First, import the packages needed to load and preprocess the data.

In [1]:
from scipy.stats.mstats import mode
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import LabelEncoder

"""
Read Data
"""
train=pd.read_csv('../input/train.csv')
test=pd.read_csv('../input/test.csv')
target=train['SalePrice']
train=train.drop(['SalePrice'],axis=1)
trainlen=train.shape[0]

Combine the train and test sets for preprocessing.

In [2]:
alldata=pd.concat([train, test], axis=0, join='outer',ignore_index=True)
alldata=alldata.drop(['Id','Utilities'],axis=1)
# Cast every int64 column except MSSubClass to float64 (whole-column assignment so the dtype changes)
int_cols=alldata.columns[(alldata.dtypes=='int64') & (alldata.columns!='MSSubClass')]
alldata[int_cols]=alldata[int_cols].astype('float64')
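
The combine-then-split pattern used here can be sketched on toy data. The two frames and all of their values below are made up for illustration; only the column names `Id`, `LotArea` and `MSSubClass` come from the data set. Assigning whole columns, rather than writing through an indexer, is one way to make sure the dtype really becomes float64.

```python
import pandas as pd

# Toy stand-ins for train/test (hypothetical values, not the competition data).
train = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [100, 200, 300], "MSSubClass": [60, 20, 60]})
test = pd.DataFrame({"Id": [4, 5], "LotArea": [400, 500], "MSSubClass": [20, 20]})
trainlen = train.shape[0]

# Stack the two sets so every preprocessing step sees the same columns.
alldata = pd.concat([train, test], axis=0, ignore_index=True).drop(columns=["Id"])

# Cast integer columns to float, leaving MSSubClass as an integer code.
int_cols = [c for c in alldata.columns
            if alldata[c].dtype == "int64" and c != "MSSubClass"]
alldata[int_cols] = alldata[int_cols].astype("float64")

# Split back by position: the first trainlen rows are the training set.
train_part = alldata.iloc[:trainlen]
test_part = alldata.iloc[trainlen:]
print(alldata.dtypes)
```

Because `ignore_index=True` renumbers the rows from 0, slicing by position with `iloc` recovers the two parts exactly.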

Missing values are imputed according to the data description: some features with 0 and some with the median.

In [3]:
fMedlist=['LotFrontage']
fArealist=['MasVnrArea','TotalBsmtSF','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','BsmtFullBath', 'BsmtHalfBath','Fireplaces','GarageArea','GarageYrBlt','GarageCars']

# Numeric features imputed with 0
for i in fArealist:
    alldata.loc[alldata[i].isnull(),i]=0

# Numeric features imputed with the column median
for i in fMedlist:
    alldata.loc[alldata[i].isnull(),i]=np.nanmedian(alldata[i])

### Transforming Data
le=LabelEncoder()
nacount_category=np.array(alldata.columns[((alldata.dtypes=='int64') | (alldata.dtypes=='object')) & (pd.isnull(alldata).sum()>0)])
category=np.array(alldata.columns[((alldata.dtypes=='int64') | (alldata.dtypes=='object'))])
Bsmtset=set(['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2'])
MasVnrset=set(['MasVnrType'])
Garageset=set(['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond'])
Fireplaceset=set(['FireplaceQu'])
Poolset=set(['PoolQC'])
NAset=set(['Fence','MiscFeature','Alley'])

# Categorical features: 'Empty' where the related size/count feature is 0,
# otherwise the most frequent value of the column
for i in nacount_category:
    if i in Bsmtset:
        alldata.loc[alldata[i].isnull() & (alldata['TotalBsmtSF']==0),i]='Empty'
        alldata.loc[alldata[i].isnull(),i]=alldata[i].value_counts().index[0]
    elif i in MasVnrset:
        alldata.loc[alldata[i].isnull() & (alldata['MasVnrArea']==0),i]='Empty'
        alldata.loc[alldata[i].isnull(),i]=alldata[i].value_counts().index[0]
    elif i in Garageset:
        alldata.loc[alldata[i].isnull() & (alldata['GarageArea']==0),i]='Empty'
        alldata.loc[alldata[i].isnull(),i]=alldata[i].value_counts().index[0]
    elif i in Fireplaceset:
        alldata.loc[alldata[i].isnull() & (alldata['Fireplaces']==0),i]='Empty'
        alldata.loc[alldata[i].isnull(),i]=alldata[i].value_counts().index[0]
    elif i in Poolset:
        alldata.loc[alldata[i].isnull() & (alldata['PoolArea']==0),i]='Empty'
        alldata.loc[alldata[i].isnull(),i]=alldata[i].value_counts().index[0]
    elif i in NAset:
        alldata.loc[alldata[i].isnull(),i]='Empty'
    else:
        alldata.loc[alldata[i].isnull(),i]=alldata[i].value_counts().index[0]

# Label-encode every categorical column (whole-column assignment so the dtype becomes integer)
for i in category:
    alldata[i]=le.fit_transform(alldata[i])

# Split back by position: the first trainlen rows are the training set
train=alldata.iloc[:trainlen,:]
test=alldata.iloc[trainlen:,:]
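
The "Empty or most frequent value" rule applied to the basement, masonry, garage, fireplace and pool columns can be sketched on a toy frame. The helper name `fill_with_empty_or_mode` and the data values are hypothetical and only serve the illustration; the column names `GarageType` and `GarageArea` come from the data set.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the competition files).
toy = pd.DataFrame({
    "GarageArea": [0.0, 480.0, 300.0, 520.0],
    "GarageType": [None, "Attchd", None, "Attchd"],
})

def fill_with_empty_or_mode(df, col, area_col):
    """Fill missing `col`: 'Empty' where `area_col` is 0, else the most frequent value."""
    out = df.copy()
    # A missing category next to a zero area means the house has no such part.
    out.loc[out[col].isnull() & (out[area_col] == 0), col] = "Empty"
    # Anything still missing is treated as unknown and gets the most frequent value.
    out.loc[out[col].isnull(), col] = out[col].value_counts().index[0]
    return out

filled = fill_with_empty_or_mode(toy, "GarageType", "GarageArea")
print(filled["GarageType"].tolist())
```

The first row has no garage area, so its missing type becomes `Empty`; the third row has a garage whose type is unknown, so it receives the most frequent type.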

Import the packages needed for the feature selection process.

In [4]:
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit  # sklearn.cross_validation was deprecated in 0.18 and removed in 0.20
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle

Drop a few outliers. The outliers were detected with Python's statsmodels package; the details are omitted here.

In [5]:
o=[30,462,523,632,968,970, 1298, 1324]

train=train.drop(o,axis=0)
target=target.drop(o,axis=0)

train.index=range(0,train.shape[0])
target.index=range(0,train.shape[0])

Set up the XGB model. The parameters were obtained from cross-validation based on a Bayesian optimization process.

In [6]:
est=xgb.XGBRegressor(colsample_bytree=0.4,
                 gamma=0.045,
                 learning_rate=0.07,
                 max_depth=20,
                 min_child_weight=1.5,
                 n_estimators=300,
                 reg_alpha=0.65,
                 reg_lambda=0.45,
                 subsample=0.95)

Start the test process. The basic idea is to randomly permute the values in each column and observe the effect of that permutation. As the feature-importance score I used ((MSE of original data) - (MSE of permuted data)) / (MSE of original data), so a more negative score means the permutation degraded the predictions more.

In [7]:
n=200

scores=pd.DataFrame(np.zeros([n,train.shape[1]]))
scores.columns=train.columns
ct=0

# n random splits, each holding out 25% of the rows for evaluation
for train_idx, test_idx in ShuffleSplit(n_splits=n, test_size=.25).split(train):
    ct+=1
    X_train, X_test = train.iloc[train_idx,:], train.iloc[test_idx,:]
    Y_train, Y_test = target.iloc[train_idx], target.iloc[test_idx]
    r = est.fit(X_train, Y_train)
    acc = mean_squared_error(Y_test, est.predict(X_test))
    for i in range(train.shape[1]):
        # Permute one column of the held-out data and measure the change in MSE
        X_t = X_test.copy()
        X_t.iloc[:,i]=shuffle(np.array(X_t.iloc[:, i]))
        shuff_acc =  mean_squared_error(Y_test, est.predict(X_t))
        scores.iloc[ct-1,i]=((acc-shuff_acc)/acc)
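
To make the sign of the score concrete, here is a minimal library-free sketch of the same permutation idea. The fixed `predict` function is a hypothetical stand-in for a fitted model (it is not XGBoost), and the data are synthetic; only the score formula mirrors the loop above.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def predict(rows):
    # Hypothetical "model": it only ever looks at the first feature.
    return [3.0 * r[0] for r in rows]

def permutation_scores(rows, y, predict_fn, rng):
    """Return (base MSE - permuted MSE) / base MSE for every column."""
    base = mse(y, predict_fn(rows))
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        rng.shuffle(col)  # break the link between column j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
        scores.append((base - mse(y, predict_fn(permuted))) / base)
    return scores

rng = random.Random(0)
rows = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [3.0 * r[0] + rng.gauss(0, 1) for r in rows]  # target depends on feature 0 only

scores = permutation_scores(rows, y, predict, rng)
print(scores)
```

Shuffling the feature the model relies on pushes its score far below zero, while shuffling the unused feature leaves the predictions, and therefore the score, unchanged at zero.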

Output the results (summary statistics of the score changes: mean, median, maximum and minimum).

In [8]:
fin_score=pd.DataFrame(np.zeros([train.shape[1],4]))
fin_score.columns=['Mean','Median','Min','Max']
fin_score.index=train.columns
fin_score['Mean']=scores.mean()
fin_score['Median']=scores.median()
fin_score['Min']=scores.min()
fin_score['Max']=scores.max()

Let's look at the feature importances. The higher the value, the less important the feature: a strongly negative score means that shuffling the feature raised the MSE sharply.

In [9]:
pd.set_option('display.max_rows', None)
fin_score.sort_values('Mean',axis=0)
Out[9]:
                    Mean         Median        Min        Max
OverallQual    -1.721634  -1.695310e+00  -2.974808  -0.890185
GrLivArea      -0.684206  -6.807324e-01  -1.138712  -0.287562
TotalBsmtSF    -0.659126  -6.494009e-01  -1.147119  -0.259313
GarageCars     -0.224096  -2.209438e-01  -0.419189  -0.005701
2ndFlrSF       -0.204795  -2.015302e-01  -0.382840  -0.028903
ExterQual      -0.131748  -1.260424e-01  -0.430843   0.007827
1stFlrSF       -0.118198  -1.120805e-01  -0.287717   0.032578
TotRmsAbvGrd   -0.114792  -1.109858e-01  -0.304013   0.031706
BsmtFinSF1     -0.111651  -1.070108e-01  -0.273698   0.006067
LotArea        -0.098553  -9.903931e-02  -0.211548   0.007124
YearRemodAdd   -0.094342  -9.050064e-02  -0.257327   0.001438
YearBuilt      -0.066130  -6.476141e-02  -0.173826   0.002589
KitchenQual    -0.062434  -5.731984e-02  -0.207672   0.088340
GarageArea     -0.057476  -5.727871e-02  -0.166980   0.051742
OverallCond    -0.055216  -5.341505e-02  -0.113646   0.001735
BsmtQual       -0.040043  -3.495320e-02  -0.268050   0.024164
Neighborhood   -0.037366  -3.574983e-02  -0.091185   0.010521
FullBath       -0.035615  -3.065354e-02  -0.130656   0.021763
Fireplaces     -0.033859  -3.244807e-02  -0.123991   0.056778
FireplaceQu    -0.026058  -2.476338e-02  -0.082621   0.014619
BsmtExposure   -0.025270  -2.555516e-02  -0.105617   0.031870
GarageType     -0.017804  -1.614043e-02  -0.067202   0.018833
BsmtFullBath   -0.016328  -1.562269e-02  -0.051750   0.017629
HalfBath       -0.016214  -1.419492e-02  -0.105803   0.029675
SaleCondition  -0.016183  -1.453541e-02  -0.075899   0.036155
OpenPorchSF    -0.015003  -1.334575e-02  -0.087058   0.026106
GarageYrBlt    -0.013680  -1.196422e-02  -0.053336   0.019144
MSZoning       -0.011545  -1.119452e-02  -0.036034   0.006571
BsmtUnfSF      -0.010716  -1.274519e-02  -0.062053   0.047117
LotFrontage    -0.010319  -6.702814e-03  -0.086270   0.054732
CentralAir     -0.009013  -8.141515e-03  -0.024085   0.001464
BedroomAbvGr   -0.008846  -9.131193e-03  -0.040072   0.023932
HouseStyle     -0.008089  -6.875350e-03  -0.054818   0.022105
MSSubClass     -0.007513  -6.740919e-03  -0.042186   0.073967
MasVnrArea     -0.007067  -6.656751e-03  -0.057583   0.040266
WoodDeckSF     -0.006501  -5.816901e-03  -0.037152   0.033252
KitchenAbvGr   -0.006229  -5.256144e-03  -0.035395   0.007294
Functional     -0.005711  -5.263566e-03  -0.031933   0.012218
SaleType       -0.005616  -5.602358e-03  -0.043444   0.031259
BsmtFinType1   -0.004452  -4.816908e-03  -0.022984   0.017094
Exterior1st    -0.004175  -4.227994e-03  -0.022402   0.023642
LotShape       -0.004066  -3.043836e-03  -0.029950   0.019051
LandSlope      -0.003859  -3.133321e-03  -0.036424   0.012639
GarageFinish   -0.003601  -3.235027e-03  -0.030698   0.016652
Condition1     -0.003532  -3.748848e-03  -0.018602   0.008014
HeatingQC      -0.003156  -2.455328e-03  -0.040024   0.017052
BldgType       -0.002625  -2.220249e-03  -0.014800   0.007582
PavedDrive     -0.002481  -2.448869e-03  -0.012698   0.007501
BsmtCond       -0.001908  -1.913248e-03  -0.012608   0.008016
Alley          -0.001363  -5.523610e-04  -0.036356   0.013305
ScreenPorch    -0.001353  -1.099849e-03  -0.010694   0.008861
GarageQual     -0.001119  -1.327409e-03  -0.011767   0.012773
Foundation     -0.001045  -5.493356e-04  -0.034594   0.038479
LandContour    -0.000996  -1.172577e-03  -0.016617   0.025768
MoSold         -0.000942  -1.244789e-03  -0.016880   0.032836
Electrical     -0.000923  -7.329373e-04  -0.008705   0.004379
Exterior2nd    -0.000865  -1.325317e-03  -0.022625   0.029316
PoolArea       -0.000597  -1.084900e-05  -0.022889   0.007964
ExterCond      -0.000440  -1.713312e-04  -0.025655   0.010897
RoofMatl       -0.000367  -5.775049e-05  -0.016337   0.015518
MiscFeature    -0.000316  -2.680548e-04  -0.003185   0.001744
GarageCond     -0.000284  -3.102211e-04  -0.006371   0.004772
BsmtFinType2   -0.000177  -9.897633e-05  -0.005021   0.004085
LotConfig      -0.000109  -4.425090e-04  -0.016271   0.025805
MiscVal        -0.000055  -4.055614e-05  -0.004087   0.002187
Fence          -0.000033  -1.978612e-04  -0.007696   0.015547
BsmtFinSF2     -0.000031   7.761363e-05  -0.010609   0.009558
Condition2     -0.000022   0.000000e+00  -0.001144   0.000447
Street         -0.000012   0.000000e+00  -0.001156   0.000577
Heating        -0.000006  -9.517935e-06  -0.001694   0.002295
BsmtHalfBath    0.000017  -1.806532e-05  -0.005169   0.005204
3SsnPorch       0.000024   4.257717e-06  -0.003293   0.001353
LowQualFinSF    0.000086  -6.713153e-05  -0.008483   0.013966
RoofStyle       0.000102   2.808409e-05  -0.015716   0.019742
PoolQC          0.000126  -1.115166e-07  -0.001908   0.009504
MasVnrType      0.000416  -1.399580e-04  -0.015135   0.024665
YrSold          0.000898   7.453469e-04  -0.008989   0.018542
EnclosedPorch   0.000951   7.400470e-04  -0.010539   0.034215

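The results differ slightly from those of JMT5802 but are broadly similar. For example, OverallQual and GrLivArea are important in both results, while PoolArea and PoolQC are unimportant in both. Also, judging from the experiment at the link below, the difference between the two experiments is not clear-cut (the results are almost the same).

The main code was also adapted from the example at the link below. Special thanks to the author of that blog.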

Updates:

After some tests: I removed the features in the list below, and the score improved slightly.

 ['Exterior2nd', 'EnclosedPorch', 'RoofMatl', 'PoolQC', 'BsmtHalfBath', 'RoofStyle', 'PoolArea', 'MoSold', 'Alley', 'Fence', 'LandContour', 'MasVnrType', '3SsnPorch', 'LandSlope']
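
Dropping the listed features before training could look like the sketch below. The helper `drop_features` and the toy frame are hypothetical; the list itself is copied from above.

```python
import pandas as pd

to_drop = ['Exterior2nd', 'EnclosedPorch', 'RoofMatl', 'PoolQC', 'BsmtHalfBath',
           'RoofStyle', 'PoolArea', 'MoSold', 'Alley', 'Fence', 'LandContour',
           'MasVnrType', '3SsnPorch', 'LandSlope']

def drop_features(df, names):
    """Return a copy of df without the listed columns, ignoring names that are absent."""
    present = [c for c in names if c in df.columns]
    return df.drop(columns=present)

# Toy frame with made-up values; in the kernel this would be train and test.
toy = pd.DataFrame({"OverallQual": [7, 6], "PoolQC": [0, 0], "Fence": [1, 2]})
reduced = drop_features(toy, to_drop)
print(list(reduced.columns))
```

Applying the same helper to both the training and the test frame keeps their columns aligned.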