##########################################################################
This post is a translation of one of the top kernels for House Prices on Kaggle.
There appears to be no copyright problem, but if one arises this post will be taken down immediately.
##########################################################################
Fun with Real Estate
Data Driven Real Estate Analysis
This dataset gives us a chance to find out what really affects the value of a house, free of the useless, impulse-driven factors of a reality TV show, so I find digging into it quite interesting.
Plan
- Gather and explore the data
- Clean up the variables and create the ones we need
- Three models: linear, randomForest, and xgboost
- Pick the best model and generate predictions for submission
Clean the Data
So, what do we have to work with here?
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
The first thing to do is convert some of the character variables into numeric ones (e.g. one-hot encoding). How about street type?
table(train$Street)
##
## Grvl Pave
## 6 1454
That doesn't look like much to work with. Let's just flag whether the street is paved or not. What about LotShape?
train$paved[train$Street == "Pave"] <- 1
train$paved[train$Street != "Pave"] <- 0
table(train$LotShape)
##
## IR1 IR2 IR3 Reg
## 484 41 10 925
There's some variation among the irregular categories. Let's go with regular vs. not-regular for now, and decide later whether to split this variable up further. Next up is land contour.
train$regshape[train$LotShape == "Reg"] <- 1
train$regshape[train$LotShape != "Reg"] <- 0
table(train$LandContour)
##
## Bnk HLS Low Lvl
## 63 50 36 1311
To save space, I'll skip over the rest of the categorical variables, which get the same treatment (a minimal sketch of the pattern follows below).
If you want to see all of the code, follow the link above and open the Code tab; there are roughly 300 lines of it.
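As a sketch of that skipped pattern (an assumed reconstruction, not the kernel's exact code), here is the same binary recoding applied to LandContour; a flat variable consistent with this shows up in the model summaries later:
# Assumed reconstruction of the skipped recoding; the real version is in the Code tab
train$flat[train$LandContour == "Lvl"] <- 1
train$flat[train$LandContour != "Lvl"] <- 0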
Poking Around
So that's step one done. Now let's tidy up the numeric (non-categorical) variables; a few useless ones will get dropped.
Another thing to do is make some plots worth looking at. For example, if a house has a pool, is it more important (better) to also have something like a backyard? I used correlation plots to explore these relationships between variables; they guide which derived variables to create and which variables to feed into the models.
library(corrplot)
correlations <- cor(train[,c(5,6,7,8, 16:25)], use="everything")
corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
correlations <- cor(train[,c(5,6,7,8, 26:35)], use="everything")
corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
correlations <- cor(train[,c(5,6,7,8, 66:75)], use="everything")
## Warning in cor(train[, c(5, 6, 7, 8, 66:75)], use = "everything"): the
## standard deviation is zero
corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
In any case, I'll drop some of the area-related variables, because total size and the size of a single floor are obviously correlated.
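That claim is easy to check numerically; a one-line sketch (not in the original):
# Total basement area vs. first-floor area should be strongly correlated
cor(train$TotalBsmtSF, train$X1stFlrSF)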
pairs(~YearBuilt+OverallQual+TotalBsmtSF+GrLivArea, data=train,
      main="Simple Scatterplot Matrix")
This is also interesting. I picked out a few of the strongly correlated variables: clearly, bigger above-ground living areas go with bigger basements, and both have grown over time. I was also interested in how SalePrice relates to some of the numeric variables, but that was a bit trickier to visualize.
library(car)
scatterplot(SalePrice ~ YearBuilt, data=train, xlab="Year Built", ylab="Sale Price", grid=FALSE)
scatterplot(SalePrice ~ YrSold, data=train, xlab="Year Sold", ylab="Sale Price", grid=FALSE)
scatterplot(SalePrice ~ X1stFlrSF, data=train, xlab="Square Footage Floor 1", ylab="Sale Price", grid=FALSE)
Prices are higher for newer houses. We can also see that sale prices fell when we might have expected them to rise (the sale years span 2006-2010, so the housing crash sits inside this window). Finally, there are some outliers in the first floor square footage variable. That data isn't great, but it won't have a big effect on the predictions.
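To actually look at those outliers, a quick illustrative sketch (the cutoff is read off the plot, not taken from the kernel):
# Rows with unusually large first-floor square footage
train[train$X1stFlrSF > 3000, c("Id", "X1stFlrSF", "SalePrice")]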
# Replace NAs with 0
train$GarageYrBlt[is.na(train$GarageYrBlt)] <- 0
train$MasVnrArea[is.na(train$MasVnrArea)] <- 0
train$LotFrontage[is.na(train$LotFrontage)] <- 0
# Interaction terms / derived variables based on the correlations above
train$year_qual <- train$YearBuilt*train$OverallQual #quality x year built
train$year_r_qual <- train$YearRemodAdd*train$OverallQual #quality x remodel
train$qual_bsmt <- train$OverallQual*train$TotalBsmtSF #quality x basement size
train$livarea_qual <- train$OverallQual*train$GrLivArea #quality x living area
train$qual_bath <- train$OverallQual*train$FullBath #quality x baths
train$qual_ext <- train$OverallQual*train$exterior_cond #quality x exterior
#names(train)
Model Prepping
Now let's split the data! caret's partitioning function is quite handy for this.
library(caret)  # for createDataPartition(); assumed to be attached earlier in the full kernel
set.seed(123)   # assumed seed; the excerpt doesn't show one
outcome <- train$SalePrice
partition <- createDataPartition(y=outcome,
                                 p=.5,
                                 list=F)
training <- train[partition,]
testing <- train[-partition,]
A Linear Model
Finally we have our data in shape and can build some models. Since the outcome we're predicting is SalePrice, a plain linear regression (rather than a GLM) will do. I like to start with one sensible regression like this to get a feel for the modeling.
lm_model_15 <- lm(SalePrice ~ ., data=training)
summary(lm_model_15)
##
## Call:
## lm(formula = SalePrice ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -232667 -14959 -97 13514 173651
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.668e+06 2.088e+06 0.799 0.424630
## Id -1.714e+00 2.891e+00 -0.593 0.553480
## MSSubClass -1.692e+02 4.194e+01 -4.035 6.11e-05 ***
## LotFrontage -6.983e+01 4.057e+01 -1.721 0.085713 .
## LotArea -1.538e-01 2.015e-01 -0.763 0.445686
## OverallQual -2.199e+05 1.469e+05 -1.498 0.134740
## OverallCond 8.638e+03 1.597e+03 5.409 8.94e-08 ***
## YearBuilt -2.774e+02 2.660e+02 -1.043 0.297487
## YearRemodAdd -3.798e+02 4.297e+02 -0.884 0.377150
## MasVnrArea 1.333e+01 1.073e+01 1.243 0.214464
## BsmtFinSF1 4.520e+01 1.793e+01 2.520 0.011962 *
## BsmtFinSF2 5.020e+01 1.877e+01 2.675 0.007662 **
## BsmtUnfSF 2.517e+01 1.736e+01 1.450 0.147651
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 6.945e+01 1.823e+01 3.810 0.000152 ***
## X2ndFlrSF 4.897e+01 1.909e+01 2.565 0.010539 *
## LowQualFinSF -1.573e+01 3.251e+01 -0.484 0.628590
## GrLivArea NA NA NA NA
## BsmtFullBath 7.686e+01 3.697e+03 0.021 0.983419
## BsmtHalfBath -1.376e+04 5.783e+03 -2.380 0.017594 *
## FullBath -7.716e+03 1.444e+04 -0.534 0.593365
## HalfBath 4.247e+03 3.887e+03 1.093 0.274961
## BedroomAbvGr -3.450e+03 2.388e+03 -1.445 0.148963
## KitchenAbvGr -1.174e+04 8.531e+03 -1.376 0.169211
## TotRmsAbvGrd 4.940e+03 1.641e+03 3.009 0.002719 **
## Fireplaces 8.204e+03 3.923e+03 2.091 0.036904 *
## GarageYrBlt -1.192e+01 7.525e+00 -1.584 0.113672
## GarageCars 1.427e+04 3.730e+03 3.827 0.000142 ***
## GarageArea -1.192e+01 1.273e+01 -0.936 0.349533
## WoodDeckSF 4.207e+01 1.083e+01 3.885 0.000113 ***
## OpenPorchSF -5.440e+00 1.969e+01 -0.276 0.782361
## EnclosedPorch -1.710e+01 2.135e+01 -0.801 0.423446
## X3SsnPorch 8.117e+01 3.738e+01 2.171 0.030263 *
## ScreenPorch 7.006e+01 2.426e+01 2.888 0.004008 **
## PoolArea -1.657e+02 3.855e+01 -4.297 1.99e-05 ***
## MiscVal -9.980e-01 1.768e+00 -0.565 0.572583
## MoSold -2.377e+02 4.568e+02 -0.520 0.602937
## YrSold -2.507e+02 9.341e+02 -0.268 0.788471
## paved 2.186e+04 1.726e+04 1.267 0.205654
## regshape -9.187e+02 2.770e+03 -0.332 0.740281
## flat 1.459e+04 4.857e+03 3.003 0.002774 **
## pubutil 6.443e+04 3.382e+04 1.905 0.057171 .
## gentle_slope -1.564e+04 6.784e+03 -2.305 0.021494 *
## culdesac_fr3 NA NA NA NA
## nbhd_price_level 1.772e+04 2.641e+03 6.708 4.29e-11 ***
## pos_features_1 -2.946e+04 8.600e+03 -3.426 0.000652 ***
## pos_features_2 NA NA NA NA
## twnhs_end_or_1fam -7.559e+03 7.279e+03 -1.038 0.299465
## house_style_level -2.528e+03 4.158e+03 -0.608 0.543374
## roof_hip_shed 8.045e+03 3.418e+03 2.354 0.018889 *
## roof_matl_hi 2.820e+04 1.537e+04 1.835 0.066928 .
## exterior_1 -1.215e+04 7.285e+03 -1.668 0.095797 .
## exterior_2 1.883e+04 7.133e+03 2.640 0.008491 **
## exterior_mason_1 -6.577e+03 3.723e+03 -1.766 0.077797 .
## exterior_cond -7.244e+04 1.431e+04 -5.062 5.42e-07 ***
## exterior_cond2 -1.458e+03 3.984e+03 -0.366 0.714469
## found_concrete -2.059e+03 4.042e+03 -0.509 0.610686
## bsmt_cond1 5.173e+03 3.217e+03 1.608 0.108281
## bsmt_cond2 -1.516e+03 3.797e+03 -0.399 0.689781
## bsmt_exp 3.453e+03 1.493e+03 2.313 0.021051 *
## bsmt_fin1 3.168e+03 1.501e+03 2.111 0.035124 *
## bsmt_fin2 2.452e+03 1.975e+03 1.241 0.214897
## gasheat 9.562e+03 1.517e+04 0.630 0.528765
## heatqual 9.957e+00 1.682e+03 0.006 0.995277
## air -1.267e+04 6.738e+03 -1.881 0.060473 .
## standard_electric -3.917e+03 5.117e+03 -0.765 0.444277
## kitchen 9.784e+03 3.213e+03 3.045 0.002422 **
## fire -2.155e+03 1.830e+03 -1.178 0.239331
## gar_attach -5.462e+02 3.861e+03 -0.141 0.887550
## gar_finish 1.349e+03 3.660e+03 0.369 0.712498
## garqual 1.218e+04 7.688e+03 1.584 0.113684
## garqual2 -5.855e+03 7.688e+03 -0.762 0.446598
## paved_drive NA NA NA NA
## housefunction 1.251e+04 5.454e+03 2.293 0.022158 *
## pool_good 2.745e+05 3.211e+04 8.548 < 2e-16 ***
## priv_fence 3.332e+02 6.474e+03 0.051 0.958971
## sale_cat 4.281e+03 3.160e+03 1.355 0.175908
## sale_cond 9.957e+03 4.580e+03 2.174 0.030044 *
## zone -1.465e+02 3.457e+03 -0.042 0.966218
## alleypave -1.018e+03 7.531e+03 -0.135 0.892559
## year_qual 5.615e+01 4.335e+01 1.295 0.195672
## year_r_qual 5.088e+01 7.915e+01 0.643 0.520561
## qual_bsmt -6.519e+00 2.494e+00 -2.614 0.009166 **
## livarea_qual -2.214e+00 2.475e+00 -0.895 0.371274
## qual_bath 1.879e+03 2.314e+03 0.812 0.417072
## qual_ext 1.113e+04 2.084e+03 5.338 1.30e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30930 on 650 degrees of freedom
## Multiple R-squared: 0.8694, Adjusted R-squared: 0.8533
## F-statistic: 54.07 on 80 and 650 DF, p-value: < 2.2e-16
Plenty of variables can be dropped right away. That's fine, although some multicollinearity is making a few variables look significant when they shouldn't be; for now it's acceptable. The R-squared, which measures the share of variance the model explains, isn't bad either.
lm_model_15 <- lm(SalePrice ~ MSSubClass+LotArea+BsmtUnfSF+
                    X1stFlrSF+X2ndFlrSF+GarageCars+
                    WoodDeckSF+nbhd_price_level+
                    exterior_cond+pos_features_1+
                    bsmt_exp+kitchen+housefunction+pool_good+sale_cond+
                    qual_ext+qual_bsmt, data=training)
summary(lm_model_15)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + BsmtUnfSF + X1stFlrSF +
## X2ndFlrSF + GarageCars + WoodDeckSF + nbhd_price_level +
## exterior_cond + pos_features_1 + bsmt_exp + kitchen + housefunction +
## pool_good + sale_cond + qual_ext + qual_bsmt, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -322019 -16772 -503 14881 218443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.815e+04 1.506e+04 -3.197 0.001449 **
## MSSubClass -1.865e+02 3.162e+01 -5.899 5.64e-09 ***
## LotArea -2.211e-01 1.839e-01 -1.202 0.229787
## BsmtUnfSF -1.354e+01 3.248e+00 -4.169 3.43e-05 ***
## X1stFlrSF 6.134e+01 6.535e+00 9.387 < 2e-16 ***
## X2ndFlrSF 4.349e+01 3.703e+00 11.743 < 2e-16 ***
## GarageCars 1.069e+04 2.221e+03 4.812 1.83e-06 ***
## WoodDeckSF 3.850e+01 1.102e+01 3.493 0.000507 ***
## nbhd_price_level 1.854e+04 2.247e+03 8.249 7.68e-16 ***
## exterior_cond -4.108e+04 7.112e+03 -5.776 1.14e-08 ***
## pos_features_1 -1.696e+04 8.914e+03 -1.903 0.057443 .
## bsmt_exp 7.569e+03 1.396e+03 5.421 8.12e-08 ***
## kitchen 1.196e+04 3.136e+03 3.813 0.000149 ***
## housefunction 1.240e+04 5.449e+03 2.275 0.023223 *
## pool_good 1.419e+05 2.536e+04 5.596 3.13e-08 ***
## sale_cond 1.386e+04 3.270e+03 4.238 2.55e-05 ***
## qual_ext 7.227e+03 7.563e+02 9.555 < 2e-16 ***
## qual_bsmt -1.534e+00 7.689e-01 -1.995 0.046382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34240 on 713 degrees of freedom
## Multiple R-squared: 0.8244, Adjusted R-squared: 0.8202
## F-statistic: 196.9 on 17 and 713 DF, p-value: < 2.2e-16
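Before scoring it, one way to revisit the earlier multicollinearity concern: the car package is already loaded, so variance inflation factors for this reduced model are one call away (a quick sketch, not part of the original):
# VIFs well above 5-10 would flag redundant predictors
vif(lm_model_15)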
This is our final linear model. So what RMSE does it produce? That number is what we ultimately want to know.
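The rmse() used below is not defined in this excerpt; it behaves like rmse() from the Metrics package (an assumption), and a minimal equivalent is easy to write:
# Root mean squared error of two numeric vectors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))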
prediction <- predict(lm_model_15, testing, type="response")
model_output <- cbind(testing, prediction)
model_output$log_prediction <- log(model_output$prediction)
model_output$log_SalePrice <- log(model_output$SalePrice)
# Check the RMSE
rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.1625762
A Random Forest
Considering it's just a linear regression, that's not bad at all. Let's train a random forest next. Since a random forest does its own feature selection, we can just feed it all of the variables.
library(randomForest)  # assumed to be attached earlier in the full kernel
model_1 <- randomForest(SalePrice ~ ., data=training)
# Predict on the held-out testing set
prediction <- predict(model_1, testing)
model_output <- cbind(testing, prediction)
model_output$log_prediction <- log(model_output$prediction)
model_output$log_SalePrice <- log(model_output$SalePrice)
# Check the RMSE
rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.129874
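Not in the original, but since the point was that the forest does its own feature selection, a quick sketch shows which variables it actually leaned on:
# Top 10 variables by importance (IncNodePurity for regression forests)
imp <- importance(model_1)
head(imp[order(-imp[, 1]), , drop = FALSE], 10)
varImpPlot(model_1)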
An xgboost
Nice! How about an xgboost?
library(xgboost)  # assumed to be attached earlier in the full kernel
library(Matrix)   # for the sparseMatrix class
# Gather and format the data
training$log_SalePrice <- log(training$SalePrice)
testing$log_SalePrice <- log(testing$SalePrice)
# Convert the data frames to matrices
trainData <- as.matrix(training, rownames.force=NA)
testData <- as.matrix(testing, rownames.force=NA)
# Convert to sparse matrices
train2 <- as(trainData, "sparseMatrix")
test2 <- as(testData, "sparseMatrix")
#####
#colnames(train2)
#Cross Validate the model
vars <- c(2:37, 39:86) # column indexes of the variables used in the model
trainD <- xgb.DMatrix(data = train2[,vars], label = train2[,"SalePrice"]) #Convert to xgb.DMatrix format
# Cross-validate the model with 4 folds
cv.sparse <- xgb.cv(data = trainD,
                    nrounds = 600,
                    min_child_weight = 0,
                    max_depth = 10,
                    eta = 0.02,
                    subsample = .7,
                    colsample_bytree = .7,
                    booster = "gbtree",
                    eval_metric = "rmse",
                    verbose = TRUE,
                    print_every_n = 50,
                    nfold = 4,
                    nthread = 2,
                    objective = "reg:linear")
## [1] train-rmse:195005.003906+1012.923119 test-rmse:195150.914062+2962.570278
## [51] train-rmse:78467.261719+662.759181 test-rmse:82230.642578+4199.727883
## [101] train-rmse:34697.281250+804.307500 test-rmse:44915.177735+6595.603107
## [151] train-rmse:17470.770752+945.518765 test-rmse:34572.629395+7150.926451
## [201] train-rmse:9939.554443+894.324742 test-rmse:32001.610351+6913.437300
## [251] train-rmse:6230.806641+720.441702 test-rmse:31467.872070+6651.298992
## [301] train-rmse:4304.733704+586.261003 test-rmse:31520.323731+6552.300338
## [351] train-rmse:3130.306519+460.549863 test-rmse:31694.579590+6542.268936
## [401] train-rmse:2337.401337+340.867469 test-rmse:31877.824218+6583.607992
## [451] train-rmse:1782.607483+260.568346 test-rmse:32051.724121+6624.026290
## [501] train-rmse:1359.533905+190.523414 test-rmse:32144.135254+6630.878098
## [551] train-rmse:1048.664414+146.240873 test-rmse:32221.508301+6641.308896
## [600] train-rmse:821.868271+116.823602 test-rmse:32290.149414+6656.170419
# Train the model
# (Note: in the CV above, test-rmse bottoms out around rounds 250-300 and then
# slowly creeps back up, but the kernel keeps nrounds = 600 for training.)
# Set the hyperparameters
param <- list(colsample_bytree = .7,
              subsample = .7,
              booster = "gbtree",
              max_depth = 10,
              eta = 0.02,
              eval_metric = "rmse",
              objective = "reg:linear")
# Train the xgb model
bstSparse <-
  xgb.train(params = param,
            data = trainD,
            nrounds = 600,
            watchlist = list(train = trainD),
            verbose = TRUE,
            print_every_n = 50,
            nthread = 2)
## [1] train-rmse:195014.750000
## [51] train-rmse:77897.039062
## [101] train-rmse:34041.437500
## [151] train-rmse:17198.125000
## [201] train-rmse:9977.225586
## [251] train-rmse:6436.147461
## [301] train-rmse:4580.320312
## [351] train-rmse:3416.647949
## [401] train-rmse:2639.834473
## [451] train-rmse:2092.339355
## [501] train-rmse:1679.307617
## [551] train-rmse:1357.634521
## [600] train-rmse:1093.774414
Predict, then measure the RMSE.
testD <- xgb.DMatrix(data = test2[,vars])
#Column names must match the inputs EXACTLY
prediction <- predict(bstSparse, testD) #Make the prediction based on the half of the training data set aside
#Put testing prediction and test dataset all together
test3 <- as.data.frame(as.matrix(test2))
prediction <- as.data.frame(as.matrix(prediction))
colnames(prediction) <- "prediction"
model_output <- cbind(test3, prediction)
model_output$log_prediction <- log(model_output$prediction)
model_output$log_SalePrice <- log(model_output$SalePrice)
#Test with RMSE
rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.1197072
Nice result! xgboost performs quite well. Let's call that good and build the submission file. Unless you want to retrain the model on the full sample and generate a submission file, this is effectively the end.
Retrain on the full sample
rm(bstSparse)
# Create matrices from the data frames
retrainData<- as.matrix(train, rownames.force=NA)
# Turn the matrices into sparse matrices
retrain <- as(retrainData, "sparseMatrix")
param <- list(colsample_bytree = .7,
              subsample = .7,
              booster = "gbtree",
              max_depth = 10,
              eta = 0.02,
              eval_metric = "rmse",
              objective = "reg:linear")
retrainD <- xgb.DMatrix(data = retrain[,vars], label = retrain[,"SalePrice"])
#retrain the model using those parameters
bstSparse <-
  xgb.train(params = param,
            data = retrainD,
            nrounds = 600,
            watchlist = list(train = retrainD), # the original watched trainD (the half-sample); watching retrainD seems intended
            verbose = TRUE,
            print_every_n = 50,
            nthread = 2)
## [1] train-rmse:194950.546875
## [51] train-rmse:77041.585938
## [101] train-rmse:33339.394531
## [151] train-rmse:16391.695312
## [201] train-rmse:9333.628906
## [251] train-rmse:6018.882324
## [301] train-rmse:4418.943848
## [351] train-rmse:3487.291504
## [401] train-rmse:2858.591553
## [451] train-rmse:2388.746338
## [501] train-rmse:1956.183472
## [551] train-rmse:1606.430786
## [600] train-rmse:1324.839355
Prepare the prediction data
This simply repeats the work done on the training set; see the Code tab for the details.
Next, transform the data into the format xgboost expects.
# Get the supplied test data ready #
predict <- as.data.frame(test) #Get the test set as a data frame for combining later (naming it 'predict' shadows the base function, but works)
#Create matrices from the data frames
predData<- as.matrix(predict, rownames.force=NA)
#Turn the matrices into sparse matrices
predicting <- as(predData, "sparseMatrix")
Make sure the training set and the test set contain exactly the same variables.
colnames(train[,c(2:37, 39:86)])
## [1] "MSSubClass" "LotFrontage" "LotArea"
## [4] "OverallQual" "OverallCond" "YearBuilt"
## [7] "YearRemodAdd" "MasVnrArea" "BsmtFinSF1"
## [10] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF"
## [13] "X1stFlrSF" "X2ndFlrSF" "LowQualFinSF"
## [16] "GrLivArea" "BsmtFullBath" "BsmtHalfBath"
## [19] "FullBath" "HalfBath" "BedroomAbvGr"
## [22] "KitchenAbvGr" "TotRmsAbvGrd" "Fireplaces"
## [25] "GarageYrBlt" "GarageCars" "GarageArea"
## [28] "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [31] "X3SsnPorch" "ScreenPorch" "PoolArea"
## [34] "MiscVal" "MoSold" "YrSold"
## [37] "paved" "regshape" "flat"
## [40] "pubutil" "gentle_slope" "culdesac_fr3"
## [43] "nbhd_price_level" "pos_features_1" "pos_features_2"
## [46] "twnhs_end_or_1fam" "house_style_level" "roof_hip_shed"
## [49] "roof_matl_hi" "exterior_1" "exterior_2"
## [52] "exterior_mason_1" "exterior_cond" "exterior_cond2"
## [55] "found_concrete" "bsmt_cond1" "bsmt_cond2"
## [58] "bsmt_exp" "bsmt_fin1" "bsmt_fin2"
## [61] "gasheat" "heatqual" "air"
## [64] "standard_electric" "kitchen" "fire"
## [67] "gar_attach" "gar_finish" "garqual"
## [70] "garqual2" "paved_drive" "housefunction"
## [73] "pool_good" "priv_fence" "sale_cat"
## [76] "sale_cond" "zone" "alleypave"
## [79] "year_qual" "year_r_qual" "qual_bsmt"
## [82] "livarea_qual" "qual_bath" "qual_ext"
vars <- c("MSSubClass","LotFrontage","LotArea","OverallQual","OverallCond","YearBuilt",
"YearRemodAdd","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF" ,
"X1stFlrSF","X2ndFlrSF","LowQualFinSF","GrLivArea","BsmtFullBath","BsmtHalfBath" ,
"FullBath","HalfBath","BedroomAbvGr","KitchenAbvGr","TotRmsAbvGrd","Fireplaces" ,
"GarageYrBlt","GarageCars","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch" ,
"X3SsnPorch","ScreenPorch","PoolArea","MiscVal","MoSold","YrSold",
"paved","regshape","flat","pubutil","gentle_slope","culdesac_fr3" ,
"nbhd_price_level" , "pos_features_1","pos_features_2","twnhs_end_or_1fam","house_style_level", "roof_hip_shed" ,
"roof_matl_hi","exterior_1","exterior_2","exterior_mason_1","exterior_cond","exterior_cond2" ,
"found_concrete","bsmt_cond1","bsmt_cond2","bsmt_exp","bsmt_fin1","bsmt_fin2" ,
"gasheat","heatqual","air","standard_electric", "kitchen","fire",
"gar_attach","gar_finish","garqual","garqual2","paved_drive","housefunction",
"pool_good","priv_fence","sale_cat","sale_cond","zone","alleypave",
"year_qual","year_r_qual","qual_bsmt","livarea_qual","qual_bath", "qual_ext")
#colnames(predicting)
colnames(predicting[,vars])
## [1] "MSSubClass" "LotFrontage" "LotArea"
## [4] "OverallQual" "OverallCond" "YearBuilt"
## [7] "YearRemodAdd" "MasVnrArea" "BsmtFinSF1"
## [10] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF"
## [13] "X1stFlrSF" "X2ndFlrSF" "LowQualFinSF"
## [16] "GrLivArea" "BsmtFullBath" "BsmtHalfBath"
## [19] "FullBath" "HalfBath" "BedroomAbvGr"
## [22] "KitchenAbvGr" "TotRmsAbvGrd" "Fireplaces"
## [25] "GarageYrBlt" "GarageCars" "GarageArea"
## [28] "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [31] "X3SsnPorch" "ScreenPorch" "PoolArea"
## [34] "MiscVal" "MoSold" "YrSold"
## [37] "paved" "regshape" "flat"
## [40] "pubutil" "gentle_slope" "culdesac_fr3"
## [43] "nbhd_price_level" "pos_features_1" "pos_features_2"
## [46] "twnhs_end_or_1fam" "house_style_level" "roof_hip_shed"
## [49] "roof_matl_hi" "exterior_1" "exterior_2"
## [52] "exterior_mason_1" "exterior_cond" "exterior_cond2"
## [55] "found_concrete" "bsmt_cond1" "bsmt_cond2"
## [58] "bsmt_exp" "bsmt_fin1" "bsmt_fin2"
## [61] "gasheat" "heatqual" "air"
## [64] "standard_electric" "kitchen" "fire"
## [67] "gar_attach" "gar_finish" "garqual"
## [70] "garqual2" "paved_drive" "housefunction"
## [73] "pool_good" "priv_fence" "sale_cat"
## [76] "sale_cond" "zone" "alleypave"
## [79] "year_qual" "year_r_qual" "qual_bsmt"
## [82] "livarea_qual" "qual_bath" "qual_ext"
The final prediction.
# Column names must match the input data's column names EXACTLY
prediction <- predict(bstSparse, predicting[,vars])
prediction <- as.data.frame(as.matrix(prediction)) #Convert the predictions to a data frame for combining later
colnames(prediction) <- "prediction"
model_output <- cbind(predict, prediction) # Combine the actual test data with the predictions
sub2 <- data.frame(Id = model_output$Id, SalePrice = model_output$prediction)
length(model_output$prediction)
## [1] 1459
write.csv(sub2, file = "sub3.csv", row.names = F)
head(sub2$SalePrice)
## [1] 126802.0 158142.5 172708.9 190513.9 192890.8 175320.2