Kaggle/House Prices

RandomForestRegressor by BradenFitz-Gerald (With Python)

https://www.kaggle.com/dfitzgerald3/house-prices-advanced-regression-techniques/randomforestregressor/notebook

##########################################################################

This post is a translation of one of the top kernels for the House Prices competition on Kaggle.

There does not appear to be a copyright problem, but if one arises this post will be taken down immediately.

##########################################################################

Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Normalizer
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import Imputer

from scipy.stats import skew

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('ggplot')
/opt/conda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Import Data

In [2]:
train = '../input/train.csv'
test = '../input/test.csv'

df_train = pd.read_csv(train)
df_test = pd.read_csv(test)

Define Median Absolute Deviation Function

This function was found here: http://stackoverflow.com/a/22357811/5082694

In [3]:
def is_outlier(points, thresh = 3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)                        # distance of each point from the median
    med_abs_deviation = np.median(diff)         # median absolute deviation (MAD)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh            # flag points whose modified z-score exceeds the threshold
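For intuition, here is a quick check of the function on a small made-up array (the values are hypothetical, not from the dataset); only the single extreme point is flagged:

sample = np.array([10.0, 12.0, 11.0, 10.5, 95.0])
print(is_outlier(sample))   # [False False False False  True]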

Remove Skew from SalePrice Data

In [4]:
target = df_train[df_train.columns.values[-1]]
target_log = np.log(target)

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(target, bins=50)
plt.title('Original Data')
plt.xlabel('Sale Price')

plt.subplot(1,2,2)
sns.distplot(target_log, bins=50)
plt.title('Natural Log of Data')
plt.xlabel('Natural Log of Sale Price')
plt.tight_layout()
/opt/conda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

Merge Train and Test to evaluate ranges and missing values

This is done to make sure the categorical data in the training set and the test set are consistent.

In [5]:
df_train = df_train[df_train.columns.values[:-1]]
df = df_train.append(df_test, ignore_index = True)

Find all categorical data

In [6]:
cats = []
for col in df.columns.values:
    if df[col].dtype == 'object':
        cats.append(col)
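For reference, the same list of categorical columns can be built in one line with pandas (a sketch equivalent to the loop above):

cats = df.select_dtypes(include=['object']).columns.tolist()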

Create separate datasets for Continuous vs Categorical

Splitting the data into continuous and categorical variables makes it easier to handle.

In [7]:
df_cont = df.drop(cats, axis=1)
df_cat = df[cats]

Handle Missing Data for continuous data

  • If a variable has more than 50 missing values, drop it.
  • If a variable has 50 or fewer missing values, impute them with that variable's median.
  • Remove outliers using the Median Absolute Deviation.
  • Measure the skewness of each variable; if it exceeds 0.75, transform it (e.g., by taking the log or squaring).
  • Normalize each variable with sklearn's Normalizer.
In [8]:
for col in df_cont.columns.values:
    if np.sum(df_cont[col].isnull()) > 50:
        df_cont = df_cont.drop(col, axis = 1)
    elif np.sum(df_cont[col].isnull()) > 0:
        median = df_cont[col].median()
        idx = np.where(df_cont[col].isnull())[0]
        df_cont[col].iloc[idx] = median

        outliers = np.where(is_outlier(df_cont[col]))
        df_cont[col].iloc[outliers] = median
        
        if skew(df_cont[col]) > 0.75:
            df_cont[col] = np.log(df_cont[col])
            df_cont[col] = df_cont[col].apply(lambda x: 0 if x == -np.inf else x)
        
        df_cont[col] = Normalizer().fit_transform(df_cont[col].reshape(1,-1))[0]
/opt/conda/lib/python3.5/site-packages/pandas/core/indexing.py:132: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
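The SettingWithCopyWarning comes from the chained indexing pattern df_cont[col].iloc[idx] = median. A sketch of one way to avoid it inside the same loop is to assign through the DataFrame's .loc with a boolean mask:

# instead of: df_cont[col].iloc[idx] = median
mask = df_cont[col].isnull()
df_cont.loc[mask, col] = median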

Handle Missing Data for Categorical Data

  • If a variable has more than 50 missing values, drop it.
  • If it has 50 or fewer missing values, replace them with 'MIA'.
  • Encode the labels as integers with sklearn's LabelEncoder.
  • For each categorical variable, determine the number of unique values and create a new binary column for each category.
In [9]:
for col in df_cat.columns.values:
    if np.sum(df_cat[col].isnull()) > 50:
        df_cat = df_cat.drop(col, axis = 1)
        continue
    elif np.sum(df_cat[col].isnull()) > 0:
        df_cat[col] = df_cat[col].fillna('MIA')
        
    df_cat[col] = LabelEncoder().fit_transform(df_cat[col])
    
    num_cols = df_cat[col].max()
    for i in range(num_cols):
        col_name = col + '_' + str(i)
        df_cat[col_name] = df_cat[col].apply(lambda x: 1 if x == i else 0)
        
    df_cat = df_cat.drop(col, axis = 1)
/opt/conda/lib/python3.5/site-packages/ipykernel/__main__.py: SettingWithCopyWarning (raised three times):
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
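The label-encode-then-binarize loop above is essentially one-hot encoding. A shorter pandas sketch does much the same job (not byte-for-byte identical, since the loop above skips the highest label of each column; df_cat_onehot is just an illustrative name):

df_cat_onehot = pd.get_dummies(df_cat.fillna('MIA'))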

Merge Numeric and Categorical Datasets and Create Training and Testing Data

In [10]:
df_new = df_cont.join(df_cat)

df_train = df_new.iloc[:len(df_train)]   # all original training rows
df_train = df_train.join(target_log)

df_test = df_new.iloc[len(df_train):]    # the remaining rows are the test set

X_train = df_train[df_train.columns.values[1:-1]]
y_train = df_train[df_train.columns.values[-1]]

X_test = df_test[df_test.columns.values[1:]]
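A quick sanity check (not part of the original kernel) can confirm that the row counts line up after the merge and re-split and that no missing values slipped through:

print(X_train.shape, y_train.shape, X_test.shape)
print(X_train.isnull().sum().sum(), X_test.isnull().sum().sum())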

Create Estimator and Apply Cross Validation

Multi-fold cross validation lets us estimate the model's accuracy. Here, 15-fold cross validation is used and the RMSE is measured on each fold.

The RMSE ranges from roughly 0.11 to 0.17, with a mean of about 0.14. Since the model is fit on the natural log of SalePrice, an RMSE of 0.14 corresponds to a typical error of roughly 15% of the sale price (exp(0.14) ≈ 1.15).

In [11]:
from sklearn.metrics import make_scorer, mean_squared_error
scorer = make_scorer(mean_squared_error, False)

clf = RandomForestRegressor(n_estimators=500, n_jobs=-1)
cv_score = np.sqrt(-cross_val_score(estimator=clf, X=X_train, y=y_train, cv=15, scoring = scorer))

plt.figure(figsize=(10,5))
plt.bar(range(len(cv_score)), cv_score)
plt.title('Cross Validation Score')
plt.ylabel('RMSE')
plt.xlabel('Iteration')

plt.plot(range(len(cv_score) + 1), [cv_score.mean()] * (len(cv_score) + 1))
plt.tight_layout()
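On newer scikit-learn versions the custom scorer is not needed; a sketch of the equivalent call using the built-in scoring string (assuming cross_val_score is imported from sklearn.model_selection):

cv_score = np.sqrt(-cross_val_score(estimator=clf, X=X_train, y=y_train, cv=15,
                                    scoring='neg_mean_squared_error'))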

Evaluate Feature Significance

Examining feature importance is a relatively straightforward process:

  1. Extract the feature importance coefficients.
  2. Attach the corresponding feature name to each coefficient.
  3. Sort the features in descending order.

Given the preprocessing approach and the chosen model, the most significant features are:

  1. OverallQual
  2. GrLivArea
  3. TotalBsmtSF
  4. GarageArea
In [12]:
# Fit the model on the training data
clf.fit(X_train, y_train)

# Extract the feature importance coefficients, attach the feature names, and sort the values
coef = pd.Series(clf.feature_importances_, index = X_train.columns).sort_values(ascending=False)

plt.figure(figsize=(10, 5))
coef.head(25).plot(kind='bar')
plt.title('Feature Significance')
plt.tight_layout()

Visualize Predicted vs. Actual Sales Price

To visualize the predicted values against the actual values, the data is split into training and test sets. This step is easy to do with sklearn's train_test_split.

The model is trained on a random subset of the data (sampled without replacement), and its predictions are then plotted against the actual values.

In [13]:
from sklearn.cross_validation import train_test_split

X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train, y_train)
clf = RandomForestRegressor(n_estimators=500, n_jobs=-1)

clf.fit(X_train1, y_train1)
y_pred = clf.predict(X_test1)

plt.figure(figsize=(10, 5))
plt.scatter(y_test1, y_pred, s=20)
plt.title('Predicted vs. Actual')
plt.xlabel('Actual Sale Price')
plt.ylabel('Predicted Sale Price')

plt.plot([min(y_test1), max(y_test1)], [min(y_test1), max(y_test1)])
plt.tight_layout()
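To put a number next to the scatter plot, the hold-out RMSE can be computed directly (a sketch; the exact value varies with the random split, but it should land near the cross-validation scores above):

rmse = np.sqrt(mean_squared_error(y_test1, y_pred))
print('Hold-out RMSE: %.4f' % rmse)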
In [14]: