Regularized Linear Models by Alexandru Papiu (With Python)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr # scipy.stats.stats is a deprecated import path
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
%matplotlib inline
train = pd.read_csv("../input/train.csv") # load the train data
test = pd.read_csv("../input/test.csv") # load the test data
train.head()
# Combine train and test into one frame, excluding the Id and SalePrice columns.
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
# DataFrame of price and log(price + 1).
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist() # histograms of the raw and log-transformed prices
# Log-transform SalePrice.
train["SalePrice"] = np.log1p(train["SalePrice"])
# Log-transform all skewed numeric features.
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index # index of the non-object (i.e. numeric) columns
# Measure the skewness of the numeric features after dropping NAs (computed on the train set only).
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75] # keep only features with skewness greater than 0.75
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats]) # log-transform only these selected features
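# A quick sanity check: skewness of the transformed columns should generally
# be much closer to zero than the 0.75+ values they started with.
all_data[skewed_feats].apply(lambda x: skew(x.dropna())).head()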
all_data = pd.get_dummies(all_data) # one-hot encode the categorical features in all_data
# Impute NAs with the mean of each column.
all_data = all_data.fillna(all_data.mean())
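# After get_dummies every column is numeric, so mean imputation applies to
# all of them and no missing values should remain:
assert all_data.isnull().sum().sum() == 0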
# Build the data matrices sklearn expects.
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score
def rmse_cv(model):
    # cross_val_score returns negative MSE under "neg_mean_squared_error",
    # so flip the sign before taking the square root
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring = "neg_mean_squared_error", cv = 5))
    return(rmse)
model_ridge = Ridge()
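# A minimal baseline before tuning: the CV RMSE of Ridge at sklearn's
# default alpha = 1.0.
rmse_cv(model_ridge).mean()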
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean()
            for alpha in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
cv_ridge.min()
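# It is also useful to read off which alpha achieved that minimum:
cv_ridge.idxmin()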
model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
rmse_cv(model_lasso).mean()
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
imp_coef = pd.concat([coef.sort_values().head(10),
                      coef.sort_values().tail(10)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
# Let's also take a look at the residuals.
matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)
preds = pd.DataFrame({"preds":model_lasso.predict(X_train), "true":y})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals", kind = "scatter")
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label = y)
dtest = xgb.DMatrix(X_test)
params = {"max_depth":2, "eta":0.1}
model = xgb.cv(params, dtrain, num_boost_round=500, early_stopping_rounds=100)
model.loc[30:,["test-rmse-mean", "train-rmse-mean"]].plot()
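# With early_stopping_rounds set, xgb.cv should return the evaluation history
# only up to the best round (an assumption about xgb.cv's truncation
# behavior), so the length of the result hints at a good n_estimators:
len(model)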
model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1) # the params were tuned using xgb.cv
model_xgb.fit(X_train, y)
xgb_preds = np.expm1(model_xgb.predict(X_test))
lasso_preds = np.expm1(model_lasso.predict(X_test))
predictions = pd.DataFrame({"xgb":xgb_preds, "lasso":lasso_preds})
predictions.plot(x = "xgb", y = "lasso", kind = "scatter")
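# The scatter suggests the two models agree broadly but not perfectly, which
# is what makes averaging them worthwhile; pearsonr (imported above) puts a
# number on that agreement:
pearsonr(xgb_preds, lasso_preds)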
preds = 0.7*lasso_preds + 0.3*xgb_preds
solution = pd.DataFrame({"id":test.Id, "SalePrice":preds})
solution.to_csv("ridge_sol.csv", index = False)
from keras.layers import Dense
from keras.models import Sequential
from keras.regularizers import l1
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train = StandardScaler().fit_transform(X_train)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y, random_state = 3)
X_tr.shape
X_tr
model = Sequential()
#model.add(Dense(256, activation="relu", input_dim = X_train.shape[1]))
model.add(Dense(1, input_dim = X_train.shape[1], kernel_regularizer=l1(0.001))) # Keras 2 renamed W_regularizer to kernel_regularizer
model.compile(loss = "mse", optimizer = "adam")
model.summary()
hist = model.fit(X_tr, y_tr, validation_data = (X_val, y_val)) # X_tr is already a numpy array after StandardScaler, so no .values
pd.Series(model.predict(X_val)[:,0]).hist()
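# A rough comparison against the linear models: RMSE on the held-out split,
# still on the log scale and so comparable to the CV scores above.
np.sqrt(np.mean((model.predict(X_val)[:, 0] - y_val) ** 2))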