In this project, we are going to predict the price of a house from its 80 features. Essentially, we are solving the Kaggle competition “House Prices: Advanced Regression Techniques”.
Follow this “House Prices Prediction: Advanced Regression Techniques End to End Project” step by step to get 3 bonuses:
1. Raw dataset
2. Ready-to-use clean dataset for the ML project
3. Full project as a Jupyter Notebook file
House Prices: Advanced Regression Techniques¶
Goal of the Project¶
Predict the price of a house from its features. If you are buying or selling a house but don't know its fair price, supervised machine learning regression algorithms can predict it from the features of the target house.
Import essential libraries¶
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load Data Set¶
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print("Shape of train: ", train.shape)
print("Shape of test: ", test.shape)
train.head(10)
test.head(10)
# Concat train and test
df = pd.concat((train, test))
temp_df = df
print("Shape of df: ", df.shape)
df.head(6)
df.tail(6)
Exploratory Data Analysis (EDA)¶
# Show all columns and up to 85 rows in output
pd.set_option("display.max_columns", 2000)
pd.set_option("display.max_rows", 85)
df.head(6)
df.tail(6)
df.info()
df.describe()
df.select_dtypes(include=['int64', 'float64']).columns
df.select_dtypes(include=['object']).columns
# Set index as Id column
df = df.set_index("Id")
df.head(6)
# Show the null values using heatmap
plt.figure(figsize=(16,9))
sns.heatmap(df.isnull())
# Get the percentages of null value
null_percent = df.isnull().sum()/df.shape[0]*100
null_percent
col_for_drop = null_percent[null_percent > 20].keys() # columns with more than 20% null values will be dropped
# drop columns
df = df.drop(col_for_drop, axis='columns')
df.shape
# find the unique value count
for i in df.columns:
    print(i + "\t" + str(len(df[i].unique())))
# find unique values of each column
for i in df.columns:
    print("Unique value of:>>> {} ({})\n{}\n".format(i, len(df[i].unique()), df[i].unique()))
# Describe the target
train["SalePrice"].describe()
# Plot the distplot of target
plt.figure(figsize=(10,8))
bar = sns.distplot(train["SalePrice"])
bar.legend(["Skewness: {:.2f}".format(train['SalePrice'].skew())])
# correlation heatmap
plt.figure(figsize=(25,25))
ax = sns.heatmap(train.corr(), cmap = "coolwarm", annot=True, linewidth=2)
# to fix the bug "first and last row cut in half of heatmap plot"
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
# correlation heatmap of highly correlated features with SalePrice
hig_corr = train.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["SalePrice"]) >= 0.5]
hig_corr_features
plt.figure(figsize=(10,8))
ax = sns.heatmap(train[hig_corr_features].corr(), cmap = "coolwarm", annot=True, linewidth=3)
# to fix the bug "first and last row cut in half of heatmap plot"
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
# Plot regplot to get the nature of highly correlated data
plt.figure(figsize=(16,9))
for i in range(len(hig_corr_features)):
    if i <= 9:
        plt.subplot(3,4,i+1)
        plt.subplots_adjust(hspace = 0.5, wspace = 0.5)
        sns.regplot(data=train, x = hig_corr_features[i], y = 'SalePrice')
Handling Missing Values¶
missing_col = df.columns[df.isnull().any()]
missing_col
Handling missing values of Bsmt features¶
bsmt_col = ['BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1',
'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'TotalBsmtSF']
bsmt_feat = df[bsmt_col]
bsmt_feat
bsmt_feat.info()
bsmt_feat.isnull().sum()
bsmt_feat = bsmt_feat[bsmt_feat.isnull().any(axis=1)]
bsmt_feat
bsmt_feat_all_nan = bsmt_feat[(bsmt_feat.isnull() | bsmt_feat.isin([0])).all(1)]
bsmt_feat_all_nan
bsmt_feat_all_nan.shape
qual = list(df.loc[:, df.dtypes == 'object'].columns.values)
qual
# Filling the missing values in Bsmt features
for i in bsmt_col:
    if i in qual:
        bsmt_feat_all_nan[i] = bsmt_feat_all_nan[i].replace(np.nan, 'NA') # categorical: replace NaN with 'NA'
    else:
        bsmt_feat_all_nan[i] = bsmt_feat_all_nan[i].replace(np.nan, 0) # numeric: replace NaN with 0
bsmt_feat.update(bsmt_feat_all_nan) # update bsmt_feat df by bsmt_feat_all_nan
df.update(bsmt_feat_all_nan) # update df by bsmt_feat_all_nan
"""
>>> df = pd.DataFrame({'A': [1, 2, 3],
... 'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
... 'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
A B
0 1 4
1 2 5
2 3 6
"""
bsmt_feat = bsmt_feat[bsmt_feat.isin([np.nan]).any(axis=1)]
bsmt_feat
bsmt_feat.shape
print(df['BsmtFinSF2'].max())
print(df['BsmtFinSF2'].min())
pd.cut(range(0,1526), 5) # create 5 equal-width buckets over the 0-1525 range of BsmtFinSF2
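For reference, here is a minimal sketch (the variable name bins is hypothetical) of the five equal-width buckets pd.cut builds here; the second interval, (305.0, 610.0], is the bucket used just below to select comparable rows:
# Sketch: inspect the bin edges produced by pd.cut
bins = pd.cut(range(0, 1526), 5)
print(bins.categories)
# IntervalIndex: [(-1.525, 305.0], (305.0, 610.0], (610.0, 915.0], (915.0, 1220.0], (1220.0, 1525.0]]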
df_slice = df[(df['BsmtFinSF2'] >= 305) & (df['BsmtFinSF2'] <= 610)]
df_slice
bsmt_feat.at[333,'BsmtFinType2'] = df_slice['BsmtFinType2'].mode()[0] # replace the NaN in BsmtFinType2 with the mode of the (305.0, 610.0] bucket
bsmt_feat
bsmt_feat['BsmtExposure'] = bsmt_feat['BsmtExposure'].replace(np.nan, df[df['BsmtQual'] =='Gd']['BsmtExposure'].mode()[0])
bsmt_feat['BsmtCond'] = bsmt_feat['BsmtCond'].replace(np.nan, df['BsmtCond'].mode()[0])
bsmt_feat['BsmtQual'] = bsmt_feat['BsmtQual'].replace(np.nan, df['BsmtQual'].mode()[0])
df.update(bsmt_feat)
bsmt_feat.isnull().sum()
Handling missing values of Garage features¶
df.columns[df.isnull().any()]
garage_col = ['GarageArea', 'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt',]
garage_feat = df[garage_col]
garage_feat = garage_feat[garage_feat.isnull().any(axis=1)]
garage_feat
garage_feat.shape
garage_feat_all_nan = garage_feat[(garage_feat.isnull() | garage_feat.isin([0])).all(1)]
garage_feat_all_nan.shape
for i in garage_feat:
    if i in qual:
        garage_feat_all_nan[i] = garage_feat_all_nan[i].replace(np.nan, 'NA')
    else:
        garage_feat_all_nan[i] = garage_feat_all_nan[i].replace(np.nan, 0)
garage_feat.update(garage_feat_all_nan)
df.update(garage_feat_all_nan)
garage_feat = garage_feat[garage_feat.isnull().any(axis=1)]
garage_feat
for i in garage_col:
    garage_feat[i] = garage_feat[i].replace(np.nan, df[df['GarageType'] == 'Detchd'][i].mode()[0])
garage_feat.isnull().any()
df.update(garage_feat)
Handling missing values of the remaining features¶
df.columns[df.isnull().any()]
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
df['Exterior1st'] = df['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
df['Exterior2nd'] = df['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])
df['Functional'] = df['Functional'].fillna(df['Functional'].mode()[0])
df['KitchenQual'] = df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])
df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])
df['SaleType'] = df['SaleType'].fillna(df['SaleType'].mode()[0])
df['Utilities'] = df['Utilities'].fillna(df['Utilities'].mode()[0])
df['MasVnrType'] = df['MasVnrType'].fillna(df['MasVnrType'].mode()[0])
df.columns[df.isnull().any()]
df[df['MasVnrArea'].isnull() == True]['MasVnrType'].unique()
df.loc[(df['MasVnrType'] == 'None') & (df['MasVnrArea'].isnull() == True), 'MasVnrArea'] = 0
df.isnull().sum()/df.shape[0] * 100
Handling missing values of the LotFrontage feature¶
lotconfig = ['Corner', 'Inside', 'CulDSac', 'FR2', 'FR3']
for i in lotconfig:
    df['LotFrontage'] = np.where((df['LotFrontage'].isnull()) & (df['LotConfig'] == i), df[df['LotConfig'] == i]['LotFrontage'].mean(), df['LotFrontage'])
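The loop above fills LotFrontage with the mean of each LotConfig group. An equivalent and more idiomatic one-liner, shown as a sketch only (the notebook itself uses the loop), is a grouped transform:
# Sketch: per-group mean imputation with groupby + transform
df['LotFrontage'] = df['LotFrontage'].fillna(
    df.groupby('LotConfig')['LotFrontage'].transform('mean'))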
df.isnull().sum()
Feature Transformation¶
df.columns
# Convert columns that are categorical in nature but stored as int64 into str
feat_dtype_convert = ['MSSubClass', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
for i in feat_dtype_convert:
    df[i] = df[i].astype(str)
df['MoSold'].unique() # MoSold = month the house was sold
# Convert to month abbreviations
import calendar
df['MoSold'] = df['MoSold'].apply(lambda x: calendar.month_abbr[x])
df['MoSold'].unique()
quan = list(df.loc[:, df.dtypes != 'object'].columns.values)
quan
len(quan)
obj_feat = list(df.loc[:, df.dtypes == 'object'].columns.values)
obj_feat
Convert ordinal categorical features into ordered codes¶
from pandas.api.types import CategoricalDtype
df['BsmtCond'] = df['BsmtCond'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['BsmtCond'].unique()
df['BsmtExposure'] = df['BsmtExposure'].astype(CategoricalDtype(categories=['NA', 'No', 'Mn', 'Av', 'Gd'], ordered = True)).cat.codes # 'No' (no exposure) sits between 'NA' (no basement) and 'Mn'
df['BsmtExposure'].unique()
df['BsmtFinType1'] = df['BsmtFinType1'].astype(CategoricalDtype(categories=['NA', 'Unf', 'LwQ', 'Rec', 'BLQ','ALQ', 'GLQ'], ordered = True)).cat.codes
df['BsmtFinType2'] = df['BsmtFinType2'].astype(CategoricalDtype(categories=['NA', 'Unf', 'LwQ', 'Rec', 'BLQ','ALQ', 'GLQ'], ordered = True)).cat.codes
df['BsmtQual'] = df['BsmtQual'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['ExterQual'] = df['ExterQual'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['ExterCond'] = df['ExterCond'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['Functional'] = df['Functional'].astype(CategoricalDtype(categories=['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod','Min2','Min1', 'Typ'], ordered = True)).cat.codes
df['GarageCond'] = df['GarageCond'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['GarageQual'] = df['GarageQual'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['GarageFinish'] = df['GarageFinish'].astype(CategoricalDtype(categories=['NA', 'Unf', 'RFn', 'Fin'], ordered = True)).cat.codes
df['HeatingQC'] = df['HeatingQC'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['KitchenQual'] = df['KitchenQual'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['PavedDrive'] = df['PavedDrive'].astype(CategoricalDtype(categories=['N', 'P', 'Y'], ordered = True)).cat.codes
df['Utilities'] = df['Utilities'].astype(CategoricalDtype(categories=['ELO', 'NoSeWa', 'NoSewr', 'AllPub'], ordered = True)).cat.codes
df['Utilities'].unique()
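As a sanity check on what .cat.codes returns, here is a minimal sketch on a hypothetical toy series s: categories map to 0..n-1 in the given order, while any value outside the category list, including NaN, becomes -1. This is also why the earlier NaN-to-'NA' replacement matters: without it, missing basements would silently become -1 instead of the lowest rank 0.
# Toy example: ordered categories become integer codes; unknown values and NaN become -1
s = pd.Series(['TA', 'Gd', 'NA', np.nan, 'XX'])
codes = s.astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)).cat.codes
print(codes.tolist()) # [3, 4, 0, -1, -1]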
Show skewness of features with distplots¶
skewed_features = ['1stFlrSF',
'2ndFlrSF',
'3SsnPorch',
'BedroomAbvGr',
'BsmtFinSF1',
'BsmtFinSF2',
'BsmtFullBath',
'BsmtHalfBath',
'BsmtUnfSF',
'EnclosedPorch',
'Fireplaces',
'FullBath',
'GarageArea',
'GarageCars',
'GrLivArea',
'HalfBath',
'KitchenAbvGr',
'LotArea',
'LotFrontage',
'LowQualFinSF',
'MasVnrArea',
'MiscVal',
'OpenPorchSF',
'PoolArea',
'ScreenPorch',
'TotRmsAbvGrd',
'TotalBsmtSF',
'WoodDeckSF']
quan == skewed_features # sanity check: compare the numeric column list with the skewed feature list
plt.figure(figsize=(25,20))
for i in range(len(skewed_features)):
    if i <= 28:
        plt.subplot(7,4,i+1)
        plt.subplots_adjust(hspace = 0.5, wspace = 0.5)
        ax = sns.distplot(df[skewed_features[i]])
        ax.legend(["Skewness: {:.2f}".format(df[skewed_features[i]].skew())], fontsize = 'xx-large')
df_back = df # caution: a reference, not a copy; the log transform below also affects df_back (use df.copy() for a true snapshot)
# Decrease the skewness of the data
for i in skewed_features:
    df[i] = np.log(df[i] + 1)
plt.figure(figsize=(25,20))
for i in range(len(skewed_features)):
    if i <= 28:
        plt.subplot(7,4,i+1)
        plt.subplots_adjust(hspace = 0.5, wspace = 0.5)
        ax = sns.distplot(df[skewed_features[i]])
        ax.legend(["Skewness: {:.2f}".format(df[skewed_features[i]].skew())], fontsize = 'xx-large')
SalePrice = np.log(train['SalePrice'] + 1)
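The target is transformed with log(1 + x) to reduce its right skew, so model predictions live on the log scale and must be mapped back with the inverse transform at submission time (the notebook does this later with np.exp). A minimal round-trip sketch (the price variable is hypothetical):
# Sketch: log1p forward, exp(x) - 1 backward
price = np.array([120000.0, 250000.0])
logged = np.log(price + 1) # same as np.log1p(price)
restored = np.exp(logged) - 1 # exact inverse of log(1 + x)
print(restored) # [120000. 250000.]
Note that the submission code later uses np.exp(pred) without subtracting 1; for house prices the resulting one-dollar offset is negligible.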
# Get object features to convert into numeric using dummy variables
obj_feat = list(df.loc[:,df.dtypes == 'object'].columns.values)
len(obj_feat)
# Dummy variables: drop one dummy per categorical column to avoid the dummy-variable trap
dummy_drop = []
clean_df = df
for i in obj_feat:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]
df = pd.get_dummies(df, columns = obj_feat)
df = df.drop(dummy_drop, axis = 1)
df.shape
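Dropping one dummy per categorical column avoids the dummy-variable trap (perfect multicollinearity among the dummies). The loop above drops the dummy of each column's last-seen unique value; pandas' built-in drop_first=True achieves much the same by dropping the first category instead. A minimal sketch on a hypothetical toy frame:
# Sketch: drop_first=True keeps n-1 dummies per column
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
print(pd.get_dummies(toy, columns=['Street'], drop_first=True)) # only Street_Pave remains; Street_Grvl is dropped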
#sns.pairplot(df)
# scaling dataset with robust scaler
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(df)
df = scaler.transform(df)
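RobustScaler centers each column on its median and divides by the interquartile range (IQR = Q3 - Q1), so outliers pull the scaling far less than with mean/std standardization, which suits skewed housing data. A minimal numeric sketch (the array x is hypothetical):
# Sketch: RobustScaler computes (x - median) / IQR per column
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]]) # 100 is an outlier
rs = RobustScaler().fit(x)
print(rs.center_, rs.scale_) # [3.] [2.] -> median = 3, IQR = 4 - 2 = 2
print(rs.transform(x).ravel()) # [-1. -0.5 0. 0.5 48.5]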
Machine Learning Model Building¶
train_len = len(train)
X_train = df[:train_len]
X_test = df[train_len:]
y_train = SalePrice
print(X_train.shape)
print(X_test.shape)
print(len(y_train))
Cross Validation¶
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, r2_score
def test_model(model, X_train=X_train, y_train=y_train):
    cv = KFold(n_splits = 3, shuffle=True, random_state = 45)
    r2 = make_scorer(r2_score)
    r2_val_score = cross_val_score(model, X_train, y_train, cv=cv, scoring = r2)
    score = [r2_val_score.mean()]
    return score
Linear Regression¶
import sklearn.linear_model as linear_model
LR = linear_model.LinearRegression()
test_model(LR)
# Cross validation
cross_validation = cross_val_score(estimator = LR, X = X_train, y = y_train, cv = 10)
print("Cross validation accuracy of LR model = ", cross_validation)
print("\nCross validation mean accuracy of LR model = ", cross_validation.mean())
rdg = linear_model.Ridge()
test_model(rdg)
lasso = linear_model.Lasso(alpha=1e-4)
test_model(lasso)
Fitting Polynomial Regression to the dataset¶
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
lin_reg_2 = linear_model.LinearRegression()
lin_reg_2.fit(X_poly, y_train)
test_model(lin_reg_2, X_poly)
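Degree-2 polynomial features grow quadratically: for n input columns, PolynomialFeatures(degree=2) emits (n + 1)(n + 2) / 2 columns (bias, linear terms, squares, and pairwise interactions), which is why this model is slow and overfit-prone on a dataset this wide. A minimal sketch (the toy_X array is hypothetical):
# Sketch: n = 2 input features expand to (2 + 1)(2 + 2) / 2 = 6 polynomial features
toy_X = np.arange(6.0).reshape(3, 2) # 3 samples, 2 features
print(PolynomialFeatures(degree = 2).fit_transform(toy_X).shape) # (3, 6)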
Support Vector Machine¶
from sklearn.svm import SVR
svr_reg = SVR(kernel='rbf')
test_model(svr_reg)
Decision Tree Regressor¶
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(random_state=21)
test_model(dt_reg)
Random Forest Regressor¶
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators = 1000, random_state=51)
test_model(rf_reg)
Bagging & boosting¶
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
br_reg = BaggingRegressor(n_estimators=1000, random_state=51)
gbr_reg = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, loss='ls', random_state=51) # 'ls' = least squares; newer scikit-learn versions call this loss='squared_error'
test_model(br_reg)
test_model(gbr_reg)
XGBoost¶
import xgboost
#xgb_reg=xgboost.XGBRegressor()
xgb_reg = xgboost.XGBRegressor(booster='gbtree', random_state=51)
test_model(xgb_reg)
SVM Model Building¶
svr_reg.fit(X_train,y_train)
y_pred = np.exp(svr_reg.predict(X_test)).round(2)
y_pred
submit_test1 = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test1.columns=['Id', 'SalePrice']
submit_test1
submit_test1.to_csv('sample_submission.csv', index=False )
SVM Model Building: Hyperparameter Tuning¶
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
params = {'kernel': ['rbf'],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'C': [0.1, 1, 10, 100, 1000],
'epsilon': [1, 0.2, 0.1, 0.01, 0.001, 0.0001]}
rand_search = RandomizedSearchCV(svr_reg, param_distributions=params, n_jobs=-1, cv=11)
rand_search.fit(X_train, y_train)
rand_search.best_score_
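The SVR below is re-instantiated with values presumably read off rand_search.best_params_. An equivalent shortcut, sketched here only, is to reuse the best estimator the search has already refit on the full training data:
# Sketch: inspect and reuse the winning configuration directly
print(rand_search.best_params_) # the sampled hyperparameters that scored best
best_svr = rand_search.best_estimator_ # SVR already refit with those values
test_model(best_svr)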
svr_reg= SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.0001,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
test_model(svr_reg)
svr_reg.fit(X_train,y_train)
y_pred = np.exp(svr_reg.predict(X_test)).round(2)
y_pred
submit_test3 = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test3.columns=['Id', 'SalePrice']
submit_test3.to_csv('sample_submission.csv', index=False)
submit_test3
Kaggle submission result: sample_submission.csv, submitted 3 days ago, scored 0.12612.
XGBoost parameter tuning¶
xgb2_reg = xgboost.XGBRegressor()
params_xgb = {'max_depth': range(2, 20, 2),
              'n_estimators': range(99, 2001, 80),
              'learning_rate': [0.2, 0.1, 0.01, 0.05],
              'booster': ['gbtree'],
              'min_child_weight': range(1, 8, 1)}
rand_search_xgb = RandomizedSearchCV(estimator=xgb2_reg, param_distributions=params_xgb, n_iter=100, n_jobs=-1, cv=11, verbose=11, random_state=51, return_train_score=True, scoring='neg_mean_absolute_error')
rand_search_xgb.fit(X_train, y_train)
rand_search_xgb.best_score_
rand_search_xgb.best_params_
xgb2_reg = xgboost.XGBRegressor(n_estimators = 899,
                                min_child_weight = 2,
                                max_depth = 4,
                                learning_rate = 0.05,
                                booster = 'gbtree')
test_model(xgb2_reg)
xgb2_reg.fit(X_train,y_train)
y_pred_xgb_rs=xgb2_reg.predict(X_test)
np.exp(y_pred_xgb_rs).round(2)
y_pred_xgb_rs = np.exp(xgb2_reg.predict(X_test)).round(2)
xgb_rs_solution = pd.concat([test['Id'], pd.DataFrame(y_pred_xgb_rs)], axis=1)
xgb_rs_solution.columns=['Id', 'SalePrice']
xgb_rs_solution.to_csv('sample_submission.csv', index=False)
xgb_rs_solution
Kaggle submission result: scored 0.12484, an improvement over the previous score of 0.12612.
Feature Engineering / Selection to improve accuracy¶
# correlation Barplot
plt.figure(figsize=(9,16))
corr_feat_series = train.corrwith(train.SalePrice).sort_values()
sns.barplot(x=corr_feat_series, y=corr_feat_series.index, orient='h')
df_back1 = df_back
df_back1.to_csv('df_for_feature_engineering.csv', index=False)
list(corr_feat_series.index)
House Prices: Advanced Regression Techniques¶
Feature Selection / Engineering¶
Import Libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('df_for_feature_engineering.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
df
#df = df.set_index('Id')
Drop features¶
df = df.drop(['YrSold',
'LowQualFinSF',
'MiscVal',
'BsmtHalfBath',
'BsmtFinSF2',
'3SsnPorch',
'MoSold'],axis=1)
quan = list(df.loc[:,df.dtypes != 'object'].columns.values)
quan
skewd_feat = ['1stFlrSF',
'2ndFlrSF',
'BedroomAbvGr',
'BsmtFinSF1',
'BsmtFullBath',
'BsmtUnfSF',
'EnclosedPorch',
'Fireplaces',
'FullBath',
'GarageArea',
'GarageCars',
'GrLivArea',
'HalfBath',
'KitchenAbvGr',
'LotArea',
'LotFrontage',
'MasVnrArea',
'OpenPorchSF',
'PoolArea',
'ScreenPorch',
'TotRmsAbvGrd',
'TotalBsmtSF',
'WoodDeckSF']
# '3SsnPorch', 'BsmtFinSF2', 'BsmtHalfBath', 'LowQualFinSF', 'MiscVal'
# Decrease the skewness of the data
for i in skewd_feat:
    df[i] = np.log(df[i] + 1)
SalePrice = np.log(train['SalePrice'] + 1)
df
obj_feat = list(df.loc[:, df.dtypes == 'object'].columns.values)
print(len(obj_feat))
obj_feat
# Dummy variables: drop one dummy per categorical column to avoid the dummy-variable trap
dummy_drop = []
for i in obj_feat:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]
df = pd.get_dummies(df, columns = obj_feat)
df = df.drop(dummy_drop, axis = 1)
df.shape
# scaling dataset with robust scaler
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(df)
df = scaler.transform(df)
Model Building¶
train_len = len(train)
X_train = df[:train_len]
X_test = df[train_len:]
y_train = SalePrice
print("Shape of X_train: ", len(X_train))
print("Shape of X_test: ", len(X_test))
print("Shape of y_train: ", len(y_train))
Cross Validation¶
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, r2_score
def test_model(model, X_train=X_train, y_train=y_train):
    cv = KFold(n_splits = 3, shuffle=True, random_state = 45)
    r2 = make_scorer(r2_score)
    r2_val_score = cross_val_score(model, X_train, y_train, cv=cv, scoring = r2)
    score = [r2_val_score.mean()]
    return score
# First cross-validation run: df with the log transform; second run: without it
Linear Model¶
import sklearn.linear_model as linear_model
LR = linear_model.LinearRegression()
test_model(LR)
rdg = linear_model.Ridge()
test_model(rdg)
lasso = linear_model.Lasso(alpha=1e-4)
test_model(lasso)
Support vector machine¶
from sklearn.svm import SVR
svr = SVR(kernel='rbf')
test_model(svr)
SVM Hyperparameter Tuning¶
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
params = {'kernel': ['rbf'],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
'C': [0.1, 1, 10, 100, 1000],
'epsilon': [1, 0.2, 0.1, 0.01, 0.001, 0.0001]}
rand_search = RandomizedSearchCV(svr, param_distributions=params, n_jobs=-1, cv=11)
rand_search.fit(X_train, y_train)
rand_search.best_score_
rand_search.best_estimator_
svr_reg1=SVR(C=10, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.001,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
test_model(svr_reg1)
svr_reg= SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.0001,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
test_model(svr_reg)
XGBoost¶
import xgboost
#xgb_reg=xgboost.XGBRegressor()
xgb_reg = xgboost.XGBRegressor(booster='gbtree', random_state=51)
test_model(xgb_reg)
xgb2_reg = xgboost.XGBRegressor(n_estimators = 899,
                                min_child_weight = 2,
                                max_depth = 4,
                                learning_rate = 0.05,
                                booster = 'gbtree')
test_model(xgb2_reg)
Solution¶
xgb2_reg.fit(X_train,y_train)
y_pred = np.exp(xgb2_reg.predict(X_test)).round(2)
submit_test = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test.columns=['Id', 'SalePrice']
submit_test.to_csv('sample_submission.csv', index=False)
submit_test
"""
Rank: 1444
Red AI Productionnovice
tier
0.12278
5
now
Your Best Entry
Your submission scored 0.13481, which is not an improvement of your best score. Keep trying!"""
svr_reg.fit(X_train,y_train)
y_pred = np.exp(svr_reg.predict(X_test)).round(2)
submit_test = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test.columns=['Id', 'SalePrice']
submit_test.to_csv('sample_submission.csv', index=False)
submit_test
"""
file: sample_submission-v1-fs
rank: 1444
Red AI Productionnovice tier
0.12278
4
3m
Your Best Entry
You advanced 140 places on the leaderboard!
Your submission scored 0.12278, which is an improvement of your previous score of 0.12484. Great job!"""
Model Save¶
import pickle
pickle.dump(svr_reg, open('model_house_price_prediction.pkl', 'wb'))
model_house_price_prediction = pickle.load(open('model_house_price_prediction.pkl', 'rb'))
model_house_price_prediction.predict(X_test)
test_model(model_house_price_prediction)
SVM R² Score ≈ 90%¶
Machine Learning Model Building Never Ends Until the App Is Retired¶
Congratulations!
We have completed the machine learning project successfully, with about a 90% R² score, which is great for the ‘House Price Prediction: Advanced Regression Techniques’ project. Now we are ready to deploy our ML model in the real estate domain.
Click the button below to download the ‘House Price Prediction: Advanced Regression Techniques’ machine learning end-to-end project as a Jupyter Notebook file.
Download Project
Conclusion
To get the best accuracy, we trained all of the popular supervised regression algorithms, though you can try just a few of them. After training, we found that the SVR and XGBoost regressors gave higher accuracy than the rest, and we chose SVR.
As ML engineers, we always retrain a deployed model after some period of time to sustain its accuracy. We hope our efforts help buyers and sellers predict the price of a house.
Please share your feedback and any doubts regarding this ML project so we can keep it updated.
I hope you enjoyed this machine learning end-to-end project. Thank you! :-)
Click here to explore more machine learning end-to-end projects.
Thanks,
INDIAN AI PRODUCTION