House Price Prediction - Kaggle Competition - ML Project

ML Project: House Prices Prediction Advanced Regression Techniques | Kaggle Competition

In this project, we are going to predict the price of a house using its 80 features. Basically we are solving the Kaggle Competition.

Follow the “House Prices Prediction: Advanced Regression Techniques End to End Project” step by step to get 3 Bonus.
1. Raw Dataset
2. Ready to use Clean Dataset for ML project
3. Full Project in Jupyter Notebook File

 

 

House Prices: Advanced Regression Techniques

 

Goal of the Project

 

Predict the price of a house by its features. If you are a buyer or seller of the house but you don’t know the exact price of the house, so supervised machine learning regression algorithms can help you to predict the price of the house just providing features of the target house.

 

Import essential libraries

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 

Load Data Set

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print("Shape of train: ", train.shape)
print("Shape of test: ", test.shape)
 
Shape of train:  (1460, 81)
Shape of test:  (1459, 80)
In [3]:
train.head(10)
Out[3]:
  Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub 0 NaN MnPrv Shed 700 10 2009 WD Normal 143000
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub 0 NaN NaN NaN 0 8 2007 WD Normal 307000
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub 0 NaN NaN Shed 350 11 2009 WD Normal 200000
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub 0 NaN NaN NaN 0 4 2008 WD Abnorml 129900
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub 0 NaN NaN NaN 0 1 2008 WD Normal 118000

10 rows × 81 columns

In [4]:
test.head(10)
Out[4]:
  Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub 144 0 NaN NaN NaN 0 1 2010 WD Normal
5 1466 60 RL 75.0 10000 Pave NaN IR1 Lvl AllPub 0 0 NaN NaN NaN 0 4 2010 WD Normal
6 1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
7 1468 60 RL 63.0 8402 Pave NaN IR1 Lvl AllPub 0 0 NaN NaN NaN 0 5 2010 WD Normal
8 1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub 0 0 NaN NaN NaN 0 2 2010 WD Normal
9 1470 20 RL 70.0 8400 Pave NaN Reg Lvl AllPub 0 0 NaN MnPrv NaN 0 4 2010 WD Normal

10 rows × 80 columns

In [5]:
## concat train and test
df = pd.concat((train, test))
temp_df = df
print("Shape of df: ", df.shape)
 
Shape of df:  (2919, 81)
 
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  
In [6]:
df.head(6)
Out[6]:
  1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 856 854 0 NaN 3 1Fam TA No 706.0 0.0 WD 0 Pave 8 856.0 AllPub 0 2003 2003 2008
1 1262 0 0 NaN 3 1Fam TA Gd 978.0 0.0 WD 0 Pave 6 1262.0 AllPub 298 1976 1976 2007
2 920 866 0 NaN 3 1Fam TA Mn 486.0 0.0 WD 0 Pave 6 920.0 AllPub 0 2001 2002 2008
3 961 756 0 NaN 3 1Fam Gd No 216.0 0.0 WD 0 Pave 7 756.0 AllPub 0 1915 1970 2006
4 1145 1053 0 NaN 4 1Fam TA Av 655.0 0.0 WD 0 Pave 9 1145.0 AllPub 192 2000 2000 2008
5 796 566 320 NaN 1 1Fam TA No 732.0 0.0 WD 0 Pave 5 796.0 AllPub 40 1993 1995 2009

6 rows × 81 columns

In [7]:
df.tail(6)
Out[7]:
  1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
1453 546 546 0 NaN 3 Twnhs TA No 0.0 0.0 WD 0 Pave 5 546.0 AllPub 0 1970 1970 2006
1454 546 546 0 NaN 3 Twnhs TA No 0.0 0.0 WD 0 Pave 5 546.0 AllPub 0 1970 1970 2006
1455 546 546 0 NaN 3 TwnhsE TA No 252.0 0.0 WD 0 Pave 6 546.0 AllPub 0 1970 1970 2006
1456 1224 0 0 NaN 4 1Fam TA No 1224.0 0.0 WD 0 Pave 7 1224.0 AllPub 474 1960 1996 2006
1457 970 0 0 NaN 3 1Fam TA Av 337.0 0.0 WD 0 Pave 6 912.0 AllPub 80 1992 1992 2006
1458 996 1004 0 NaN 3 1Fam TA Av 758.0 0.0 WD 0 Pave 9 996.0 AllPub 190 1993 1994 2006

6 rows × 81 columns

 

Exploratory Data Analysis (EDA)

In [8]:
# To show the all columns
pd.set_option("display.max_columns", 2000)
pd.set_option("display.max_rows", 85)
In [9]:
df.head(6)
Out[9]:
  1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF CentralAir Condition1 Condition2 Electrical EnclosedPorch ExterCond ExterQual Exterior1st Exterior2nd Fence FireplaceQu Fireplaces Foundation FullBath Functional GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType GarageYrBlt GrLivArea HalfBath Heating HeatingQC HouseStyle Id KitchenAbvGr KitchenQual LandContour LandSlope LotArea LotConfig LotFrontage LotShape LowQualFinSF MSSubClass MSZoning MasVnrArea MasVnrType MiscFeature MiscVal MoSold Neighborhood OpenPorchSF OverallCond OverallQual PavedDrive PoolArea PoolQC RoofMatl RoofStyle SaleCondition SalePrice SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 856 854 0 NaN 3 1Fam TA No 706.0 0.0 GLQ Unf 1.0 0.0 Gd 150.0 Y Norm Norm SBrkr 0 TA Gd VinylSd VinylSd NaN NaN 0 PConc 2 Typ 548.0 2.0 TA RFn TA Attchd 2003.0 1710 1 GasA Ex 2Story 1 1 Gd Lvl Gtl 8450 Inside 65.0 Reg 0 60 RL 196.0 BrkFace NaN 0 2 CollgCr 61 5 7 Y 0 NaN CompShg Gable Normal 208500.0 WD 0 Pave 8 856.0 AllPub 0 2003 2003 2008
1 1262 0 0 NaN 3 1Fam TA Gd 978.0 0.0 ALQ Unf 0.0 1.0 Gd 284.0 Y Feedr Norm SBrkr 0 TA TA MetalSd MetalSd NaN TA 1 CBlock 2 Typ 460.0 2.0 TA RFn TA Attchd 1976.0 1262 0 GasA Ex 1Story 2 1 TA Lvl Gtl 9600 FR2 80.0 Reg 0 20 RL 0.0 None NaN 0 5 Veenker 0 8 6 Y 0 NaN CompShg Gable Normal 181500.0 WD 0 Pave 6 1262.0 AllPub 298 1976 1976 2007
2 920 866 0 NaN 3 1Fam TA Mn 486.0 0.0 GLQ Unf 1.0 0.0 Gd 434.0 Y Norm Norm SBrkr 0 TA Gd VinylSd VinylSd NaN TA 1 PConc 2 Typ 608.0 2.0 TA RFn TA Attchd 2001.0 1786 1 GasA Ex 2Story 3 1 Gd Lvl Gtl 11250 Inside 68.0 IR1 0 60 RL 162.0 BrkFace NaN 0 9 CollgCr 42 5 7 Y 0 NaN CompShg Gable Normal 223500.0 WD 0 Pave 6 920.0 AllPub 0 2001 2002 2008
3 961 756 0 NaN 3 1Fam Gd No 216.0 0.0 ALQ Unf 1.0 0.0 TA 540.0 Y Norm Norm SBrkr 272 TA TA Wd Sdng Wd Shng NaN Gd 1 BrkTil 1 Typ 642.0 3.0 TA Unf TA Detchd 1998.0 1717 0 GasA Gd 2Story 4 1 Gd Lvl Gtl 9550 Corner 60.0 IR1 0 70 RL 0.0 None NaN 0 2 Crawfor 35 5 7 Y 0 NaN CompShg Gable Abnorml 140000.0 WD 0 Pave 7 756.0 AllPub 0 1915 1970 2006
4 1145 1053 0 NaN 4 1Fam TA Av 655.0 0.0 GLQ Unf 1.0 0.0 Gd 490.0 Y Norm Norm SBrkr 0 TA Gd VinylSd VinylSd NaN TA 1 PConc 2 Typ 836.0 3.0 TA RFn TA Attchd 2000.0 2198 1 GasA Ex 2Story 5 1 Gd Lvl Gtl 14260 FR2 84.0 IR1 0 60 RL 350.0 BrkFace NaN 0 12 NoRidge 84 5 8 Y 0 NaN CompShg Gable Normal 250000.0 WD 0 Pave 9 1145.0 AllPub 192 2000 2000 2008
5 796 566 320 NaN 1 1Fam TA No 732.0 0.0 GLQ Unf 1.0 0.0 Gd 64.0 Y Norm Norm SBrkr 0 TA TA VinylSd VinylSd MnPrv NaN 0 Wood 1 Typ 480.0 2.0 TA Unf TA Attchd 1993.0 1362 1 GasA Ex 1.5Fin 6 1 TA Lvl Gtl 14115 Inside 85.0 IR1 0 50 RL 0.0 None Shed 700 10 Mitchel 30 5 5 Y 0 NaN CompShg Gable Normal 143000.0 WD 0 Pave 5 796.0 AllPub 40 1993 1995 2009
In [10]:
df.tail(6)
Out[10]:
  1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF CentralAir Condition1 Condition2 Electrical EnclosedPorch ExterCond ExterQual Exterior1st Exterior2nd Fence FireplaceQu Fireplaces Foundation FullBath Functional GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType GarageYrBlt GrLivArea HalfBath Heating HeatingQC HouseStyle Id KitchenAbvGr KitchenQual LandContour LandSlope LotArea LotConfig LotFrontage LotShape LowQualFinSF MSSubClass MSZoning MasVnrArea MasVnrType MiscFeature MiscVal MoSold Neighborhood OpenPorchSF OverallCond OverallQual PavedDrive PoolArea PoolQC RoofMatl RoofStyle SaleCondition SalePrice SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
1453 546 546 0 NaN 3 Twnhs TA No 0.0 0.0 Unf Unf 0.0 0.0 TA 546.0 Y Norm Norm SBrkr 0 TA TA CemntBd CmentBd GdPrv NaN 0 CBlock 1 Typ 0.0 0.0 NaN NaN NaN NaN NaN 1092 1 GasA TA 2Story 2914 1 TA Lvl Gtl 1526 Inside 21.0 Reg 0 160 RM 0.0 None NaN 0 6 MeadowV 34 5 4 Y 0 NaN CompShg Gable Normal NaN WD 0 Pave 5 546.0 AllPub 0 1970 1970 2006
1454 546 546 0 NaN 3 Twnhs TA No 0.0 0.0 Unf Unf 0.0 0.0 TA 546.0 Y Norm Norm SBrkr 0 TA TA CemntBd CmentBd NaN NaN 0 CBlock 1 Typ 0.0 0.0 NaN NaN NaN NaN NaN 1092 1 GasA Gd 2Story 2915 1 TA Lvl Gtl 1936 Inside 21.0 Reg 0 160 RM 0.0 None NaN 0 6 MeadowV 0 7 4 Y 0 NaN CompShg Gable Normal NaN WD 0 Pave 5 546.0 AllPub 0 1970 1970 2006
1455 546 546 0 NaN 3 TwnhsE TA No 252.0 0.0 Rec Unf 0.0 0.0 TA 294.0 Y Norm Norm SBrkr 0 TA TA CemntBd CmentBd NaN NaN 0 CBlock 1 Typ 286.0 1.0 TA Unf TA CarPort 1970.0 1092 1 GasA TA 2Story 2916 1 TA Lvl Gtl 1894 Inside 21.0 Reg 0 160 RM 0.0 None NaN 0 4 MeadowV 24 5 4 Y 0 NaN CompShg Gable Abnorml NaN WD 0 Pave 6 546.0 AllPub 0 1970 1970 2006
1456 1224 0 0 NaN 4 1Fam TA No 1224.0 0.0 ALQ Unf 1.0 0.0 TA 0.0 Y Norm Norm SBrkr 0 TA TA VinylSd VinylSd NaN TA 1 CBlock 1 Typ 576.0 2.0 TA Unf TA Detchd 1960.0 1224 0 GasA Ex 1Story 2917 1 TA Lvl Gtl 20000 Inside 160.0 Reg 0 20 RL 0.0 None NaN 0 9 Mitchel 0 7 5 Y 0 NaN CompShg Gable Abnorml NaN WD 0 Pave 7 1224.0 AllPub 474 1960 1996 2006
1457 970 0 0 NaN 3 1Fam TA Av 337.0 0.0 GLQ Unf 0.0 1.0 Gd 575.0 Y Norm Norm SBrkr 0 TA TA HdBoard Wd Shng MnPrv NaN 0 PConc 1 Typ 0.0 0.0 NaN NaN NaN NaN NaN 970 0 GasA TA SFoyer 2918 1 TA Lvl Gtl 10441 Inside 62.0 Reg 0 85 RL 0.0 None Shed 700 7 Mitchel 32 5 5 Y 0 NaN CompShg Gable Normal NaN WD 0 Pave 6 912.0 AllPub 80 1992 1992 2006
1458 996 1004 0 NaN 3 1Fam TA Av 758.0 0.0 LwQ Unf 0.0 0.0 Gd 238.0 Y Norm Norm SBrkr 0 TA TA HdBoard HdBoard NaN TA 1 PConc 2 Typ 650.0 3.0 TA Fin TA Attchd 1993.0 2000 1 GasA Ex 2Story 2919 1 TA Lvl Mod 9627 Inside 74.0 Reg 0 60 RL 94.0 BrkFace NaN 0 11 Mitchel 48 5 7 Y 0 NaN CompShg Gable Normal NaN WD 0 Pave 9 996.0 AllPub 190 1993 1994 2006
In [11]:
df.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 81 columns):
1stFlrSF         2919 non-null int64
2ndFlrSF         2919 non-null int64
3SsnPorch        2919 non-null int64
Alley            198 non-null object
BedroomAbvGr     2919 non-null int64
BldgType         2919 non-null object
BsmtCond         2837 non-null object
BsmtExposure     2837 non-null object
BsmtFinSF1       2918 non-null float64
BsmtFinSF2       2918 non-null float64
BsmtFinType1     2840 non-null object
BsmtFinType2     2839 non-null object
BsmtFullBath     2917 non-null float64
BsmtHalfBath     2917 non-null float64
BsmtQual         2838 non-null object
BsmtUnfSF        2918 non-null float64
CentralAir       2919 non-null object
Condition1       2919 non-null object
Condition2       2919 non-null object
Electrical       2918 non-null object
EnclosedPorch    2919 non-null int64
ExterCond        2919 non-null object
ExterQual        2919 non-null object
Exterior1st      2918 non-null object
Exterior2nd      2918 non-null object
Fence            571 non-null object
FireplaceQu      1499 non-null object
Fireplaces       2919 non-null int64
Foundation       2919 non-null object
FullBath         2919 non-null int64
Functional       2917 non-null object
GarageArea       2918 non-null float64
GarageCars       2918 non-null float64
GarageCond       2760 non-null object
GarageFinish     2760 non-null object
GarageQual       2760 non-null object
GarageType       2762 non-null object
GarageYrBlt      2760 non-null float64
GrLivArea        2919 non-null int64
HalfBath         2919 non-null int64
Heating          2919 non-null object
HeatingQC        2919 non-null object
HouseStyle       2919 non-null object
Id               2919 non-null int64
KitchenAbvGr     2919 non-null int64
KitchenQual      2918 non-null object
LandContour      2919 non-null object
LandSlope        2919 non-null object
LotArea          2919 non-null int64
LotConfig        2919 non-null object
LotFrontage      2433 non-null float64
LotShape         2919 non-null object
LowQualFinSF     2919 non-null int64
MSSubClass       2919 non-null int64
MSZoning         2915 non-null object
MasVnrArea       2896 non-null float64
MasVnrType       2895 non-null object
MiscFeature      105 non-null object
MiscVal          2919 non-null int64
MoSold           2919 non-null int64
Neighborhood     2919 non-null object
OpenPorchSF      2919 non-null int64
OverallCond      2919 non-null int64
OverallQual      2919 non-null int64
PavedDrive       2919 non-null object
PoolArea         2919 non-null int64
PoolQC           10 non-null object
RoofMatl         2919 non-null object
RoofStyle        2919 non-null object
SaleCondition    2919 non-null object
SalePrice        1460 non-null float64
SaleType         2918 non-null object
ScreenPorch      2919 non-null int64
Street           2919 non-null object
TotRmsAbvGrd     2919 non-null int64
TotalBsmtSF      2918 non-null float64
Utilities        2917 non-null object
WoodDeckSF       2919 non-null int64
YearBuilt        2919 non-null int64
YearRemodAdd     2919 non-null int64
YrSold           2919 non-null int64
dtypes: float64(12), int64(26), object(43)
memory usage: 1.8+ MB
In [12]:
df.describe()
Out[12]:
  1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BsmtFinSF1 BsmtFinSF2 BsmtFullBath BsmtHalfBath BsmtUnfSF EnclosedPorch Fireplaces FullBath GarageArea GarageCars GarageYrBlt GrLivArea HalfBath Id KitchenAbvGr LotArea LotFrontage LowQualFinSF MSSubClass MasVnrArea MiscVal MoSold OpenPorchSF OverallCond OverallQual PoolArea SalePrice ScreenPorch TotRmsAbvGrd TotalBsmtSF WoodDeckSF YearBuilt YearRemodAdd YrSold
count 2919.000000 2919.000000 2919.000000 2919.000000 2918.000000 2918.000000 2917.000000 2917.000000 2918.000000 2919.000000 2919.000000 2919.000000 2918.000000 2918.000000 2760.000000 2919.000000 2919.000000 2919.000000 2919.000000 2919.000000 2433.000000 2919.000000 2919.000000 2896.000000 2919.000000 2919.000000 2919.000000 2919.000000 2919.000000 2919.000000 1460.000000 2919.000000 2919.000000 2918.000000 2919.000000 2919.000000 2919.000000 2919.000000
mean 1159.581706 336.483727 2.602261 2.860226 441.423235 49.582248 0.429894 0.061364 560.772104 23.098321 0.597122 1.568003 472.874572 1.766621 1978.113406 1500.759849 0.380267 1460.000000 1.044536 10168.114080 69.305795 4.694416 57.137718 102.201312 50.825968 6.213087 47.486811 5.564577 6.089072 2.251799 180921.195890 16.062350 6.451524 1051.777587 93.709832 1971.312778 1984.264474 2007.792737
std 392.362079 428.701456 25.188169 0.822693 455.610826 169.205611 0.524736 0.245687 439.543659 64.244246 0.646129 0.552969 215.394815 0.761624 25.574285 506.051045 0.502872 842.787043 0.214462 7886.996359 23.344905 46.396825 42.517628 179.334253 567.402211 2.714762 67.575493 1.113131 1.409947 35.663946 79442.502883 56.184365 1.569379 440.766258 126.526589 30.291442 20.894344 1.314964
min 334.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1895.000000 334.000000 0.000000 1.000000 0.000000 1300.000000 21.000000 0.000000 20.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 0.000000 34900.000000 0.000000 2.000000 0.000000 0.000000 1872.000000 1950.000000 2006.000000
25% 876.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 220.000000 0.000000 0.000000 1.000000 320.000000 1.000000 1960.000000 1126.000000 0.000000 730.500000 1.000000 7478.000000 59.000000 0.000000 20.000000 0.000000 0.000000 4.000000 0.000000 5.000000 5.000000 0.000000 129975.000000 0.000000 5.000000 793.000000 0.000000 1953.500000 1965.000000 2007.000000
50% 1082.000000 0.000000 0.000000 3.000000 368.500000 0.000000 0.000000 0.000000 467.000000 0.000000 1.000000 2.000000 480.000000 2.000000 1979.000000 1444.000000 0.000000 1460.000000 1.000000 9453.000000 68.000000 0.000000 50.000000 0.000000 0.000000 6.000000 26.000000 5.000000 6.000000 0.000000 163000.000000 0.000000 6.000000 989.500000 0.000000 1973.000000 1993.000000 2008.000000
75% 1387.500000 704.000000 0.000000 3.000000 733.000000 0.000000 1.000000 0.000000 805.500000 0.000000 1.000000 2.000000 576.000000 2.000000 2002.000000 1743.500000 1.000000 2189.500000 1.000000 11570.000000 80.000000 0.000000 70.000000 164.000000 0.000000 8.000000 70.000000 6.000000 7.000000 0.000000 214000.000000 0.000000 7.000000 1302.000000 168.000000 2001.000000 2004.000000 2009.000000
max 5095.000000 2065.000000 508.000000 8.000000 5644.000000 1526.000000 3.000000 2.000000 2336.000000 1012.000000 4.000000 4.000000 1488.000000 5.000000 2207.000000 5642.000000 2.000000 2919.000000 3.000000 215245.000000 313.000000 1064.000000 190.000000 1600.000000 17000.000000 12.000000 742.000000 9.000000 10.000000 800.000000 755000.000000 576.000000 15.000000 6110.000000 1424.000000 2010.000000 2010.000000 2010.000000
In [13]:
df.select_dtypes(include=['int64', 'float64']).columns
Out[13]:
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BedroomAbvGr', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF',
       'EnclosedPorch', 'Fireplaces', 'FullBath', 'GarageArea', 'GarageCars',
       'GarageYrBlt', 'GrLivArea', 'HalfBath', 'Id', 'KitchenAbvGr', 'LotArea',
       'LotFrontage', 'LowQualFinSF', 'MSSubClass', 'MasVnrArea', 'MiscVal',
       'MoSold', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PoolArea',
       'SalePrice', 'ScreenPorch', 'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF',
       'YearBuilt', 'YearRemodAdd', 'YrSold'],
      dtype='object')
In [14]:
df.select_dtypes(include=['object']).columns
Out[14]:
Index(['Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd',
       'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond',
       'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC',
       'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig',
       'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood',
       'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition',
       'SaleType', 'Street', 'Utilities'],
      dtype='object')
In [15]:
# Set index as Id column
df = df.set_index("Id")
In [16]:
df.head(6)
Out[16]:
  1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF CentralAir Condition1 Condition2 Electrical EnclosedPorch ExterCond ExterQual Exterior1st Exterior2nd Fence FireplaceQu Fireplaces Foundation FullBath Functional GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType GarageYrBlt GrLivArea HalfBath Heating HeatingQC HouseStyle KitchenAbvGr KitchenQual LandContour LandSlope LotArea LotConfig LotFrontage LotShape LowQualFinSF MSSubClass MSZoning MasVnrArea MasVnrType MiscFeature MiscVal MoSold Neighborhood OpenPorchSF OverallCond OverallQual PavedDrive PoolArea PoolQC RoofMatl RoofStyle SaleCondition SalePrice SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
Id                                                                                                                                                                
1 856 854 0 NaN 3 1Fam TA No 706.0 0.0 GLQ Unf 1.0 0.0 Gd 150.0 Y Norm Norm SBrkr 0 TA Gd VinylSd VinylSd NaN NaN 0 PConc 2 Typ 548.0 2.0 TA RFn TA Attchd 2003.0 1710 1 GasA Ex 2Story 1 Gd Lvl Gtl 8450 Inside 65.0 Reg 0 60 RL 196.0 BrkFace NaN 0 2 CollgCr 61 5 7 Y 0 NaN CompShg Gable Normal 208500.0 WD 0 Pave 8 856.0 AllPub 0 2003 2003 2008
2 1262 0 0 NaN 3 1Fam TA Gd 978.0 0.0 ALQ Unf 0.0 1.0 Gd 284.0 Y Feedr Norm SBrkr 0 TA TA MetalSd MetalSd NaN TA 1 CBlock 2 Typ 460.0 2.0 TA RFn TA Attchd 1976.0 1262 0 GasA Ex 1Story 1 TA Lvl Gtl 9600 FR2 80.0 Reg 0 20 RL 0.0 None NaN 0 5 Veenker 0 8 6 Y 0 NaN CompShg Gable Normal 181500.0 WD 0 Pave 6 1262.0 AllPub 298 1976 1976 2007
3 920 866 0 NaN 3 1Fam TA Mn 486.0 0.0 GLQ Unf 1.0 0.0 Gd 434.0 Y Norm Norm SBrkr 0 TA Gd VinylSd VinylSd NaN TA 1 PConc 2 Typ 608.0 2.0 TA RFn TA Attchd 2001.0 1786 1 GasA Ex 2Story 1 Gd Lvl Gtl 11250 Inside 68.0 IR1 0 60 RL 162.0 BrkFace NaN 0 9 CollgCr 42 5 7 Y 0 NaN CompShg Gable Normal 223500.0 WD 0 Pave 6 920.0 AllPub 0 2001 2002 2008
4 961 756 0 NaN 3 1Fam Gd No 216.0 0.0 ALQ Unf 1.0 0.0 TA 540.0 Y Norm Norm SBrkr 272 TA TA Wd Sdng Wd Shng NaN Gd 1 BrkTil 1 Typ 642.0 3.0 TA Unf TA Detchd 1998.0 1717 0 GasA Gd 2Story 1 Gd Lvl Gtl 9550 Corner 60.0 IR1 0 70 RL 0.0 None NaN 0 2 Crawfor 35 5 7 Y 0 NaN CompShg Gable Abnorml 140000.0 WD 0 Pave 7 756.0 AllPub 0 1915 1970 2006
5 1145 1053 0 NaN 4 1Fam TA Av 655.0 0.0 GLQ Unf 1.0 0.0 Gd 490.0 Y Norm Norm SBrkr 0 TA Gd VinylSd VinylSd NaN TA 1 PConc 2 Typ 836.0 3.0 TA RFn TA Attchd 2000.0 2198 1 GasA Ex 2Story 1 Gd Lvl Gtl 14260 FR2 84.0 IR1 0 60 RL 350.0 BrkFace NaN 0 12 NoRidge 84 5 8 Y 0 NaN CompShg Gable Normal 250000.0 WD 0 Pave 9 1145.0 AllPub 192 2000 2000 2008
6 796 566 320 NaN 1 1Fam TA No 732.0 0.0 GLQ Unf 1.0 0.0 Gd 64.0 Y Norm Norm SBrkr 0 TA TA VinylSd VinylSd MnPrv NaN 0 Wood 1 Typ 480.0 2.0 TA Unf TA Attchd 1993.0 1362 1 GasA Ex 1.5Fin 1 TA Lvl Gtl 14115 Inside 85.0 IR1 0 50 RL 0.0 None Shed 700 10 Mitchel 30 5 5 Y 0 NaN CompShg Gable Normal 143000.0 WD 0 Pave 5 796.0 AllPub 40 1993 1995 2009
In [17]:
# Show the null values using heatmap
plt.figure(figsize=(16,9))
sns.heatmap(df.isnull())
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bb7ac45908>
 
In [18]:
# Get the percentages of null value
null_percent = df.isnull().sum()/df.shape[0]*100
null_percent
Out[18]:
1stFlrSF          0.000000
2ndFlrSF          0.000000
3SsnPorch         0.000000
Alley            93.216855
BedroomAbvGr      0.000000
BldgType          0.000000
BsmtCond          2.809181
BsmtExposure      2.809181
BsmtFinSF1        0.034258
BsmtFinSF2        0.034258
BsmtFinType1      2.706406
BsmtFinType2      2.740665
BsmtFullBath      0.068517
BsmtHalfBath      0.068517
BsmtQual          2.774923
BsmtUnfSF         0.034258
CentralAir        0.000000
Condition1        0.000000
Condition2        0.000000
Electrical        0.034258
EnclosedPorch     0.000000
ExterCond         0.000000
ExterQual         0.000000
Exterior1st       0.034258
Exterior2nd       0.034258
Fence            80.438506
FireplaceQu      48.646797
Fireplaces        0.000000
Foundation        0.000000
FullBath          0.000000
Functional        0.068517
GarageArea        0.034258
GarageCars        0.034258
GarageCond        5.447071
GarageFinish      5.447071
GarageQual        5.447071
GarageType        5.378554
GarageYrBlt       5.447071
GrLivArea         0.000000
HalfBath          0.000000
Heating           0.000000
HeatingQC         0.000000
HouseStyle        0.000000
KitchenAbvGr      0.000000
KitchenQual       0.034258
LandContour       0.000000
LandSlope         0.000000
LotArea           0.000000
LotConfig         0.000000
LotFrontage      16.649538
LotShape          0.000000
LowQualFinSF      0.000000
MSSubClass        0.000000
MSZoning          0.137033
MasVnrArea        0.787941
MasVnrType        0.822199
MiscFeature      96.402878
MiscVal           0.000000
MoSold            0.000000
Neighborhood      0.000000
OpenPorchSF       0.000000
OverallCond       0.000000
OverallQual       0.000000
PavedDrive        0.000000
PoolArea          0.000000
PoolQC           99.657417
RoofMatl          0.000000
RoofStyle         0.000000
SaleCondition     0.000000
SalePrice        49.982871
SaleType          0.034258
ScreenPorch       0.000000
Street            0.000000
TotRmsAbvGrd      0.000000
TotalBsmtSF       0.034258
Utilities         0.068517
WoodDeckSF        0.000000
YearBuilt         0.000000
YearRemodAdd      0.000000
YrSold            0.000000
dtype: float64
In [19]:
col_for_drop = null_percent[null_percent > 20].keys() # if the null value % 20 or > 20 so need to drop it
In [20]:
# drop columns
df = df.drop(col_for_drop, "columns")
df.shape
Out[20]:
(2919, 74)
In [21]:
# find the unique value count
for i in df.columns:
    print(i + "\t" + str(len(df[i].unique())))
 
1stFlrSF	1083
2ndFlrSF	635
3SsnPorch	31
BedroomAbvGr	8
BldgType	5
BsmtCond	5
BsmtExposure	5
BsmtFinSF1	992
BsmtFinSF2	273
BsmtFinType1	7
BsmtFinType2	7
BsmtFullBath	5
BsmtHalfBath	4
BsmtQual	5
BsmtUnfSF	1136
CentralAir	2
Condition1	9
Condition2	8
Electrical	6
EnclosedPorch	183
ExterCond	5
ExterQual	4
Exterior1st	16
Exterior2nd	17
Fireplaces	5
Foundation	6
FullBath	5
Functional	8
GarageArea	604
GarageCars	7
GarageCond	6
GarageFinish	4
GarageQual	6
GarageType	7
GarageYrBlt	104
GrLivArea	1292
HalfBath	3
Heating	6
HeatingQC	5
HouseStyle	8
KitchenAbvGr	4
KitchenQual	5
LandContour	4
LandSlope	3
LotArea	1951
LotConfig	5
LotFrontage	129
LotShape	4
LowQualFinSF	36
MSSubClass	16
MSZoning	6
MasVnrArea	445
MasVnrType	5
MiscVal	38
MoSold	12
Neighborhood	25
OpenPorchSF	252
OverallCond	9
OverallQual	10
PavedDrive	3
PoolArea	14
RoofMatl	8
RoofStyle	6
SaleCondition	6
SaleType	10
ScreenPorch	121
Street	2
TotRmsAbvGrd	14
TotalBsmtSF	1059
Utilities	3
WoodDeckSF	379
YearBuilt	118
YearRemodAdd	61
YrSold	5
In [22]:
# find unique values of each column
for i in df.columns:
    print("Unique value of:>>> {} ({})\n{}\n".format(i, len(df[i].unique()), df[i].unique()))
 
Unique value of:>>> 1stFlrSF (1083)
[ 856 1262  920 ... 1778 1650 1960]

Unique value of:>>> 2ndFlrSF (635)
[ 854    0  866  756 1053  566  983  752 1142 1218  668 1320  631  716
  676  860 1519  530  808  977 1330  833  765  462  213  548  960  670
 1116  876  612 1031  881  790  755  592  939  520  639  656 1414  884
  729 1523  728  351  688  941 1032  848  836  475  739 1151  448  896
  524 1194  956 1070 1096  467  547  551  880  703  901  720  316 1518
  704 1178  754  601 1360  929  445  564  882  920  518  817 1257  741
  672 1306  504 1304 1100  730  689  591  888 1020  828  700  842 1286
  864  829 1092  709  844 1106  596  807  625  649  698  840  780  568
  795  648  975  702 1242 1818 1121  371  804  325  809 1200  871 1274
 1347 1332 1177 1080  695  167  915  576  605  862  495  403  838  517
 1427  784  711  468 1081  886  793  665  858  874  526  590  406 1157
  299  936  438 1098  766 1101 1028 1017 1254  378 1160  682  110  600
  678  834  384  512  930  868  224 1103  560  811  878  574  910  620
  687  546  902 1000  846 1067  914  660 1538 1015 1237  611  707  527
 1288  832  806 1182 1040  439  717  511 1129 1370  636  533  745  584
  812  684  595  988  800  677  573 1066  778  661 1440  872  788  843
  713  567  651  762  482  738  586  679  644  900  887 1872 1281  472
 1312  319  978 1093  473  664 1540 1276  441  348 1060  714  744 1203
  783 1097  734  767 1589  742  686 1128 1111 1174  787 1072 1088 1063
  545  966  623  432  581  540  769 1051  761  779  514  455 1426  785
  521  252  813 1120 1037 1169 1001 1215  928 1140 1243  571 1196 1038
  561  979  701  332  368  883 1336 1141  634  912  798  985  826  831
  750  456  602  855  336  408  980  998 1168 1208  797  850  898 1054
  895  954  772 1230  727  454  370  628  304  582 1122 1134  885  640
  580 1112  653  220  240 1362  534  539  650  918  933  712 1796  971
 1175  743  523 1216 2065  272  685  776  630  984  875  913  464 1039
 1259  940  892  725  924  764  925 1479  192  589  992  903  430  748
  587  994  950 1323  732 1357  557 1296  390 1185  873 1611  457  796
  908  550  989  932  358 1392  349  691 1349  768  208  622  857  556
 1044  708  626  904  510 1104  830  981  870  694 1152  563  823  604
  715  532  537  505  424  606  185  498  492  608 1074  662  499  180
  942  558  614  328 1788 1075  380  615  645  663 1275  816  839 1325
 1012 1295  683 1126 1089 1221  967  841 1209  897  786 1629  782 1369
  972 1315  726  322  760  629  496  690  646  917  624  320  588  425
  747 1114 1619  718  815  926  444  436 1240  516 1420 1158 1162 1139
 1285 1061 1250  919  861  794  825  893 1319  959  792 1345  453  412
  182  501  375  680  658  552  396  308  973  363  594  554  428  536
  486 1721 1099  735  899 1198  343  673  442  890  943  330  420  770
 1342 1377  845 1402 1036  570 1238  923  757 1048 1131 1407 1171 1277
  995  528  863 1232  976 1008 1309  228  500  544 1778  616  494  642
  659  671  144  525  423 1164  356  245 1042  477 1005 1087  638  400
  376  916  927  869  753  450 1133  674  125  531  585  775  851  957
 1340  955  990 1384 1862 1371 1405 1358  465  466 1335  814  488 1321
 1029 1368 1567 1189 1234 1248  821 1007  476  502  867  297  810  434
  583  341 1836  541 1246 1124 1045  827 1150  312  218  493  736  818
  610  549  697  360 1004]

Unique value of:>>> 3SsnPorch (31)
[  0 320 407 130 180 168 140 508 238 245 196 144 182 162  23 216  96 153
 290 304 224 255 225 360 150 174 120 219 176  86 323]

Unique value of:>>> BedroomAbvGr (8)
[3 4 1 2 0 5 6 8]

Unique value of:>>> BldgType (5)
['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']

Unique value of:>>> BsmtCond (5)
['TA' 'Gd' nan 'Fa' 'Po']

Unique value of:>>> BsmtExposure (5)
['No' 'Gd' 'Mn' 'Av' nan]

Unique value of:>>> BsmtFinSF1 (992)
[7.060e+02 9.780e+02 4.860e+02 2.160e+02 6.550e+02 7.320e+02 1.369e+03
 8.590e+02 0.000e+00 8.510e+02 9.060e+02 9.980e+02 7.370e+02 7.330e+02
 5.780e+02 6.460e+02 5.040e+02 8.400e+02 1.880e+02 2.340e+02 1.218e+03
 1.277e+03 1.018e+03 1.153e+03 1.213e+03 7.310e+02 6.430e+02 9.670e+02
 7.470e+02 2.800e+02 1.790e+02 4.560e+02 1.351e+03 2.400e+01 7.630e+02
 1.820e+02 1.040e+02 1.810e+03 3.840e+02 4.900e+02 6.490e+02 6.320e+02
 9.410e+02 7.390e+02 9.120e+02 1.013e+03 6.030e+02 1.880e+03 5.650e+02
 3.200e+02 4.620e+02 2.280e+02 3.360e+02 4.480e+02 1.201e+03 3.300e+01
 5.880e+02 6.000e+02 7.130e+02 1.046e+03 6.480e+02 3.100e+02 1.162e+03
 5.200e+02 1.080e+02 5.690e+02 1.200e+03 2.240e+02 7.050e+02 4.440e+02
 2.500e+02 9.840e+02 3.500e+01 7.740e+02 4.190e+02 1.700e+02 1.470e+03
 9.380e+02 5.700e+02 3.000e+02 1.200e+02 1.160e+02 5.120e+02 5.670e+02
 4.450e+02 6.950e+02 4.050e+02 1.005e+03 6.680e+02 8.210e+02 4.320e+02
 1.300e+03 5.070e+02 6.790e+02 1.332e+03 2.090e+02 6.800e+02 7.160e+02
 1.400e+03 4.160e+02 4.290e+02 2.220e+02 5.700e+01 6.600e+02 1.016e+03
 3.700e+02 3.510e+02 3.790e+02 1.288e+03 3.600e+02 6.390e+02 4.950e+02
 2.880e+02 1.398e+03 4.770e+02 8.310e+02 1.904e+03 4.360e+02 3.520e+02
 6.110e+02 1.086e+03 2.970e+02 6.260e+02 5.600e+02 3.900e+02 5.660e+02
 1.126e+03 1.036e+03 1.088e+03 6.410e+02 6.170e+02 6.620e+02 3.120e+02
 1.065e+03 7.870e+02 4.680e+02 3.600e+01 8.220e+02 3.780e+02 9.460e+02
 3.410e+02 1.600e+01 5.500e+02 5.240e+02 5.600e+01 3.210e+02 8.420e+02
 6.890e+02 6.250e+02 3.580e+02 4.020e+02 9.400e+01 1.078e+03 3.290e+02
 9.290e+02 6.970e+02 1.573e+03 2.700e+02 9.220e+02 5.030e+02 1.334e+03
 3.610e+02 6.720e+02 5.060e+02 7.140e+02 4.030e+02 7.510e+02 2.260e+02
 6.200e+02 5.460e+02 3.920e+02 4.210e+02 9.050e+02 9.040e+02 4.300e+02
 6.140e+02 4.500e+02 2.100e+02 2.920e+02 7.950e+02 1.285e+03 8.190e+02
 4.200e+02 8.410e+02 2.810e+02 8.940e+02 1.464e+03 7.000e+02 2.620e+02
 1.274e+03 5.180e+02 1.236e+03 4.250e+02 6.920e+02 9.870e+02 9.700e+02
 2.800e+01 2.560e+02 1.619e+03 4.000e+01 8.460e+02 1.124e+03 7.200e+02
 8.280e+02 1.249e+03 8.100e+02 2.130e+02 5.850e+02 1.290e+02 4.980e+02
 1.270e+03 5.730e+02 1.410e+03 1.082e+03 2.360e+02 3.880e+02 3.340e+02
 8.740e+02 9.560e+02 7.730e+02 3.990e+02 1.620e+02 7.120e+02 6.090e+02
 3.710e+02 5.400e+02 7.200e+01 6.230e+02 4.280e+02 3.500e+02 2.980e+02
 1.445e+03 2.180e+02 9.850e+02 6.310e+02 1.280e+03 2.410e+02 6.900e+02
 2.660e+02 7.770e+02 8.120e+02 7.860e+02 1.116e+03 7.890e+02 1.056e+03
 5.000e+01 1.128e+03 7.750e+02 1.309e+03 1.246e+03 9.860e+02 6.160e+02
 1.518e+03 6.640e+02 3.870e+02 4.710e+02 3.850e+02 3.650e+02 1.767e+03
 1.330e+02 6.420e+02 2.470e+02 3.310e+02 7.420e+02 1.606e+03 9.160e+02
 1.850e+02 5.440e+02 5.530e+02 3.260e+02 7.780e+02 3.860e+02 4.260e+02
 3.680e+02 4.590e+02 1.350e+03 1.196e+03 6.300e+02 9.940e+02 1.680e+02
 1.261e+03 1.567e+03 2.990e+02 8.970e+02 6.070e+02 8.360e+02 5.150e+02
 3.740e+02 1.231e+03 1.110e+02 3.560e+02 4.000e+02 6.980e+02 1.247e+03
 2.570e+02 3.800e+02 2.700e+01 1.410e+02 9.910e+02 6.500e+02 5.210e+02
 1.436e+03 2.260e+03 7.190e+02 3.770e+02 1.330e+03 3.480e+02 1.219e+03
 7.830e+02 9.690e+02 6.730e+02 1.358e+03 1.260e+03 1.440e+02 5.840e+02
 5.540e+02 1.002e+03 6.190e+02 1.800e+02 5.590e+02 3.080e+02 8.660e+02
 8.950e+02 6.370e+02 6.040e+02 1.302e+03 1.071e+03 2.900e+02 7.280e+02
 2.000e+00 1.441e+03 9.430e+02 2.310e+02 4.140e+02 3.490e+02 4.420e+02
 3.280e+02 5.940e+02 8.160e+02 1.460e+03 1.324e+03 1.338e+03 6.850e+02
 1.422e+03 1.283e+03 8.100e+01 4.540e+02 9.030e+02 6.050e+02 9.900e+02
 2.060e+02 1.500e+02 4.570e+02 4.800e+01 8.710e+02 4.100e+01 6.740e+02
 6.240e+02 4.800e+02 1.154e+03 7.380e+02 4.930e+02 1.121e+03 2.820e+02
 5.000e+02 1.310e+02 1.696e+03 8.060e+02 1.361e+03 9.200e+02 1.721e+03
 1.870e+02 1.138e+03 9.880e+02 1.930e+02 5.510e+02 7.670e+02 1.186e+03
 8.920e+02 3.110e+02 8.270e+02 5.430e+02 1.003e+03 1.059e+03 2.390e+02
 9.450e+02 2.000e+01 1.455e+03 9.650e+02 9.800e+02 8.630e+02 5.330e+02
 1.084e+03 1.173e+03 5.230e+02 1.148e+03 1.910e+02 1.234e+03 3.750e+02
 8.080e+02 7.240e+02 1.520e+02 1.180e+03 2.520e+02 8.320e+02 5.750e+02
 9.190e+02 4.390e+02 3.810e+02 4.380e+02 5.490e+02 6.120e+02 1.163e+03
 4.370e+02 3.940e+02 1.416e+03 4.220e+02 7.620e+02 9.750e+02 1.097e+03
 2.510e+02 6.860e+02 6.560e+02 5.680e+02 5.390e+02 8.620e+02 1.970e+02
 5.160e+02 6.630e+02 6.080e+02 1.636e+03 7.840e+02 2.490e+02 1.040e+03
 4.830e+02 1.960e+02 5.720e+02 3.380e+02 3.300e+02 1.560e+02 1.390e+03
 5.130e+02 4.600e+02 6.590e+02 3.640e+02 5.640e+02 3.060e+02 5.050e+02
 9.320e+02 7.500e+02 6.400e+01 6.330e+02 1.170e+03 8.990e+02 9.020e+02
 1.238e+03 5.280e+02 1.024e+03 1.064e+03 2.850e+02 2.188e+03 4.650e+02
 3.220e+02 8.600e+02 5.990e+02 3.540e+02 6.300e+01 2.230e+02 3.010e+02
 4.430e+02 4.890e+02 2.840e+02 2.940e+02 8.140e+02 1.650e+02 5.520e+02
 8.330e+02 4.640e+02 9.360e+02 7.720e+02 1.440e+03 7.480e+02 9.820e+02
 3.980e+02 5.620e+02 4.840e+02 4.170e+02 6.990e+02 6.960e+02 8.960e+02
 5.560e+02 1.106e+03 6.510e+02 8.670e+02 8.540e+02 1.646e+03 1.074e+03
 5.360e+02 1.172e+03 9.150e+02 5.950e+02 1.237e+03 2.730e+02 6.840e+02
 3.240e+02 1.165e+03 1.380e+02 1.513e+03 3.170e+02 1.012e+03 1.022e+03
 5.090e+02 9.000e+02 1.085e+03 1.104e+03 2.400e+02 3.830e+02 6.440e+02
 3.970e+02 7.400e+02 8.370e+02 2.200e+02 5.860e+02 5.350e+02 4.100e+02
 7.500e+01 8.240e+02 5.920e+02 1.039e+03 5.100e+02 4.230e+02 6.610e+02
 2.480e+02 7.040e+02 4.120e+02 1.032e+03 2.190e+02 7.080e+02 4.150e+02
 1.004e+03 3.530e+02 7.020e+02 3.690e+02 6.220e+02 2.120e+02 6.450e+02
 8.520e+02 1.150e+03 1.258e+03 2.750e+02 1.760e+02 2.960e+02 5.380e+02
 1.157e+03 4.920e+02 1.198e+03 1.387e+03 5.220e+02 6.580e+02 1.216e+03
 1.480e+03 2.096e+03 1.159e+03 4.400e+02 1.456e+03 8.830e+02 5.470e+02
 7.880e+02 4.850e+02 3.400e+02 1.220e+03 4.270e+02 3.440e+02 7.560e+02
 1.540e+03 6.660e+02 8.030e+02 1.000e+03 8.850e+02 1.386e+03 3.190e+02
 5.340e+02 1.250e+02 1.314e+03 6.020e+02 1.920e+02 5.930e+02 8.040e+02
 1.053e+03 5.320e+02 1.158e+03 1.014e+03 1.940e+02 1.670e+02 7.760e+02
 5.644e+03 6.940e+02 1.572e+03 7.460e+02 1.406e+03 9.250e+02 4.820e+02
 1.890e+02 7.650e+02 8.000e+01 1.443e+03 2.590e+02 7.350e+02 7.340e+02
 1.447e+03 5.480e+02 3.150e+02 1.282e+03 4.080e+02 3.090e+02 2.030e+02
 8.650e+02 2.040e+02 7.900e+02 1.320e+03 7.690e+02 1.070e+03 2.640e+02
 7.590e+02 1.373e+03 9.760e+02 7.810e+02 2.500e+01 1.110e+03 4.040e+02
 5.800e+02 6.780e+02 9.580e+02 1.336e+03 1.079e+03 4.900e+01 8.300e+02
 9.230e+02 7.910e+02 2.630e+02 9.350e+02 1.051e+03 5.140e+02 1.100e+02
 1.414e+03 1.260e+02 1.129e+03 1.298e+03 3.760e+02 4.660e+02 2.440e+02
 1.137e+03 6.870e+02 1.010e+03 1.500e+03 6.700e+02 9.440e+02 1.188e+03
 8.560e+02 3.390e+02 4.810e+02 7.170e+02 5.790e+02 2.740e+02 7.800e+02
 2.830e+02 4.740e+02 4.520e+02 2.760e+02 9.600e+02 7.660e+02 1.026e+03
 7.300e+01 7.360e+02 1.319e+03 2.670e+02 1.092e+03 9.640e+02 9.540e+02
 1.346e+03 1.433e+03 8.700e+02 1.980e+02 1.682e+03 2.380e+02 3.430e+02
 7.600e+01 6.150e+02 7.800e+01 4.200e+01 4.690e+02 2.070e+02 4.580e+02
 4.760e+02 1.341e+03 8.440e+02 8.470e+02 8.500e+02 1.965e+03 7.410e+02
 3.630e+02 2.250e+02 1.333e+03 8.880e+02 6.360e+02 7.260e+02 2.540e+02
 4.350e+02 3.890e+02 2.790e+02 1.360e+03 1.232e+03 2.288e+03 1.531e+03
 1.230e+03 1.015e+03 1.037e+03 1.142e+03 1.262e+03 1.972e+03 8.810e+02
 8.760e+02 2.146e+03 1.557e+03 8.000e+02 6.520e+02 4.940e+02 6.830e+02
 9.130e+02 1.294e+03 2.158e+03 6.820e+02 1.430e+03 7.710e+02 5.400e+01
 5.200e+01 6.800e+01 8.640e+02 1.400e+02 1.733e+03 6.010e+02 9.620e+02
 1.252e+03 1.210e+02 9.550e+02 1.000e+02 1.312e+03 1.720e+02 1.550e+02
 9.310e+02 8.720e+02 7.450e+02 6.210e+02 4.330e+02 8.260e+02 1.340e+02
 1.690e+02 7.490e+02 1.152e+03 5.270e+02 3.420e+02 1.730e+02 7.000e+01
 1.094e+03 8.200e+02 1.021e+03 1.359e+03 7.550e+02 9.500e+02 6.060e+02
 1.259e+03 7.100e+02 1.111e+03 1.478e+03 3.320e+02 7.930e+02 2.460e+02
 1.540e+02 6.500e+01 1.476e+03 5.500e+01 1.758e+03 1.115e+03 1.640e+03
 1.140e+02 7.180e+02 4.960e+02 1.337e+03 1.034e+03 9.830e+02 1.206e+03
 8.900e+02 1.023e+03 1.190e+02 2.860e+02 1.728e+03 1.375e+03 1.420e+03
 2.257e+03 1.149e+03 1.075e+03 3.720e+02 1.204e+03 1.073e+03 1.087e+03
 1.660e+03 1.096e+03 7.290e+02 3.620e+02 5.370e+02 4.720e+02 5.300e+01
 7.640e+02 1.900e+02 1.027e+03 1.141e+03 6.810e+02 8.130e+02 1.280e+02
 1.044e+03 2.600e+02 5.830e+02 3.200e+01 5.310e+02 1.480e+02 7.440e+02
 9.600e+01 5.900e+02 2.000e+02 4.060e+02 1.750e+02 2.010e+02       nan
 7.580e+02 2.210e+02 6.340e+02 1.035e+03 7.790e+02 1.271e+03 3.550e+02
 2.085e+03 7.700e+02 7.220e+02 1.308e+03 6.880e+02 8.800e+01 1.194e+03
 1.538e+03 1.593e+03 1.033e+03 3.660e+02 1.474e+03 1.383e+03 8.930e+02
 1.029e+03 1.223e+03 1.011e+03 1.571e+03 3.180e+02 5.010e+02 7.850e+02
 6.380e+02 6.470e+02 8.380e+02 1.860e+02 9.260e+02 1.101e+03 1.047e+03
 7.970e+02 1.558e+03 1.328e+03 3.140e+02 9.300e+02 7.250e+02 1.151e+03
 1.304e+03 1.812e+03 1.684e+03 6.690e+02 1.178e+03 1.030e+03 8.480e+02
 9.180e+02 5.740e+02 1.181e+03 1.048e+03 3.350e+02 1.225e+03 7.270e+02
 9.680e+02 6.000e+01 9.370e+02 9.010e+02 1.732e+03 1.632e+03 9.730e+02
 9.100e+02 3.460e+02 7.920e+02 6.540e+02 1.300e+02 8.730e+02 9.080e+02
 4.410e+02 8.500e+01 2.420e+02 9.520e+02 1.098e+03 7.820e+02 1.220e+02
 3.160e+02 2.580e+02 5.870e+02 4.910e+02 4.530e+02 5.570e+02 1.080e+03
 4.970e+02 5.100e+01 5.020e+02 6.710e+02 1.412e+03 7.090e+02 1.320e+02
 4.010e+03 4.670e+02 7.700e+01 1.130e+02 5.770e+02 4.340e+02 1.001e+03
 1.392e+03 1.239e+03 9.240e+02 9.490e+02 2.150e+02 1.329e+03 1.112e+03
 7.960e+02 8.110e+02 1.090e+03 5.960e+02 1.127e+03 2.050e+02 1.191e+03
 9.510e+02 3.820e+02 3.730e+02 1.505e+03 1.290e+03 8.800e+02 1.038e+03
 1.182e+03 1.562e+03 1.836e+03 2.780e+02 1.810e+02 1.118e+03 7.600e+02
 7.990e+02 9.960e+02 9.390e+02 9.140e+02 2.710e+02 4.880e+02 7.010e+02
 4.550e+02 8.090e+02 9.530e+02 2.080e+02 1.430e+02 5.760e+02 3.470e+02
 7.940e+02 2.300e+02 2.610e+02 3.930e+02 1.576e+03 1.122e+03 8.530e+02
 4.750e+02 6.910e+02 4.240e+02 3.050e+02 5.260e+02 1.564e+03 9.090e+02
 1.136e+03 1.243e+03 1.490e+02 1.224e+03 3.370e+02]

Unique value of:>>> BsmtFinSF2 (273)
[   0.   32.  668.  486.   93.  491.  506.  712.  362.   41.  169.  869.
  150.  670.   28. 1080.  181.  768.  215.  374.  208.  441.  184.  279.
  306.  180.  580.  690.  692.  228.  125. 1063.  620.  175.  820. 1474.
  264.  479.  147.  232.  380.  544.  294.  258.  121.  391.  531.  344.
  539.  713.  210.  311. 1120.  165.  532.   96.  495.  174. 1127.  139.
  202.  645.  123.  551.  219.  606.  612.  480.  182.  132.  336.  468.
  287.   35.  499.  723.  119.   40.  117.  239.   80.  472.   64. 1057.
  127.  630.  128.  377.  764.  345. 1085.  435.  823.  500.  290.  324.
  634.  411.  841. 1061.  466.  396.  354.  149.  193.  273.  465.  400.
  682.  557.  230.  106.  791.  240.  547.  469.  177.  108.  600.  492.
  211.  168. 1031.  438.  375.  144.   81.  906.  608.  276.  661.   68.
  173.  972.  105.  420.  546.  334.  352.  872.  110.  627.  163. 1029.
   78.  859.  981.   42.   46.  162.  350.  263. 1073.   12.  159.  474.
  453.  684.  387.  688.  252.  590.  284.  622.  113. 1526.  360.  774.
  364.  596.  884.   92.  216.  136.  201.  512.  247.  483.  750.   60.
  102.   95.   63.  262.  393.  286.  450.   72.  243.  694.  875.  507.
  419.  250.  116.  624.   76.  270.  288.  186.  449.   48.  613.  852.
  555.  799.  811.  842.  382.  456.  308.   52.  196.  488.  319.   nan
  956.  120.  679.  604.  153.  619.    6.  351. 1037.  829.   38.  206.
  167.  543.  259.  404.  138.  955.  691.   66.  154.  442.  448.  227.
  398.  722.  761.  529.  522.  873.  891.  755.  321.  915.  417.  432.
  831.  278. 1020.  530.  904.  156. 1393. 1039.  497.  402.  748.  281.
  912.  373.  982.  826.  850. 1164. 1083.  337.  297.]

Unique value of:>>> BsmtFinType1 (7)
['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' nan 'LwQ']

Unique value of:>>> BsmtFinType2 (7)
['Unf' 'BLQ' nan 'ALQ' 'Rec' 'LwQ' 'GLQ']

Unique value of:>>> BsmtFullBath (5)
[ 1.  0.  2.  3. nan]

Unique value of:>>> BsmtHalfBath (4)
[ 0.  1.  2. nan]

Unique value of:>>> BsmtQual (5)
['Gd' 'TA' 'Ex' nan 'Fa']

Unique value of:>>> BsmtUnfSF (1136)
[ 150.  284.  434. ...  129.   45. 1503.]

Unique value of:>>> CentralAir (2)
['Y' 'N']

Unique value of:>>> Condition1 (9)
['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']

Unique value of:>>> Condition2 (8)
['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']

Unique value of:>>> Electrical (6)
['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix' nan]

Unique value of:>>> EnclosedPorch (183)
[   0  272  228  205  176   87  172  102   37  144   64  114  202  128
  156   44   77  192  140  180  183   39  184   40  552   30  126   96
   60  150  120  112  252   52  224  234  244  268  137   24  108  294
  177  218  242   91  160  130  169  105   34  248  236   32   80  115
  291  116  158  210   36  200   84  148  136  240   54  100  189  293
  164  216  239   67   90   56  129   98  143   70  386  154  185  134
  196  264  275  230  254   68  194  318   48   94  138  226  174   19
  170  220  214  280  190  330  208  145  259   81   42  123  162  286
  168   20  301  198  221  212   50   99  186  113  135  334  246   18
   41   35  364   45   86  265  222  209  260  203  432   25  238   51
  213  288  211   55   57   78   72  368  165   92   16   66  109  139
  219  101  117  204  122  231  121  207  249  290  175   26   88 1012
   43  584  133  324  161   75  167   28  104  296  256  225  429  132
   23]

Unique value of:>>> ExterCond (5)
['TA' 'Gd' 'Fa' 'Po' 'Ex']

Unique value of:>>> ExterQual (4)
['Gd' 'TA' 'Ex' 'Fa']

Unique value of:>>> Exterior1st (16)
['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc'
 'CBlock' nan]

Unique value of:>>> Exterior2nd (17)
['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoard' 'Plywood' 'Wd Sdng' 'CmentBd'
 'BrkFace' 'Stucco' 'AsbShng' 'Brk Cmn' 'ImStucc' 'AsphShn' 'Stone'
 'Other' 'CBlock' nan]

Unique value of:>>> Fireplaces (5)
[0 1 2 3 4]

Unique value of:>>> Foundation (6)
['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone']

Unique value of:>>> FullBath (5)
[2 1 3 0 4]

Unique value of:>>> Functional (8)
['Typ' 'Min1' 'Maj1' 'Min2' 'Mod' 'Maj2' 'Sev' nan]

Unique value of:>>> GarageArea (604)
[ 548.  460.  608.  642.  836.  480.  636.  484.  468.  205.  384.  736.
  352.  840.  576.  516.  294.  853.  280.  534.  572.  270.  890.  772.
  319.  240.  250.  271.  447.  556.  691.  672.  498.  246.    0.  440.
  308.  504.  300.  670.  826.  386.  388.  528.  894.  565.  641.  288.
  645.  852.  558.  220.  667.  360.  427.  490.  379.  297.  283.  509.
  405.  758.  461.  400.  462.  420.  432.  506.  684.  472.  366.  476.
  410.  740.  648.  273.  546.  325.  792.  450.  180.  430.  594.  390.
  540.  264.  530.  435.  453.  750.  487.  624.  471.  318.  766.  660.
  470.  720.  577.  380.  434.  866.  495.  564.  312.  625.  680.  678.
  726.  532.  216.  303.  789.  511.  616.  521.  451. 1166.  252.  497.
  682.  666.  786.  795.  856.  473.  398.  500.  349.  454.  644.  299.
  210.  431.  438.  675.  968.  721.  336.  810.  494.  457.  818.  463.
  604.  389.  538.  520.  309.  429.  673.  884.  868.  492.  413.  924.
 1053.  439.  671.  338.  573.  732.  505.  575.  626.  898.  529.  685.
  281.  539.  418.  588.  282.  375.  683.  843.  552.  870.  888.  746.
  708.  513. 1025.  656.  872.  292.  441.  189.  880.  676.  301.  474.
  706.  617.  445.  200.  592.  566.  514.  296.  244.  610.  834.  639.
  501.  846.  560.  596.  600.  373.  947.  350.  396.  864.  304.  784.
  696.  569.  628.  550.  493.  578.  198.  422.  228.  526.  525.  908.
  499.  508.  694.  874.  164.  402.  515.  286.  603.  900.  583.  889.
  858.  502.  392.  403.  527.  765.  367.  426.  615.  871.  570.  406.
  590.  612.  650. 1390.  275.  452.  842.  816.  621.  544.  486.  230.
  261.  531.  393.  774.  749.  364.  627.  260.  256.  478.  442.  562.
  512.  839.  330.  711. 1134.  416.  779.  702.  567.  832.  326.  551.
  606.  739.  408.  475.  704.  983.  768.  632.  541.  320.  800.  831.
  554.  878.  752.  614.  481.  496.  423.  841.  895.  412.  865.  630.
  605.  602.  618.  444.  397.  455.  409.  820. 1020.  598.  857.  595.
  433.  776. 1220.  458.  613.  456.  436.  812.  686.  611.  425.  343.
  479.  619.  902.  574.  523.  414.  738.  354.  483.  327.  756.  690.
  284.  833.  601.  533.  522.  788.  555.  689.  796.  808.  510.  255.
  424.  305.  368.  824.  328.  160.  437.  665.  290.  912.  905.  542.
  716.  586.  467.  582. 1248. 1043.  254.  712.  719.  862.  928.  782.
  466.  714. 1052.  225.  234.  324.  306.  830.  807.  358.  186.  693.
  482.  813.  995.  757. 1356.  459.  701.  322.  315.  668.  404.  543.
  954.  850.  477.  276.  518. 1014.  753. 1418.  213.  844.  860.  748.
  248.  287.  825.  647.  342.  770.  663.  377.  804.  936.  722.  208.
  662.  754.  622.  620.  370. 1069.  372.  923.  192.  730.  751.  958.
  962.  762.  713.  535.  517.  263.  780.  363.  365.  231.  591.  209.
 1017.  580.  399.  741.  253.  581.  345.  896.  932.  640.  927.  700.
  886.  949.  649.  394.  658.  815.  623.  972.  984.  692.  845.  559.
  465.  524.  561.  549.  907.  162.  357.  207. 1184.  316.  226.  340.
  266. 1138.  904. 1231.  195.  313.  215.  307.  295.  351.  885.  920.
  698.  557.  489. 1314.  787. 1150. 1003.  944.  428.  687.  938.  783.
  851.  545.  469.  464.  267. 1488.  401.  311.  828.  869.  355.  249.
 1348.  811.  725.  715.  814.  369.  599.  344.  356.  185.  892.  257.
  729. 1110.  724.  585.  488. 1040. 1174.  728.  916.  876.  631.  925.
  806.  933. 1092.  859.  744. 1105.  310.  293.  371. 1200.  184.  374.
  331.  224.  217.  323.  638.  332.  674.  747.  242.  597.  579. 1154.
   nan  100.  571. 1041.  963.  443.  773.  485. 1085.  899.  959.  803.
  760.  584.  449.  688.  568.  353.  791. 1008.  378.  258.  848.  317.
  646.  265.  609.  272.]

Unique value of:>>> GarageCars (7)
[ 2.  3.  1.  0.  4.  5. nan]

Unique value of:>>> GarageCond (6)
['TA' 'Fa' nan 'Gd' 'Po' 'Ex']

Unique value of:>>> GarageFinish (4)
['RFn' 'Unf' 'Fin' nan]

Unique value of:>>> GarageQual (6)
['TA' 'Fa' 'Gd' nan 'Ex' 'Po']

Unique value of:>>> GarageType (7)
['Attchd' 'Detchd' 'BuiltIn' 'CarPort' nan 'Basment' '2Types']

Unique value of:>>> GarageYrBlt (104)
[2003. 1976. 2001. 1998. 2000. 1993. 2004. 1973. 1931. 1939. 1965. 2005.
 1962. 2006. 1960. 1991. 1970. 1967. 1958. 1930. 2002. 1968. 2007. 2008.
 1957. 1920. 1966. 1959. 1995. 1954. 1953.   nan 1983. 1977. 1997. 1985.
 1963. 1981. 1964. 1999. 1935. 1990. 1945. 1987. 1989. 1915. 1956. 1948.
 1974. 2009. 1950. 1961. 1921. 1900. 1979. 1951. 1969. 1936. 1975. 1971.
 1923. 1984. 1926. 1955. 1986. 1988. 1916. 1932. 1972. 1918. 1980. 1924.
 1996. 1940. 1949. 1994. 1910. 1978. 1982. 1992. 1925. 1941. 2010. 1927.
 1947. 1937. 1942. 1938. 1952. 1928. 1922. 1934. 1906. 1914. 1946. 1908.
 1929. 1933. 1917. 1896. 1895. 2207. 1943. 1919.]

Unique value of:>>> GrLivArea (1292)
[1710 1262 1786 ... 2315  641 1778]

Unique value of:>>> HalfBath (3)
[1 0 2]

Unique value of:>>> Heating (6)
['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor']

Unique value of:>>> HeatingQC (5)
['Ex' 'Gd' 'TA' 'Fa' 'Po']

Unique value of:>>> HouseStyle (8)
['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']

Unique value of:>>> KitchenAbvGr (4)
[1 2 3 0]

Unique value of:>>> KitchenQual (5)
['Gd' 'TA' 'Ex' 'Fa' nan]

Unique value of:>>> LandContour (4)
['Lvl' 'Bnk' 'Low' 'HLS']

Unique value of:>>> LandSlope (3)
['Gtl' 'Mod' 'Sev']

Unique value of:>>> LotArea (1951)
[ 8450  9600 11250 ...  1894 20000 10441]

Unique value of:>>> LotConfig (5)
['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']

Unique value of:>>> LotFrontage (129)
[ 65.  80.  68.  60.  84.  85.  75.  nan  51.  50.  70.  91.  72.  66.
 101.  57.  44. 110.  98.  47. 108. 112.  74. 115.  61.  48.  33.  52.
 100.  24.  89.  63.  76.  81.  95.  69.  21.  32.  78. 121. 122.  40.
 105.  73.  77.  64.  94.  34.  90.  55.  88.  82.  71. 120. 107.  92.
 134.  62.  86. 141.  97.  54.  41.  79. 174.  99.  67.  83.  43. 103.
  93.  30. 129. 140.  35.  37. 118.  87. 116. 150. 111.  49.  96.  59.
  36.  56. 102.  58.  38. 109. 130.  53. 137.  45. 106. 104.  42.  39.
 144. 114. 128. 149. 313. 168. 182. 138. 160. 152. 124. 153.  46.  26.
  25. 119.  31.  28. 117. 113. 125. 135. 136.  22. 123. 195. 155. 126.
 200. 131. 133.]

Unique value of:>>> LotShape (4)
['Reg' 'IR1' 'IR2' 'IR3']

Unique value of:>>> LowQualFinSF (36)
[   0  360  513  234  528  572  144  392  371  390  420  473  156  515
   80   53  232  481  120  514  397  479  205  384  362 1064  431  436
  259  312  108  697  512  114  140  450]

Unique value of:>>> MSSubClass (16)
[ 60  20  70  50 190  45  90 120  30  85  80 160  75 180  40 150]

Unique value of:>>> MSZoning (6)
['RL' 'RM' 'C (all)' 'FV' 'RH' nan]

Unique value of:>>> MasVnrArea (445)
[1.960e+02 0.000e+00 1.620e+02 3.500e+02 1.860e+02 2.400e+02 2.860e+02
 3.060e+02 2.120e+02 1.800e+02 3.800e+02 2.810e+02 6.400e+02 2.000e+02
 2.460e+02 1.320e+02 6.500e+02 1.010e+02 4.120e+02 2.720e+02 4.560e+02
 1.031e+03 1.780e+02 5.730e+02 3.440e+02 2.870e+02 1.670e+02 1.115e+03
 4.000e+01 1.040e+02 5.760e+02 4.430e+02 4.680e+02 6.600e+01 2.200e+01
 2.840e+02 7.600e+01 2.030e+02 6.800e+01 1.830e+02 4.800e+01 2.800e+01
 3.360e+02 6.000e+02 7.680e+02 4.800e+02 2.200e+02 1.840e+02 1.129e+03
 1.160e+02 1.350e+02 2.660e+02 8.500e+01 3.090e+02 1.360e+02 2.880e+02
 7.000e+01 3.200e+02 5.000e+01 1.200e+02 4.360e+02 2.520e+02 8.400e+01
 6.640e+02 2.260e+02 3.000e+02 6.530e+02 1.120e+02 4.910e+02 2.680e+02
 7.480e+02 9.800e+01 2.750e+02 1.380e+02 2.050e+02 2.620e+02 1.280e+02
 2.600e+02 1.530e+02 6.400e+01 3.120e+02 1.600e+01 9.220e+02 1.420e+02
 2.900e+02 1.270e+02 5.060e+02 2.970e+02       nan 6.040e+02 2.540e+02
 3.600e+01 1.020e+02 4.720e+02 4.810e+02 1.080e+02 3.020e+02 1.720e+02
 3.990e+02 2.700e+02 4.600e+01 2.100e+02 1.740e+02 3.480e+02 3.150e+02
 2.990e+02 3.400e+02 1.660e+02 7.200e+01 3.100e+01 3.400e+01 2.380e+02
 1.600e+03 3.650e+02 5.600e+01 1.500e+02 2.780e+02 2.560e+02 2.250e+02
 3.700e+02 3.880e+02 1.750e+02 2.960e+02 1.460e+02 1.130e+02 1.760e+02
 6.160e+02 3.000e+01 1.060e+02 8.700e+02 3.620e+02 5.300e+02 5.000e+02
 5.100e+02 2.470e+02 3.050e+02 2.550e+02 1.250e+02 1.000e+02 4.320e+02
 1.260e+02 4.730e+02 7.400e+01 1.450e+02 2.320e+02 3.760e+02 4.200e+01
 1.610e+02 1.100e+02 1.800e+01 2.240e+02 2.480e+02 8.000e+01 3.040e+02
 2.150e+02 7.720e+02 4.350e+02 3.780e+02 5.620e+02 1.680e+02 8.900e+01
 2.850e+02 3.600e+02 9.400e+01 3.330e+02 9.210e+02 7.620e+02 5.940e+02
 2.190e+02 1.880e+02 4.790e+02 5.840e+02 1.820e+02 2.500e+02 2.920e+02
 2.450e+02 2.070e+02 8.200e+01 9.700e+01 3.350e+02 2.080e+02 4.200e+02
 1.700e+02 4.590e+02 2.800e+02 9.900e+01 1.920e+02 2.040e+02 2.330e+02
 1.560e+02 4.520e+02 5.130e+02 2.610e+02 1.640e+02 2.590e+02 2.090e+02
 2.630e+02 2.160e+02 3.510e+02 6.600e+02 3.810e+02 5.400e+01 5.280e+02
 2.580e+02 4.640e+02 5.700e+01 1.470e+02 1.170e+03 2.930e+02 6.300e+02
 4.660e+02 1.090e+02 4.100e+01 1.600e+02 2.890e+02 6.510e+02 1.690e+02
 9.500e+01 4.420e+02 2.020e+02 3.380e+02 8.940e+02 3.280e+02 6.730e+02
 6.030e+02 1.000e+00 3.750e+02 9.000e+01 3.800e+01 1.570e+02 1.100e+01
 1.400e+02 1.300e+02 1.480e+02 8.600e+02 4.240e+02 1.047e+03 2.430e+02
 8.160e+02 3.870e+02 2.230e+02 1.580e+02 1.370e+02 1.150e+02 1.890e+02
 2.740e+02 1.170e+02 6.000e+01 1.220e+02 9.200e+01 4.150e+02 7.600e+02
 2.700e+01 7.500e+01 3.610e+02 1.050e+02 3.420e+02 2.980e+02 5.410e+02
 2.360e+02 1.440e+02 4.230e+02 4.400e+01 1.510e+02 9.750e+02 4.500e+02
 2.300e+02 5.710e+02 2.400e+01 5.300e+01 2.060e+02 1.400e+01 3.240e+02
 2.950e+02 3.960e+02 6.700e+01 1.540e+02 4.250e+02 4.500e+01 1.378e+03
 3.370e+02 1.490e+02 1.430e+02 5.100e+01 1.710e+02 2.340e+02 6.300e+01
 7.660e+02 3.200e+01 8.100e+01 1.630e+02 5.540e+02 2.180e+02 6.320e+02
 1.140e+02 5.670e+02 3.590e+02 4.510e+02 6.210e+02 7.880e+02 8.600e+01
 7.960e+02 3.910e+02 2.280e+02 8.800e+01 1.650e+02 4.280e+02 4.100e+02
 5.640e+02 3.680e+02 3.180e+02 5.790e+02 6.500e+01 7.050e+02 4.080e+02
 2.440e+02 1.230e+02 3.660e+02 7.310e+02 4.480e+02 2.940e+02 3.100e+02
 2.370e+02 4.260e+02 9.600e+01 4.380e+02 1.940e+02 1.190e+02 2.000e+01
 5.040e+02 4.920e+02 6.150e+02 1.095e+03 1.159e+03 2.650e+02 9.100e+01
 7.710e+02 4.700e+01 1.770e+02 3.710e+02 4.300e+02 4.400e+02 2.290e+02
 7.260e+02 4.180e+02 7.240e+02 3.830e+02 7.300e+02 4.700e+02 3.080e+02
 6.340e+02 3.720e+02 1.980e+02 1.210e+02 2.640e+02 1.410e+02 2.830e+02
 5.090e+02 2.170e+02 3.000e+00 6.570e+02 1.240e+02 4.440e+02 2.300e+01
 2.420e+02 3.640e+02 3.520e+02 4.060e+02 4.020e+02 4.220e+02 3.560e+02
 6.800e+02 1.110e+03 2.210e+02 7.140e+02 6.470e+02 1.290e+03 4.950e+02
 5.680e+02 1.790e+02 1.050e+03 1.870e+02 5.200e+01 2.760e+02 3.900e+01
 1.900e+02 2.510e+02 2.270e+02 1.340e+02 2.220e+02 5.800e+01 6.680e+02
 6.740e+02 1.970e+02 7.100e+02 9.450e+02 5.490e+02 2.530e+02 4.000e+02
 9.700e+02 5.020e+02 3.940e+02 2.350e+02 5.150e+02 5.260e+02 7.540e+02
 3.530e+02 5.250e+02 8.700e+01 2.910e+02 6.900e+01 2.790e+02 3.230e+02
 2.140e+02 5.190e+02 1.224e+03 6.520e+02 8.860e+02 9.020e+02 4.340e+02
 6.620e+02 7.340e+02 5.500e+02 5.140e+02 3.850e+02 5.180e+02 5.720e+02
 3.220e+02 8.770e+02 3.970e+02 7.380e+02 5.010e+02 1.180e+02 6.920e+02
 3.320e+02 5.220e+02 3.790e+02 5.320e+02 6.200e+01 1.990e+02 3.550e+02
 4.050e+02 3.270e+02 2.570e+02 3.820e+02]

Unique value of:>>> MasVnrType (5)
['BrkFace' 'None' 'Stone' 'BrkCmn' nan]

Unique value of:>>> MiscVal (38)
[    0   700   350   500   400   480   450 15500  1200   800  2000   600
  3500  1300    54   620   560  1400  8300  1150  2500 12500  1500   300
    80   490   650   900   750  6500  1000  4500  3000 17000  1512   455
   460   420]

Unique value of:>>> MoSold (12)
[ 2  5  9 12 10  8 11  4  1  7  3  6]

Unique value of:>>> Neighborhood (25)
['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']

Unique value of:>>> OpenPorchSF (252)
[ 61   0  42  35  84  30  57 204   4  21  33 213 112 102 154 159 110  90
  56  32  50 258  54  65  38  47  64  52 138 104  82  43 146  75  72  70
  49  11  36 151  29  94 101 199  99 234 162  63  68  46  45 122 184 120
  20  24 130 205 108  80  66  48  25  96 111 106  40 114   8 136 132  62
 228  60 238 260  27  74  16 198  26  83  34  55  22  98 172 119 208 105
 140 168  28  39 148  12  51 150 117 250  10  81  44 144 175 195 128  76
  17  59 214 121  53 231 134 192 123  78 187  85 133 176 113 137 125 523
 100 285  88 406 155  73 182 502 274 158 142 243 235 312 124 267 265  87
 288  23 152 341 116 160 174 247 291  18 170 156 166 129 418 240  77 364
 188 207  67  69 131 191  41 118 252 189 282 135  95 224 169 319  58  93
 244 185 200  92 180 263 304 229 103 211 287 292 241 547  91  86 262 210
 141  15 126 236 278 197 273 190 183 165 226 178 177 254 215 222 193 201
 173 153 251 230 299 365 139 216  89 372 217 276 164 368 203 127 256 194
 324 171 570 484 742 444 266  97  37 246  31 382   6 115 253 245 107 225]

Unique value of:>>> OverallCond (9)
[5 8 6 7 4 2 3 9 1]

Unique value of:>>> OverallQual (10)
[ 7  6  8  5  9  4 10  3  1  2]

Unique value of:>>> PavedDrive (3)
['Y' 'N' 'P']

Unique value of:>>> PoolArea (14)
[  0 512 648 576 555 480 519 738 144 368 444 228 561 800]

Unique value of:>>> RoofMatl (8)
['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll'
 'ClyTile']

Unique value of:>>> RoofStyle (6)
['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']

Unique value of:>>> SaleCondition (6)
['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family']

Unique value of:>>> SaleType (10)
['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth' nan]

Unique value of:>>> ScreenPorch (121)
[  0 176 198 291 252  99 184 168 130 142 192 410 224 266 170 154 153 144
 128 259 160 271 234 374 185 182  90 396 140 276 180 161 145 200 122  95
 120  60 126 189 260 147 385 287 156 100 216 210 197 204 225 152 175 312
 222 265 322 190 233  63  53 143 273 288 263  80 163 116 480 178 440 155
 220 119 165  40 256 240 148 166 108 490 196 121  92 342 255 111 112 231
 110 117 195 115 141 208  94 164  64 576 227 221 171 135 174 217 201 109
 150  84 228 138  88 280 123 264 270 162 348 113 104]

Unique value of:>>> Street (2)
['Pave' 'Grvl']

Unique value of:>>> TotRmsAbvGrd (14)
[ 8  6  7  9  5 11  4 10 12  3  2 14 13 15]

Unique value of:>>> TotalBsmtSF (1059)
[ 856. 1262.  920. ...  498.  432. 1381.]

Unique value of:>>> Utilities (3)
['AllPub' 'NoSeWa' nan]

Unique value of:>>> WoodDeckSF (379)
[   0  298  192   40  255  235   90  147  140  160   48  240  171  100
  406  222  288   49  203  113  392  145  196  168  112  106  857  115
  120   12  576  301  144  300   74  127  232  158  352  182  180  166
  224   80  367   53  188  105   24   98  276  200  409  239  400  476
  178  574  237  210  441  116  280  104   87  132  238  149  355   60
  139  108  351  209  216  248  143  365  370   58  197  263  123  138
  333  250  292   95  262   81  289  124  172  110  208  468  256  302
  190  340  233  184  201  142  122  155  670  135  495  536  306   64
  364  353   66  159  146  296  125   44  215  264   88   89   96  414
  519  206  141  260  324  156  220   38  261  126   85  466  270   78
  169  320  268   72  349   42   35  326  382  161  179  103  253  148
  335  176  390  328  312  185  269  195   57  236  517  304  198  426
   28  316  322  307  257  219  416  344  380   68  114  327  165  187
  181   92  228  245  503  315  241  303  133  403   36   52  265  207
  150  290  486  278   70  418  234   26  342   97  272  121  243  511
  154  164  173  384  202   56  321   86  194  421  305  117  550  509
  153  394  371   63  252  136  186  170  474  214  199  728  436   55
  431  448  361  362  162  229  439  379  356   84  635  325   33  212
  314  242  294   30  128   45  177  227  218  309  404  500  668  402
  283  183  175  586  295   32  366  736  393  360  157  483  275   23
  277  657   51   54  221  226  496  336  450   71  331  375  174   22
  287  129  225  319   99  230  231  297  205  462  502  501  266  244
  189  131   73  329  279  467  119  308  152   16  411  358  385   20
   25  490   76  204  311  102   50  424  339  211  259  134  213  318
  428  282  167  407  130  460  286  193  455  284  285   14  521  646
  386  405  546  118  291  274 1424  690  330  246  444  354  247  870
  432    4  641   94  191   75  631  345  520   27   77  684  453  413
  530]

Unique value of:>>> YearBuilt (118)
[2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006
 1960 1929 1970 1967 1958 1930 2002 1968 2007 1951 1957 1927 1920 1966
 1959 1994 1954 1953 1955 1983 1975 1997 1934 1963 1981 1964 1999 1972
 1921 1945 1982 1998 1956 1948 1910 1995 1991 2009 1950 1961 1977 1985
 1979 1885 1919 1990 1969 1935 1988 1971 1952 1936 1923 1924 1984 1926
 1940 1941 1987 1986 2008 1908 1892 1916 1932 1918 1912 1947 1925 1900
 1980 1989 1992 1949 1880 1928 1978 1922 1996 2010 1946 1913 1937 1942
 1938 1974 1893 1914 1906 1890 1898 1904 1882 1875 1911 1917 1872 1905
 1907 1896 1902 1895 1879 1901]

Unique value of:>>> YearRemodAdd (61)
[2003 1976 2002 1970 2000 1995 2005 1973 1950 1965 2006 1962 2007 1960
 2001 1967 2004 2008 1997 1959 1990 1955 1983 1980 1966 1963 1987 1964
 1972 1996 1998 1989 1953 1956 1968 1981 1992 2009 1982 1961 1993 1999
 1985 1979 1977 1969 1958 1991 1971 1952 1975 2010 1984 1986 1994 1988
 1954 1957 1951 1978 1974]

Unique value of:>>> YrSold (5)
[2008 2007 2006 2009 2010]

In [23]:
# Describe the target 
train["SalePrice"].describe()
Out[23]:
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
In [24]:
# Plot the distplot of target
plt.figure(figsize=(10,8))
bar = sns.distplot(train["SalePrice"])
bar.legend(["Skewness: {:.2f}".format(train['SalePrice'].skew())])
Out[24]:
<matplotlib.legend.Legend at 0x1bb7d6acbe0>
 
In [25]:
# correlation heatmap
plt.figure(figsize=(25,25))
ax = sns.heatmap(train.corr(), cmap = "coolwarm", annot=True, linewidth=2)

# to fix the bug "first and last row cut in half of heatmap plot"
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Out[25]:
(38.0, 0.0)
 
In [26]:
# correlation heatmap of higly correlated features with SalePrice
hig_corr = train.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["SalePrice"]) >= 0.5]
hig_corr_features
Out[26]:
Index(['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF',
       'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea',
       'SalePrice'],
      dtype='object')
In [27]:
plt.figure(figsize=(10,8))
ax = sns.heatmap(train[hig_corr_features].corr(), cmap = "coolwarm", annot=True, linewidth=3)
# to fix the bug "first and last row cut in half of heatmap plot"
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Out[27]:
(11.0, 0.0)
 
In [28]:
# Plot regplot to get the nature of highly correlated data
plt.figure(figsize=(16,9))
for i in range(len(hig_corr_features)):
    if i <= 9:
        plt.subplot(3,4,i+1)
        plt.subplots_adjust(hspace = 0.5, wspace = 0.5)
        sns.regplot(data=train, x = hig_corr_features[i], y = 'SalePrice')
 
 

Handling Missing Value

In [29]:
missing_col = df.columns[df.isnull().any()]
missing_col
Out[29]:
Index(['BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF',
       'Electrical', 'Exterior1st', 'Exterior2nd', 'Functional', 'GarageArea',
       'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType',
       'GarageYrBlt', 'KitchenQual', 'LotFrontage', 'MSZoning', 'MasVnrArea',
       'MasVnrType', 'SaleType', 'TotalBsmtSF', 'Utilities'],
      dtype='object')
 

Handling missing value of Bsmt feature

In [30]:
bsmt_col = ['BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'TotalBsmtSF']
bsmt_feat = df[bsmt_col]
bsmt_feat
Out[30]:
  BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF TotalBsmtSF
Id                      
1 TA No 706.0 0.0 GLQ Unf 1.0 0.0 Gd 150.0 856.0
2 TA Gd 978.0 0.0 ALQ Unf 0.0 1.0 Gd 284.0 1262.0
3 TA Mn 486.0 0.0 GLQ Unf 1.0 0.0 Gd 434.0 920.0
4 Gd No 216.0 0.0 ALQ Unf 1.0 0.0 TA 540.0 756.0
5 TA Av 655.0 0.0 GLQ Unf 1.0 0.0 Gd 490.0 1145.0
2915 TA No 0.0 0.0 Unf Unf 0.0 0.0 TA 546.0 546.0
2916 TA No 252.0 0.0 Rec Unf 0.0 0.0 TA 294.0 546.0
2917 TA No 1224.0 0.0 ALQ Unf 1.0 0.0 TA 0.0 1224.0
2918 TA Av 337.0 0.0 GLQ Unf 0.0 1.0 Gd 575.0 912.0
2919 TA Av 758.0 0.0 LwQ Unf 0.0 0.0 Gd 238.0 996.0

2919 rows × 11 columns

In [31]:
bsmt_feat.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 1 to 2919
Data columns (total 11 columns):
BsmtCond        2837 non-null object
BsmtExposure    2837 non-null object
BsmtFinSF1      2918 non-null float64
BsmtFinSF2      2918 non-null float64
BsmtFinType1    2840 non-null object
BsmtFinType2    2839 non-null object
BsmtFullBath    2917 non-null float64
BsmtHalfBath    2917 non-null float64
BsmtQual        2838 non-null object
BsmtUnfSF       2918 non-null float64
TotalBsmtSF     2918 non-null float64
dtypes: float64(6), object(5)
memory usage: 273.7+ KB
In [32]:
bsmt_feat.isnull().sum()
Out[32]:
BsmtCond        82
BsmtExposure    82
BsmtFinSF1       1
BsmtFinSF2       1
BsmtFinType1    79
BsmtFinType2    80
BsmtFullBath     2
BsmtHalfBath     2
BsmtQual        81
BsmtUnfSF        1
TotalBsmtSF      1
dtype: int64
In [33]:
bsmt_feat = bsmt_feat[bsmt_feat.isnull().any(axis=1)]
bsmt_feat
Out[33]:
  BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF TotalBsmtSF
Id                      
18 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
40 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
91 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
103 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
157 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2804 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2805 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2825 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2892 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2905 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0

88 rows × 11 columns

In [34]:
bsmt_feat_all_nan = bsmt_feat[(bsmt_feat.isnull() | bsmt_feat.isin([0])).all(1)]
bsmt_feat_all_nan
Out[34]:
  BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF TotalBsmtSF
Id                      
18 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
40 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
91 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
103 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
157 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
183 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
260 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
343 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
363 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
372 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
393 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
521 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
533 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
534 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
554 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
647 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
706 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
737 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
750 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
779 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
869 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
895 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
898 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
985 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1001 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1012 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1036 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1046 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1049 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1050 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1091 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1180 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1217 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1219 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1233 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1322 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1413 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1586 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1594 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1730 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1779 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1815 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1848 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1849 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1857 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1858 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1859 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1861 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
1916 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2051 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2067 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2069 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2121 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2123 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2189 NaN NaN 0.0 0.0 NaN NaN NaN NaN NaN 0.0 0.0
2190 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2191 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2194 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2217 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2225 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2388 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2436 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2453 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2454 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2491 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2499 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2548 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2553 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2565 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2579 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2600 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2703 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2764 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2767 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2804 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2805 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2825 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2892 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
2905 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN 0.0 0.0
In [35]:
bsmt_feat_all_nan.shape
Out[35]:
(79, 11)
In [36]:
qual = list(df.loc[:, df.dtypes == 'object'].columns.values)
qual
Out[36]:
['BldgType',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'BsmtQual',
 'CentralAir',
 'Condition1',
 'Condition2',
 'Electrical',
 'ExterCond',
 'ExterQual',
 'Exterior1st',
 'Exterior2nd',
 'Foundation',
 'Functional',
 'GarageCond',
 'GarageFinish',
 'GarageQual',
 'GarageType',
 'Heating',
 'HeatingQC',
 'HouseStyle',
 'KitchenQual',
 'LandContour',
 'LandSlope',
 'LotConfig',
 'LotShape',
 'MSZoning',
 'MasVnrType',
 'Neighborhood',
 'PavedDrive',
 'RoofMatl',
 'RoofStyle',
 'SaleCondition',
 'SaleType',
 'Street',
 'Utilities']
In [37]:
# Fillinf the mising value in bsmt features
for i in bsmt_col:
    if i in qual:
        bsmt_feat_all_nan[i] = bsmt_feat_all_nan[i].replace(np.nan, 'NA') # replace the NAN value by 'NA'
    else:
        bsmt_feat_all_nan[i] = bsmt_feat_all_nan[i].replace(np.nan, 0) # replace the NAN value inplace of 0

bsmt_feat.update(bsmt_feat_all_nan) # update bsmt_feat df by bsmt_feat_all_nan
df.update(bsmt_feat_all_nan) # update df by bsmt_feat_all_nan

"""
>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6
"""
 
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:5819: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[col] = expressions.where(mask, this, that)
Out[37]:
"\n>>> df = pd.DataFrame({'A': [1, 2, 3],\n...                    'B': [400, 500, 600]})\n>>> new_df = pd.DataFrame({'B': [4, 5, 6],\n...                        'C': [7, 8, 9]})\n>>> df.update(new_df)\n>>> df\n   A  B\n0  1  4\n1  2  5\n2  3  6\n"
In [38]:
bsmt_feat = bsmt_feat[bsmt_feat.isin([np.nan]).any(axis=1)]
bsmt_feat
Out[38]:
  BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF TotalBsmtSF
Id                      
333 TA No 1124.0 479.0 GLQ NaN 1.0 0.0 Gd 1603.0 3206.0
949 TA NaN 0.0 0.0 Unf Unf 0.0 0.0 Gd 936.0 936.0
1488 TA NaN 0.0 0.0 Unf Unf 0.0 0.0 Gd 1595.0 1595.0
2041 NaN Mn 1044.0 382.0 GLQ Rec 1.0 0.0 Gd 0.0 1426.0
2186 NaN No 1033.0 0.0 BLQ Unf 0.0 1.0 TA 94.0 1127.0
2218 Fa No 0.0 0.0 Unf Unf 0.0 0.0 NaN 173.0 173.0
2219 TA No 0.0 0.0 Unf Unf 0.0 0.0 NaN 356.0 356.0
2349 TA NaN 0.0 0.0 Unf Unf 0.0 0.0 Gd 725.0 725.0
2525 NaN Av 755.0 0.0 ALQ Unf 0.0 0.0 TA 240.0 995.0
In [39]:
bsmt_feat.shape
Out[39]:
(9, 11)
In [40]:
print(df['BsmtFinSF2'].max())
print(df['BsmtFinSF2'].min())
 
1526.0
0.0
In [41]:
pd.cut(range(0,1526),5) # create a bucket
Out[41]:
[(-1.525, 305.0], (-1.525, 305.0], (-1.525, 305.0], (-1.525, 305.0], (-1.525, 305.0], ..., (1220.0, 1525.0], (1220.0, 1525.0], (1220.0, 1525.0], (1220.0, 1525.0], (1220.0, 1525.0]]
Length: 1526
Categories (5, interval[float64]): [(-1.525, 305.0] < (305.0, 610.0] < (610.0, 915.0] < (915.0, 1220.0] < (1220.0, 1525.0]]
In [42]:
df_slice = df[(df['BsmtFinSF2'] >= 305) & (df['BsmtFinSF2'] <= 610)]
df_slice
Out[42]:
  1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF CentralAir Condition1 Condition2 Electrical EnclosedPorch ExterCond ExterQual Exterior1st Exterior2nd Fireplaces Foundation FullBath Functional GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType GarageYrBlt GrLivArea HalfBath Heating HeatingQC HouseStyle KitchenAbvGr KitchenQual LandContour LandSlope LotArea LotConfig LotFrontage LotShape LowQualFinSF MSSubClass MSZoning MasVnrArea MasVnrType MiscVal MoSold Neighborhood OpenPorchSF OverallCond OverallQual PavedDrive PoolArea RoofMatl RoofStyle SaleCondition SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
Id                                                                                                                                                    
27 900 0 0 3 1Fam TA Mn 234.0 486.0 BLQ Rec 0.0 1.0 TA 180.0 Y Norm Norm SBrkr 0 TA TA Wd Sdng Wd Sdng 0 CBlock 1 Typ 576.0 2.0 TA Unf TA Detchd 2005.0 900 0 GasA TA 1Story 1 Gd Lvl Gtl 7200 Corner 60.0 Reg 0 20 RL 0.0 None 0 5 NAmes 32 7 5 Y 0 CompShg Gable Normal WD 0 Pave 5 900.0 AllPub 222 1951 2000 2010
44 938 0 0 3 1Fam TA Av 280.0 491.0 LwQ BLQ 1.0 0.0 Gd 167.0 Y Norm Norm SBrkr 0 TA TA VinylSd VinylSd 0 CBlock 1 Typ 308.0 1.0 TA Unf TA Detchd 1977.0 938 0 GasA TA 1Story 1 TA Lvl Gtl 9200 CulDSac NaN IR1 0 20 RL 0.0 None 0 7 CollgCr 0 6 5 Y 0 CompShg Hip Normal WD 0 Pave 5 938.0 AllPub 145 1975 1980 2008
45 1150 0 0 3 1Fam TA No 179.0 506.0 ALQ BLQ 1.0 0.0 TA 465.0 Y Norm Norm FuseA 0 TA TA BrkFace Wd Sdng 0 CBlock 1 Typ 300.0 1.0 TA RFn TA Attchd 1959.0 1150 0 GasA Ex 1Story 1 TA Lvl Gtl 7945 Inside 70.0 Reg 0 20 RL 0.0 None 0 5 NAmes 0 6 5 Y 0 CompShg Gable Normal WD 0 Pave 6 1150.0 AllPub 0 1959 1959 2006
74 1086 0 0 3 1Fam TA No 320.0 362.0 ALQ BLQ 1.0 0.0 TA 404.0 Y Norm Norm SBrkr 0 TA TA Wd Sdng Wd Sdng 0 CBlock 1 Typ 490.0 2.0 TA Unf TA Attchd 1989.0 1086 0 GasA Gd 1Story 1 TA Lvl Gtl 10200 Inside 85.0 Reg 0 20 RL 104.0 BrkFace 0 5 NAmes 0 7 5 Y 0 CompShg Gable Normal WD 0 Pave 6 1086.0 AllPub 0 1954 2003 2010
174 1362 0 0 3 1Fam TA No 288.0 374.0 ALQ Rec 1.0 0.0 TA 700.0 Y Norm Norm SBrkr 0 TA TA WdShing Wd Shng 1 CBlock 1 Typ 504.0 2.0 TA Unf TA Attchd 1961.0 1362 1 GasA TA 1Story 1 TA Lvl Gtl 10197 Inside 80.0 IR1 0 20 RL 491.0 BrkCmn 0 6 NAmes 20 5 6 Y 0 CompShg Gable Normal COD 0 Pave 6 1362.0 AllPub 0 1961 1961 2008
2726 1141 0 0 3 1Fam TA Av 602.0 402.0 ALQ Rec 1.0 0.0 TA 137.0 Y Norm Norm SBrkr 0 TA TA MetalSd MetalSd 0 PConc 1 Typ 568.0 1.0 TA Unf TA Attchd 1967.0 1141 0 GasA Gd SLvl 1 TA Lvl Gtl 9600 Inside 80.0 Reg 0 80 RL 140.0 BrkFace 0 7 NAmes 78 7 5 Y 0 CompShg Gable Normal WD 0 Pave 6 1141.0 AllPub 0 1967 1967 2006
2807 1073 0 0 2 1Fam Gd Mn 510.0 373.0 GLQ LwQ 1.0 0.0 Gd 190.0 Y Norm Norm SBrkr 0 TA TA VinylSd VinylSd 0 PConc 2 Typ 246.0 1.0 TA Unf TA Detchd 2004.0 1073 0 GasA Ex 1Story 1 TA Lvl Gtl 5500 Inside 50.0 Reg 0 20 RL 0.0 None 0 5 SWISU 120 5 7 Y 0 CompShg Shed Normal WD 0 Pave 4 1073.0 AllPub 0 2004 2004 2006
2844 1282 0 0 3 1Fam Gd Av 595.0 400.0 ALQ LwQ 0.0 1.0 TA 0.0 Y Norm Norm SBrkr 0 TA TA HdBoard HdBoard 0 CBlock 2 Typ 672.0 3.0 TA Unf Fa Detchd 1989.0 1282 0 GasA TA SLvl 1 TA Lvl Gtl 10385 CulDSac 42.0 IR1 0 80 RL 123.0 BrkFace 0 4 CollgCr 0 6 6 Y 0 CompShg Gable Normal WD 0 Pave 6 995.0 AllPub 386 1978 1978 2006
2859 760 676 0 3 1Fam TA No 173.0 337.0 Rec BLQ 1.0 0.0 Gd 166.0 Y Feedr Norm SBrkr 0 Gd TA Plywood Plywood 0 CBlock 2 Min1 528.0 2.0 TA Unf TA Attchd 1950.0 1436 0 GasA Gd 2Story 1 TA Bnk Gtl 8777 Inside 67.0 Reg 0 70 RL 0.0 None 420 10 Edwards 0 6 4 Y 0 CompShg Gable Normal WD 0 Pave 6 676.0 AllPub 147 1910 2000 2006
2912 1360 0 0 3 1Fam TA Av 119.0 344.0 Rec BLQ 1.0 0.0 TA 641.0 Y Norm Norm SBrkr 0 TA TA Plywood Plywood 1 PConc 1 Typ 336.0 1.0 TA RFn TA Attchd 1969.0 1360 0 GasA Fa 1Story 1 TA Lvl Mod 13384 Inside 80.0 Reg 0 20 RL 194.0 BrkFace 0 5 Mitchel 0 5 5 Y 0 CompShg Gable Normal WD 0 Pave 8 1104.0 AllPub 160 1969 1979 2006

107 rows × 74 columns

In [43]:
bsmt_feat.at[333,'BsmtFinType2'] = df_slice['BsmtFinType2'].mode()[0] # replace NAN value of BsmtFinType2 by mode of buet ((305.0, 610.0)
In [44]:
bsmt_feat
Out[44]:
  BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 BsmtFinType2 BsmtFullBath BsmtHalfBath BsmtQual BsmtUnfSF TotalBsmtSF
Id                      
333 TA No 1124.0 479.0 GLQ Rec 1.0 0.0 Gd 1603.0 3206.0
949 TA NaN 0.0 0.0 Unf Unf 0.0 0.0 Gd 936.0 936.0
1488 TA NaN 0.0 0.0 Unf Unf 0.0 0.0 Gd 1595.0 1595.0
2041 NaN Mn 1044.0 382.0 GLQ Rec 1.0 0.0 Gd 0.0 1426.0
2186 NaN No 1033.0 0.0 BLQ Unf 0.0 1.0 TA 94.0 1127.0
2218 Fa No 0.0 0.0 Unf Unf 0.0 0.0 NaN 173.0 173.0
2219 TA No 0.0 0.0 Unf Unf 0.0 0.0 NaN 356.0 356.0
2349 TA NaN 0.0 0.0 Unf Unf 0.0 0.0 Gd 725.0 725.0
2525 NaN Av 755.0 0.0 ALQ Unf 0.0 0.0 TA 240.0 995.0
In [ ]:
 
In [45]:
bsmt_feat['BsmtExposure'] = bsmt_feat['BsmtExposure'].replace(np.nan, df[df['BsmtQual'] =='Gd']['BsmtExposure'].mode()[0])
 
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [46]:
bsmt_feat['BsmtCond'] = bsmt_feat['BsmtCond'].replace(np.nan, df['BsmtCond'].mode()[0])
bsmt_feat['BsmtQual'] = bsmt_feat['BsmtQual'].replace(np.nan, df['BsmtQual'].mode()[0])
 
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [47]:
df.update(bsmt_feat)
In [48]:
bsmt_feat.isnull().sum()
Out[48]:
BsmtCond        0
BsmtExposure    0
BsmtFinSF1      0
BsmtFinSF2      0
BsmtFinType1    0
BsmtFinType2    0
BsmtFullBath    0
BsmtHalfBath    0
BsmtQual        0
BsmtUnfSF       0
TotalBsmtSF     0
dtype: int64
 

Handling missing value of Garage feature

In [49]:
df.columns[df.isnull().any()]
Out[49]:
Index(['Electrical', 'Exterior1st', 'Exterior2nd', 'Functional', 'GarageArea',
       'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType',
       'GarageYrBlt', 'KitchenQual', 'LotFrontage', 'MSZoning', 'MasVnrArea',
       'MasVnrType', 'SaleType', 'Utilities'],
      dtype='object')
In [50]:
garage_col = ['GarageArea', 'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt',]
garage_feat = df[garage_col]
garage_feat = garage_feat[garage_feat.isnull().any(axis=1)]
garage_feat
Out[50]:
  GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType GarageYrBlt
Id              
40 0.0 0.0 NaN NaN NaN NaN NaN
49 0.0 0.0 NaN NaN NaN NaN NaN
79 0.0 0.0 NaN NaN NaN NaN NaN
89 0.0 0.0 NaN NaN NaN NaN NaN
90 0.0 0.0 NaN NaN NaN NaN NaN
2894 0.0 0.0 NaN NaN NaN NaN NaN
2910 0.0 0.0 NaN NaN NaN NaN NaN
2914 0.0 0.0 NaN NaN NaN NaN NaN
2915 0.0 0.0 NaN NaN NaN NaN NaN
2918 0.0 0.0 NaN NaN NaN NaN NaN

159 rows × 7 columns

In [51]:
garage_feat.shape
Out[51]:
(159, 7)
In [52]:
garage_feat_all_nan = garage_feat[(garage_feat.isnull() | garage_feat.isin([0])).all(1)]
garage_feat_all_nan.shape
Out[52]:
(157, 7)
In [53]:
for i in garage_feat:
    if i in qual:
        garage_feat_all_nan[i] = garage_feat_all_nan[i].replace(np.nan, 'NA')
    else:
        garage_feat_all_nan[i] = garage_feat_all_nan[i].replace(np.nan, 0)
        
garage_feat.update(garage_feat_all_nan)
df.update(garage_feat_all_nan)
 
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
In [54]:
garage_feat = garage_feat[garage_feat.isnull().any(axis=1)]
garage_feat
Out[54]:
  GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType GarageYrBlt
Id              
2127 360.0 1.0 NaN NaN NaN Detchd NaN
2577 NaN NaN NaN NaN NaN Detchd NaN
In [55]:
for i in garage_col:
    garage_feat[i] = garage_feat[i].replace(np.nan, df[df['GarageType'] == 'Detchd'][i].mode()[0])
 
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [56]:
garage_feat.isnull().any()
Out[56]:
GarageArea      False
GarageCars      False
GarageCond      False
GarageFinish    False
GarageQual      False
GarageType      False
GarageYrBlt     False
dtype: bool
In [57]:
df.update(garage_feat)
 

Handling missing value of remain feature

In [58]:
df.columns[df.isnull().any()]
Out[58]:
Index(['Electrical', 'Exterior1st', 'Exterior2nd', 'Functional', 'KitchenQual',
       'LotFrontage', 'MSZoning', 'MasVnrArea', 'MasVnrType', 'SaleType',
       'Utilities'],
      dtype='object')
In [59]:
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
df['Exterior1st'] = df['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
df['Exterior2nd'] = df['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])
df['Functional'] = df['Functional'].fillna(df['Functional'].mode()[0])
df['KitchenQual'] = df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])
df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])
df['SaleType'] = df['SaleType'].fillna(df['SaleType'].mode()[0])
df['Utilities'] = df['Utilities'].fillna(df['Utilities'].mode()[0])
df['MasVnrType'] = df['MasVnrType'].fillna(df['MasVnrType'].mode()[0])
In [60]:
df.columns[df.isnull().any()]
Out[60]:
Index(['LotFrontage', 'MasVnrArea'], dtype='object')
In [61]:
df[df['MasVnrArea'].isnull() == True]['MasVnrType'].unique()
Out[61]:
array(['None'], dtype=object)
In [62]:
df.loc[(df['MasVnrType'] == 'None') & (df['MasVnrArea'].isnull() == True), 'MasVnrArea'] = 0
In [63]:
df.isnull().sum()/df.shape[0] * 100
Out[63]:
1stFlrSF          0.000000
2ndFlrSF          0.000000
3SsnPorch         0.000000
BedroomAbvGr      0.000000
BldgType          0.000000
BsmtCond          0.000000
BsmtExposure      0.000000
BsmtFinSF1        0.000000
BsmtFinSF2        0.000000
BsmtFinType1      0.000000
BsmtFinType2      0.000000
BsmtFullBath      0.000000
BsmtHalfBath      0.000000
BsmtQual          0.000000
BsmtUnfSF         0.000000
CentralAir        0.000000
Condition1        0.000000
Condition2        0.000000
Electrical        0.000000
EnclosedPorch     0.000000
ExterCond         0.000000
ExterQual         0.000000
Exterior1st       0.000000
Exterior2nd       0.000000
Fireplaces        0.000000
Foundation        0.000000
FullBath          0.000000
Functional        0.000000
GarageArea        0.000000
GarageCars        0.000000
GarageCond        0.000000
GarageFinish      0.000000
GarageQual        0.000000
GarageType        0.000000
GarageYrBlt       0.000000
GrLivArea         0.000000
HalfBath          0.000000
Heating           0.000000
HeatingQC         0.000000
HouseStyle        0.000000
KitchenAbvGr      0.000000
KitchenQual       0.000000
LandContour       0.000000
LandSlope         0.000000
LotArea           0.000000
LotConfig         0.000000
LotFrontage      16.649538
LotShape          0.000000
LowQualFinSF      0.000000
MSSubClass        0.000000
MSZoning          0.000000
MasVnrArea        0.000000
MasVnrType        0.000000
MiscVal           0.000000
MoSold            0.000000
Neighborhood      0.000000
OpenPorchSF       0.000000
OverallCond       0.000000
OverallQual       0.000000
PavedDrive        0.000000
PoolArea          0.000000
RoofMatl          0.000000
RoofStyle         0.000000
SaleCondition     0.000000
SaleType          0.000000
ScreenPorch       0.000000
Street            0.000000
TotRmsAbvGrd      0.000000
TotalBsmtSF       0.000000
Utilities         0.000000
WoodDeckSF        0.000000
YearBuilt         0.000000
YearRemodAdd      0.000000
YrSold            0.000000
dtype: float64
 

Handling missing value of LotFrontage feature

In [64]:
lotconfig = ['Corner', 'Inside', 'CulDSac', 'FR2', 'FR3']
for i in lotconfig:
    df['LotFrontage'] = pd.np.where((df['LotFrontage'].isnull() == True) & (df['LotConfig'] == i) , df[df['LotConfig'] == i] ['LotFrontage'].mean(), df['LotFrontage'])
In [65]:
df.isnull().sum()
Out[65]:
1stFlrSF         0
2ndFlrSF         0
3SsnPorch        0
BedroomAbvGr     0
BldgType         0
BsmtCond         0
BsmtExposure     0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtFinType1     0
BsmtFinType2     0
BsmtFullBath     0
BsmtHalfBath     0
BsmtQual         0
BsmtUnfSF        0
CentralAir       0
Condition1       0
Condition2       0
Electrical       0
EnclosedPorch    0
ExterCond        0
ExterQual        0
Exterior1st      0
Exterior2nd      0
Fireplaces       0
Foundation       0
FullBath         0
Functional       0
GarageArea       0
GarageCars       0
GarageCond       0
GarageFinish     0
GarageQual       0
GarageType       0
GarageYrBlt      0
GrLivArea        0
HalfBath         0
Heating          0
HeatingQC        0
HouseStyle       0
KitchenAbvGr     0
KitchenQual      0
LandContour      0
LandSlope        0
LotArea          0
LotConfig        0
LotFrontage      0
LotShape         0
LowQualFinSF     0
MSSubClass       0
MSZoning         0
MasVnrArea       0
MasVnrType       0
MiscVal          0
MoSold           0
Neighborhood     0
OpenPorchSF      0
OverallCond      0
OverallQual      0
PavedDrive       0
PoolArea         0
RoofMatl         0
RoofStyle        0
SaleCondition    0
SaleType         0
ScreenPorch      0
Street           0
TotRmsAbvGrd     0
TotalBsmtSF      0
Utilities        0
WoodDeckSF       0
YearBuilt        0
YearRemodAdd     0
YrSold           0
dtype: int64
 

Feature Transformation

In [66]:
df.columns
Out[66]:
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BedroomAbvGr', 'BldgType',
       'BsmtCond', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType1',
       'BsmtFinType2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF',
       'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'EnclosedPorch',
       'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fireplaces',
       'Foundation', 'FullBath', 'Functional', 'GarageArea', 'GarageCars',
       'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt',
       'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle',
       'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea',
       'LotConfig', 'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass',
       'MSZoning', 'MasVnrArea', 'MasVnrType', 'MiscVal', 'MoSold',
       'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', 'PoolArea', 'RoofMatl', 'RoofStyle', 'SaleCondition',
       'SaleType', 'ScreenPorch', 'Street', 'TotRmsAbvGrd', 'TotalBsmtSF',
       'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold'],
      dtype='object')
In [67]:
# converting columns in str which have categorical nature but in int64
feat_dtype_convert = ['MSSubClass', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
for i in feat_dtype_convert:
    df[i] = df[i].astype(str)
In [68]:
df['MoSold'].unique() # MoSold = Month of sold
Out[68]:
array([ 2,  5,  9, 12, 10,  8, 11,  4,  1,  7,  3,  6], dtype=int64)
In [69]:
# conver in month abbrevation
import calendar
df['MoSold'] = df['MoSold'].apply(lambda x : calendar.month_abbr[x])
In [70]:
df['MoSold'].unique()
Out[70]:
array(['Feb', 'May', 'Sep', 'Dec', 'Oct', 'Aug', 'Nov', 'Apr', 'Jan',
       'Jul', 'Mar', 'Jun'], dtype=object)
In [71]:
quan = list(df.loc[:, df.dtypes != 'object'].columns.values)
In [72]:
quan
Out[72]:
['1stFlrSF',
 '2ndFlrSF',
 '3SsnPorch',
 'BedroomAbvGr',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtFullBath',
 'BsmtHalfBath',
 'BsmtUnfSF',
 'EnclosedPorch',
 'Fireplaces',
 'FullBath',
 'GarageArea',
 'GarageCars',
 'GrLivArea',
 'HalfBath',
 'KitchenAbvGr',
 'LotArea',
 'LotFrontage',
 'LowQualFinSF',
 'MasVnrArea',
 'MiscVal',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'PoolArea',
 'ScreenPorch',
 'TotRmsAbvGrd',
 'TotalBsmtSF',
 'WoodDeckSF']
In [73]:
len(quan)
Out[73]:
30
In [74]:
obj_feat = list(df.loc[:, df.dtypes == 'object'].columns.values)
obj_feat
Out[74]:
['BldgType',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'BsmtQual',
 'CentralAir',
 'Condition1',
 'Condition2',
 'Electrical',
 'ExterCond',
 'ExterQual',
 'Exterior1st',
 'Exterior2nd',
 'Foundation',
 'Functional',
 'GarageCond',
 'GarageFinish',
 'GarageQual',
 'GarageType',
 'GarageYrBlt',
 'Heating',
 'HeatingQC',
 'HouseStyle',
 'KitchenQual',
 'LandContour',
 'LandSlope',
 'LotConfig',
 'LotShape',
 'MSSubClass',
 'MSZoning',
 'MasVnrType',
 'MoSold',
 'Neighborhood',
 'PavedDrive',
 'RoofMatl',
 'RoofStyle',
 'SaleCondition',
 'SaleType',
 'Street',
 'Utilities',
 'YearBuilt',
 'YearRemodAdd',
 'YrSold']
 

Conver categorical code into order

In [75]:
from pandas.api.types import CategoricalDtype
df['BsmtCond'] = df['BsmtCond'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
In [76]:
df['BsmtCond'].unique()
Out[76]:
array([3, 4, 0, 2, 1], dtype=int64)
In [77]:
df['BsmtExposure'] = df['BsmtExposure'].astype(CategoricalDtype(categories=['NA', 'Mn', 'Av', 'Gd'], ordered = True)).cat.codes
In [78]:
df['BsmtExposure'].unique()
Out[78]:
array([-1,  3,  1,  2,  0], dtype=int64)
In [79]:
df['BsmtFinType1'] = df['BsmtFinType1'].astype(CategoricalDtype(categories=['NA', 'Unf', 'LwQ', 'Rec', 'BLQ','ALQ', 'GLQ'], ordered = True)).cat.codes
df['BsmtFinType2'] = df['BsmtFinType2'].astype(CategoricalDtype(categories=['NA', 'Unf', 'LwQ', 'Rec', 'BLQ','ALQ', 'GLQ'], ordered = True)).cat.codes
df['BsmtQual'] = df['BsmtQual'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['ExterQual'] = df['ExterQual'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['ExterCond'] = df['ExterCond'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['Functional'] = df['Functional'].astype(CategoricalDtype(categories=['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod','Min2','Min1', 'Typ'], ordered = True)).cat.codes
df['GarageCond'] = df['GarageCond'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['GarageQual'] = df['GarageQual'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['GarageFinish'] = df['GarageFinish'].astype(CategoricalDtype(categories=['NA', 'Unf', 'RFn', 'Fin'], ordered = True)).cat.codes
df['HeatingQC'] = df['HeatingQC'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['KitchenQual'] = df['KitchenQual'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df['PavedDrive'] = df['PavedDrive'].astype(CategoricalDtype(categories=['N', 'P', 'Y'], ordered = True)).cat.codes
df['Utilities'] = df['Utilities'].astype(CategoricalDtype(categories=['ELO', 'NASeWa', 'NASeWr', 'AllPub'], ordered = True)).cat.codes
In [80]:
df['Utilities'].unique()
Out[80]:
array([ 3, -1], dtype=int64)
 

Show skewness of feature with distplot

In [81]:
skewed_features = ['1stFlrSF',
 '2ndFlrSF',
 '3SsnPorch',
 'BedroomAbvGr',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtFullBath',
 'BsmtHalfBath',
 'BsmtUnfSF',
 'EnclosedPorch',
 'Fireplaces',
 'FullBath',
 'GarageArea',
 'GarageCars',
 'GrLivArea',
 'HalfBath',
 'KitchenAbvGr',
 'LotArea',
 'LotFrontage',
 'LowQualFinSF',
 'MasVnrArea',
 'MiscVal',
 'OpenPorchSF',
 'PoolArea',
 'ScreenPorch',
 'TotRmsAbvGrd',
 'TotalBsmtSF',
 'WoodDeckSF']
In [82]:
quan == skewed_features
Out[82]:
False
In [83]:
plt.figure(figsize=(25,20))
for i in range(len(skewed_features)):
    if i <= 28:
        plt.subplot(7,4,i+1)
        plt.subplots_adjust(hspace = 0.5, wspace = 0.5)
        ax = sns.distplot(df[skewed_features[i]])
        ax.legend(["Skewness: {:.2f}".format(df[skewed_features[i]].skew())], fontsize = 'xx-large')
 
In [84]:
df_back = df
In [85]:
# decrease the skewnwnes of the data
for i in skewed_features:
    df[i] = np.log(df[i] + 1)
In [86]:
plt.figure(figsize=(25,20))
for i in range(len(skewed_features)):
    if i <= 28:
        plt.subplot(7,4,i+1)
        plt.subplots_adjust(hspace = 0.5, wspace = 0.5)
        ax = sns.distplot(df[skewed_features[i]])
        ax.legend(["Skewness: {:.2f}".format(df[skewed_features[i]].skew())], fontsize = 'xx-large')
 
In [87]:
SalePrice = np.log(train['SalePrice'] + 1)
In [88]:
# get object feature to conver in numeric using dummy variable
obj_feat = list(df.loc[:,df.dtypes == 'object'].columns.values)
len(obj_feat)
Out[88]:
29
In [89]:
# dummy varaibale
dummy_drop = []
clean_df = df
for i in obj_feat:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]

df = pd.get_dummies(df, columns = obj_feat)
df = df.drop(dummy_drop, axis = 1)
In [90]:
df.shape
Out[90]:
(2919, 500)
In [91]:
#sns.pairplot(df)
In [92]:
# scaling dataset with robust scaler
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(df)
df = scaler.transform(df)
 

Machine Learning Model Building

In [93]:
train_len = len(train)
In [94]:
X_train = df[:train_len]
X_test = df[train_len:]
y_train = SalePrice

print(X_train.shape)
print(X_test.shape)
print(len(y_train))
 
(1460, 500)
(1459, 500)
1460
 

Cross Validation

In [95]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, r2_score

def test_model(model, X_train=X_train, y_train=y_train):
    cv = KFold(n_splits = 3, shuffle=True, random_state = 45)
    r2 = make_scorer(r2_score)
    r2_val_score = cross_val_score(model, X_train, y_train, cv=cv, scoring = r2)
    score = [r2_val_score.mean()]
    return score
 

Linear Regression

In [96]:
import sklearn.linear_model as linear_model
LR = linear_model.LinearRegression()
test_model(LR)
Out[96]:
[-4.499253758245961e+19]
In [97]:
# Cross validation
cross_validation = cross_val_score(estimator = LR, X = X_train, y = y_train, cv = 10)
print("Cross validation accuracy of LR model = ", cross_validation)
print("\nCross validation mean accuracy of LR model = ", cross_validation.mean())
 
Cross validation accuracy of XGBoost model =  [-3.59049263e+18 -2.69794256e+16 -5.01430840e+20 -1.24195688e+20
 -1.56157918e+20 -3.80303041e+20 -6.92624737e+20 -1.81535501e+20
 -1.18431954e+19 -8.96500637e+19]

Cross validation mean accuracy of XGBoost model =  -2.1413584565212335e+20
In [98]:
rdg = linear_model.Ridge()
test_model(rdg)
Out[98]:
[0.8646898178967032]
In [99]:
lasso = linear_model.Lasso(alpha=1e-4)
test_model(lasso)
Out[99]:
[0.8677128206058571]
 

Fitting Polynomial Regression to the dataset

from sklearn.preprocessing import PolynomialFeatures poly_reg = PolynomialFeatures(degree = 2) X_poly = poly_reg.fit_transform(X_train) poly_reg.fit(X_poly, y_train) lin_reg_2 = LinearRegression()

lin_reg_2.fit(X_poly, y_train)

test_model(lin_reg_2,X_poly)

 

import sklearn.linear_model as linear_model lin_reg_2 = linear_model.LinearRegression()

lin_reg_2.fit(X_poly, y_train)

test_model(lin_reg_2,X_poly)

 

Support Vector Machine

In [100]:
from sklearn.svm import SVR
svr_reg = SVR(kernel='rbf')
test_model(svr_reg)
 
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[100]:
[0.8897490696206058]
 

Decision Tree Regressor

In [101]:
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(random_state=21)
test_model(dt_reg)
Out[101]:
[0.6977699373506714]
 

Random Forest Regressor

In [102]:
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators = 1000, random_state=51)
test_model(rf_reg)
Out[102]:
[0.8562626036810235]
 

Bagging & boosting

In [103]:
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
br_reg = BaggingRegressor(n_estimators=1000, random_state=51)
gbr_reg = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, loss='ls', random_state=51)
In [104]:
test_model(br_reg)
Out[104]:
[0.8566634227077645]
In [105]:
test_model(gbr_reg)
Out[105]:
[0.8814693894754249]
 

XGBoost

In [106]:
import xgboost
#xgb_reg=xgboost.XGBRegressor()
xgb_reg = xgboost.XGBRegressor(bbooster='gbtree', random_state=51)
test_model(xgb_reg)
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:12:43] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:12:45] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:12:46] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[106]:
[0.8841700820661896]
 

SVM Model Bulding

In [107]:
svr_reg.fit(X_train,y_train)
y_pred = np.exp(svr_reg.predict(X_test)).round(2)
 
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [108]:
y_pred
Out[108]:
array([116235.81, 159145.5 , 184603.73, ..., 175995.91, 115960.91,
       228554.78])
In [109]:
submit_test1 = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test1.columns=['Id', 'SalePrice']
In [110]:
submit_test1
Out[110]:
  Id SalePrice
0 1461 116235.81
1 1462 159145.50
2 1463 184603.73
3 1464 193132.57
4 1465 187963.37
1454 2915 89542.75
1455 2916 86121.81
1456 2917 175995.91
1457 2918 115960.91
1458 2919 228554.78

1459 rows × 2 columns

In [111]:
submit_test1.to_csv('sample_submission.csv', index=False )
 

SVM Model Bulding Hyperparameter Tuning

 

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV params = {‘kernel’: [‘linear’, ‘rbf’, ‘sigmoid’], ‘gamma’: [1, 0.1, 0.01, 0.001, 0.0001], ‘C’: [0.1, 1, 10, 100, 1000], ‘epsilon’: [1, 0.2, 0.1, 0.01, 0.001, 0.0001]}

 

rand_search = RandomizedSearchCV(svr_reg, param_distributions=params, n_jobs=-1, cv=11) rand_search.fit(X_train, y_train) rand_search.bestparams

In [112]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
params = {'kernel': ['rbf'],
         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
         'C': [0.1, 1, 10, 100, 1000],
         'epsilon': [1, 0.2, 0.1, 0.01, 0.001, 0.0001]}
rand_search = RandomizedSearchCV(svr_reg, param_distributions=params, n_jobs=-1, cv=11)
rand_search.fit(X_train, y_train)
rand_search.best_score_
Out[112]:
0.8931459336116102
In [113]:
svr_reg= SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.0001,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
test_model(svr_reg)
Out[113]:
[0.8937335862549801]
In [114]:
svr_reg.fit(X_train,y_train)
y_pred = np.exp(svr_reg.predict(X_test)).round(2)
In [115]:
y_pred
Out[115]:
array([113161.6 , 161976.13, 183930.61, ..., 175456.87, 118566.68,
       213315.75])
In [116]:
submit_test3 = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test3.columns=['Id', 'SalePrice']
In [117]:
submit_test3.to_csv('sample_submission.csv', index=False)
submit_test3
Out[117]:
  Id SalePrice
0 1461 113161.60
1 1462 161976.13
2 1463 183930.61
3 1464 194015.66
4 1465 189543.69
1454 2915 89490.38
1455 2916 80618.90
1456 2917 175456.87
1457 2918 118566.68
1458 2919 213315.75

1459 rows × 2 columns

 

Name Submitted Wait time Execution time Score sample_submission.csv 3 days ago 0 seconds 0 seconds 0.12612

 

XGBoost parameter tuning

 

xgb2_reg = xgboost.XGBRegressor() params_xgb = { ‘max_depth’: range(2, 20, 2), ‘n_estimators’: range(99, 2001, 80), ‘learning_rate’: [0.2, 0.1, 0.01, 0.05], ‘booster’: [‘gbtree’], ‘mon_child_weight’: range(1, 8, 1) } rand_search_xgb = RandomizedSearchCV(estimator = xgb2_reg, param_distributions=params_xgb, n_iter=100, n_jobs=-1, cv=11, verbose=11, random_state=51, return_train_score =True, scoring=’neg_mean_absolute_error’) rand_search_xgb.fit(X_train,y_train)

 

rand_search_xgb.bestscore

 

rand_search_xgb.bestparams

In [118]:
xgb2_reg=xgboost.XGBRegressor(n_estimators= 899,
 mon_child_weight= 2,
 max_depth= 4,
 learning_rate= 0.05,
 booster= 'gbtree')

test_model(xgb2_reg)
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:13:53] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:14:09] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:14:25] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[118]:
[0.8899316609591396]
In [119]:
xgb2_reg.fit(X_train,y_train)
y_pred_xgb_rs=xgb2_reg.predict(X_test)
 
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
 
[18:14:42] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [120]:
np.exp(y_pred_xgb_rs).round(2)
Out[120]:
array([123535.19, 169676.48, 190203.95, ..., 154335.52, 118554.99,
       211244.77], dtype=float32)
In [121]:
y_pred_xgb_rs = np.exp(xgb2_reg.predict(X_test)).round(2)
xgb_rs_solution = pd.concat([test['Id'], pd.DataFrame(y_pred_xgb_rs)], axis=1)
xgb_rs_solution.columns=['Id', 'SalePrice']
xgb_rs_solution.to_csv('sample_submission.csv', index=False)
In [122]:
xgb_rs_solution
Out[122]:
  Id SalePrice
0 1461 123535.187500
1 1462 169676.484375
2 1463 190203.953125
3 1464 192006.062500
4 1465 191240.359375
1454 2915 81919.148438
1455 2916 81223.296875
1456 2917 154335.515625
1457 2918 118554.992188
1458 2919 211244.765625

1459 rows × 2 columns

 

1603 0.12484 2 1d Your Best Entry Your submission scored 0.12484, which is an improvement of your previous score of 0.12612. Great job! Tweet this!

 

Feature Engineering / Selection to improve accuracy

In [123]:
# correlation Barplot
plt.figure(figsize=(9,16))
corr_feat_series = pd.Series.sort_values(train.corrwith(train.SalePrice))
sns.barplot(x=corr_feat_series, y=corr_feat_series.index, orient='h')
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bb01fb41d0>
 
In [124]:
df_back1 = df_back
In [127]:
df_back1.to_csv('df_for_feature_engineering.csv', index=False)
In [130]:
list(corr_feat_series.index)
Out[130]:
['KitchenAbvGr',
 'EnclosedPorch',
 'MSSubClass',
 'OverallCond',
 'YrSold',
 'LowQualFinSF',
 'Id',
 'MiscVal',
 'BsmtHalfBath',
 'BsmtFinSF2',
 '3SsnPorch',
 'MoSold',
 'PoolArea',
 'ScreenPorch',
 'BedroomAbvGr',
 'BsmtUnfSF',
 'BsmtFullBath',
 'LotArea',
 'HalfBath',
 'OpenPorchSF',
 '2ndFlrSF',
 'WoodDeckSF',
 'LotFrontage',
 'BsmtFinSF1',
 'Fireplaces',
 'MasVnrArea',
 'GarageYrBlt',
 'YearRemodAdd',
 'YearBuilt',
 'TotRmsAbvGrd',
 'FullBath',
 '1stFlrSF',
 'TotalBsmtSF',
 'GarageArea',
 'GarageCars',
 'GrLivArea',
 'OverallQual',
 'SalePrice']
In [ ]:
 
v1-fs-house-prices-advanced-regression-techniques-feature_selection-v1

House Prices: Advanced Regression Techniques

Feature Selection / Engineering

Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv('df_for_feature_engineering.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
df
Out[2]:
1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 BsmtFinType1 SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 6.753438 6.751101 0.0 1.386294 1Fam 3 -1 6.561031 0.0 6 WD 0.0 Pave 2.197225 6.753438 3 0.000000 2003 2003 2008
1 7.141245 0.000000 0.0 1.386294 1Fam 3 3 6.886532 0.0 5 WD 0.0 Pave 1.945910 7.141245 3 5.700444 1976 1976 2007
2 6.825460 6.765039 0.0 1.386294 1Fam 3 1 6.188264 0.0 6 WD 0.0 Pave 1.945910 6.825460 3 0.000000 2001 2002 2008
3 6.869014 6.629363 0.0 1.386294 1Fam 4 -1 5.379897 0.0 5 WD 0.0 Pave 2.079442 6.629363 3 0.000000 1915 1970 2006
4 7.044033 6.960348 0.0 1.609438 1Fam 3 2 6.486161 0.0 6 WD 0.0 Pave 2.302585 7.044033 3 5.262690 2000 2000 2008
2914 6.304449 6.304449 0.0 1.386294 Twnhs 3 -1 0.000000 0.0 1 WD 0.0 Pave 1.791759 6.304449 3 0.000000 1970 1970 2006
2915 6.304449 6.304449 0.0 1.386294 TwnhsE 3 -1 5.533389 0.0 3 WD 0.0 Pave 1.945910 6.304449 3 0.000000 1970 1970 2006
2916 7.110696 0.000000 0.0 1.609438 1Fam 3 -1 7.110696 0.0 5 WD 0.0 Pave 2.079442 7.110696 3 6.163315 1960 1996 2006
2917 6.878326 0.000000 0.0 1.386294 1Fam 3 2 5.823046 0.0 6 WD 0.0 Pave 1.945910 6.816736 3 4.394449 1992 1992 2006
2918 6.904751 6.912743 0.0 1.386294 1Fam 3 2 6.632002 0.0 2 WD 0.0 Pave 2.302585 6.904751 3 5.252273 1993 1994 2006

2919 rows × 74 columns

In [3]:
#df = df.set_index('Id')

Drop feature

In [4]:
df = df.drop(['YrSold',
 'LowQualFinSF',
 'MiscVal',
 'BsmtHalfBath',
 'BsmtFinSF2',
 '3SsnPorch',
 'MoSold'],axis=1)
In [5]:
quan = list(df.loc[:,df.dtypes != 'object'].columns.values)
quan
Out[5]:
['1stFlrSF',
 '2ndFlrSF',
 'BedroomAbvGr',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinType1',
 'BsmtFinType2',
 'BsmtFullBath',
 'BsmtQual',
 'BsmtUnfSF',
 'EnclosedPorch',
 'ExterCond',
 'ExterQual',
 'Fireplaces',
 'FullBath',
 'Functional',
 'GarageArea',
 'GarageCars',
 'GarageCond',
 'GarageFinish',
 'GarageQual',
 'GarageYrBlt',
 'GrLivArea',
 'HalfBath',
 'HeatingQC',
 'KitchenAbvGr',
 'KitchenQual',
 'LotArea',
 'LotFrontage',
 'MSSubClass',
 'MasVnrArea',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'PavedDrive',
 'PoolArea',
 'ScreenPorch',
 'TotRmsAbvGrd',
 'TotalBsmtSF',
 'Utilities',
 'WoodDeckSF',
 'YearBuilt',
 'YearRemodAdd']
In [6]:
skewd_feat = ['1stFlrSF',
 '2ndFlrSF',
 'BedroomAbvGr',
 'BsmtFinSF1',
 'BsmtFullBath',
 'BsmtUnfSF',
 'EnclosedPorch',
 'Fireplaces',
 'FullBath',
 'GarageArea',
 'GarageCars',
 'GrLivArea',
 'HalfBath',
 'KitchenAbvGr',
 'LotArea',
 'LotFrontage',
 'MasVnrArea',
 'OpenPorchSF',
 'PoolArea',
 'ScreenPorch',
 'TotRmsAbvGrd',
 'TotalBsmtSF',
 'WoodDeckSF']
#  '3SsnPorch',  'BsmtFinSF2',  'BsmtHalfBath',  'LowQualFinSF', 'MiscVal'
In [7]:
# Decrease the skewness of the data
for i in skewd_feat:
    df[i] = np.log(df[i] + 1)
    
SalePrice = np.log(train['SalePrice'] + 1)

decrease the skewnwnes of the data

for i in skewed_features: df[i] = np.log(df[i] + 1)

In [8]:
df
Out[8]:
1stFlrSF 2ndFlrSF BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinType1 BsmtFinType2 BsmtFullBath SaleCondition SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd
0 2.048136 2.047835 0.869742 1Fam 3 -1 2.023008 6 1 0.526589 Normal WD 0.0 Pave 1.162283 2.048136 3 0.000000 2003 2003
1 2.096943 0.000000 0.869742 1Fam 3 3 2.065156 5 1 0.000000 Normal WD 0.0 Pave 1.080418 2.096943 3 1.902174 1976 1976
2 2.057383 2.049631 0.869742 1Fam 3 1 1.972450 6 1 0.526589 Normal WD 0.0 Pave 1.080418 2.057383 3 0.000000 2001 2002
3 2.062933 2.032004 0.869742 1Fam 4 -1 1.853152 5 1 0.526589 Abnorml WD 0.0 Pave 1.124748 2.032004 3 0.000000 1915 1970
4 2.084931 2.074473 0.959135 1Fam 3 2 2.013056 6 1 0.526589 Normal WD 0.0 Pave 1.194706 2.084931 3 1.834610 2000 2000
2914 1.988484 1.988484 0.869742 Twnhs 3 -1 0.000000 1 1 0.000000 Normal WD 0.0 Pave 1.026672 1.988484 3 0.000000 1970 1970
2915 1.988484 1.988484 0.869742 TwnhsE 3 -1 1.876926 3 1 0.000000 Abnorml WD 0.0 Pave 1.080418 1.988484 3 0.000000 1970 1970
2916 2.093184 0.000000 0.959135 1Fam 3 -1 2.093184 5 1 0.526589 Abnorml WD 0.0 Pave 1.124748 2.093184 3 1.968973 1960 1996
2917 2.064116 0.000000 0.869742 1Fam 3 2 1.920306 6 1 0.000000 Normal WD 0.0 Pave 1.080418 2.056267 3 1.685370 1992 1992
2918 2.067464 2.068474 0.869742 1Fam 3 2 2.032350 2 1 0.000000 Normal WD 0.0 Pave 1.194706 2.067464 3 1.832945 1993 1994

2919 rows × 67 columns

In [9]:
obj_feat = list(df.loc[:, df.dtypes == 'object'].columns.values)
print(len(obj_feat))

obj_feat
23
Out[9]:
['BldgType',
 'CentralAir',
 'Condition1',
 'Condition2',
 'Electrical',
 'Exterior1st',
 'Exterior2nd',
 'Foundation',
 'GarageType',
 'Heating',
 'HouseStyle',
 'LandContour',
 'LandSlope',
 'LotConfig',
 'LotShape',
 'MSZoning',
 'MasVnrType',
 'Neighborhood',
 'RoofMatl',
 'RoofStyle',
 'SaleCondition',
 'SaleType',
 'Street']
In [10]:
# dummy varaibale
dummy_drop = []
for i in obj_feat:
    dummy_drop += [i + '_' + str(df[i].unique()[-1])]

df = pd.get_dummies(df, columns = obj_feat)
df = df.drop(dummy_drop, axis = 1)
In [11]:
df.shape
Out[11]:
(2919, 188)
In [12]:
# scaling dataset with robust scaler
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(df)
df = scaler.transform(df)

Model Bulding

In [14]:
train_len = len(train)
X_train = df[:train_len]
X_test = df[train_len:]
y_train = SalePrice

print("Shape of X_train: ", len(X_train))
print("Shape of X_test: ", len(X_test))
print("Shape of y_train: ", len(y_train))
Shape of X_train:  1460
Shape of X_test:  1459
Shape of y_train:  1460

Cross Validation

In [15]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer, r2_score

def test_model(model, X_train=X_train, y_train=y_train):
    cv = KFold(n_splits = 3, shuffle=True, random_state = 45)
    r2 = make_scorer(r2_score)
    r2_val_score = cross_val_score(model, X_train, y_train, cv=cv, scoring = r2)
    score = [r2_val_score.mean()]
    return score
In [16]:
# first cross validation with df with log second without log

Linear Model

In [17]:
import sklearn.linear_model as linear_model
LR = linear_model.LinearRegression()
test_model(LR)
Out[17]:
[0.854658732209586]
In [18]:
rdg = linear_model.Ridge()
test_model(rdg)
Out[18]:
[0.888064926116905]
In [19]:
lasso = linear_model.Lasso(alpha=1e-4)
test_model(lasso)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.8388317725258476, tolerance: 0.015031309701037674
  positive)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.164636704643625, tolerance: 0.015746248308600008
  positive)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.8516198073708159, tolerance: 0.015776900600358995
  positive)
Out[19]:
[0.8838984300386702]

Support vector machine

In [20]:
from sklearn.svm import SVR
svr = SVR(kernel='rbf')
test_model(svr)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[20]:
[0.8829505588692946]

svm hyper parameter tuning

In [22]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
params = {'kernel': ['rbf'],
         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
         'C': [0.1, 1, 10, 100, 1000],
         'epsilon': [1, 0.2, 0.1, 0.01, 0.001, 0.0001]}
rand_search = RandomizedSearchCV(svr_reg, param_distributions=params, n_jobs=-1, cv=11)
rand_search.fit(X_train, y_train)
rand_search.best_score_
Out[22]:
0.896113151557455
In [26]:
rand_search.best_estimator_
Out[26]:
SVR(C=10, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.001,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [27]:
svr_reg1=SVR(C=10, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.001,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
test_model(svr_reg1)
Out[27]:
[0.8879819130252286]
In [36]:
svr_reg= SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=0.01, gamma=0.0001,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
test_model(svr_reg)
Out[36]:
[0.9002022271963562]

XGBoost

In [23]:
import xgboost
#xgb_reg=xgboost.XGBRegressor()
xgb_reg = xgboost.XGBRegressor(bbooster='gbtree', random_state=51)
test_model(xgb_reg)
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:30:03] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:30:04] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:30:05] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[23]:
[0.8869273850385838]
In [24]:
xgb2_reg=xgboost.XGBRegressor(n_estimators= 899,
 mon_child_weight= 2,
 max_depth= 4,
 learning_rate= 0.05,
 booster= 'gbtree')

test_model(xgb2_reg)
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:30:05] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:30:12] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:30:19] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[24]:
[0.895543397957716]

Solution

In [29]:
xgb2_reg.fit(X_train,y_train)
y_pred = np.exp(xgb2_reg.predict(X_test)).round(2)
submit_test = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test.columns=['Id', 'SalePrice']
submit_test.to_csv('sample_submission.csv', index=False)
submit_test

"""
Rank: 1444
Red AI Productionnovice 
tier
0.12278
5
now
Your Best Entry 
Your submission scored 0.13481, which is not an improvement of your best score. Keep trying!"""
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \
[18:43:20] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[29]:
Id SalePrice
0 1461 124437.382812
1 1462 168348.343750
2 1463 192742.093750
3 1464 194002.046875
4 1465 182459.687500
1454 2915 80093.421875
1455 2916 83680.359375
1456 2917 146870.906250
1457 2918 114270.882812
1458 2919 216138.593750

1459 rows × 2 columns

In [ ]:
svr_reg.fit(X_train,y_train)
y_pred = np.exp(svr_reg.predict(X_test)).round(2)
submit_test = pd.concat([test['Id'],pd.DataFrame(y_pred)], axis=1)
submit_test.columns=['Id', 'SalePrice']
submit_test.to_csv('sample_submission.csv', index=False)
submit_test

"""
file: sample_submission-v1-fs
rank: 1444
Red AI Productionnovice tier
0.12278
4
3m
Your Best Entry 
You advanced 140 places on the leaderboard!

Your submission scored 0.12278, which is an improvement of your previous score of 0.12484. Great job!"""

Model Save

In [34]:
import pickle

pickle.dump(svr_reg, open('model_house_price_prediction.csv', 'wb'))
model_house_price_prediction = pickle.load(open('model_house_price_prediction.csv', 'rb'))
model_house_price_prediction.predict(X_test)
Out[34]:
array([11.66072156, 12.01923057, 12.1316775 , ..., 12.03046738,
       11.72310123, 12.29459311])
In [35]:
test_model(model_house_price_prediction)
Out[35]:
[0.9002022271963562]

SVM Accuracy = 90%

Machine Learning Model Building Never End Until And Unless App Not Stop

Congratulation!!!!!!!

We have completed the Machine learning Project successfully with 90% accuracy which is great for the ‘House Price Prediction: Advance Regression Technique’ project. Now, we are ready to deploy our ML model in the real estate domain.

Click on the below button to download the ‘House Price Prediction: Advance Regression Technique’  Machine Learning end to end project in the Jupyter Notebook file.

Download Project

Conclusion

To get more accuracy, we trained all top supervised regression algorithms but you can try out a few of them which are always popular. After training all algorithms, we found that SVR and XGBoost regressor have given high accuracy than remain but we have chosen SVR.

As ML Engineer, we always retrain the deployed model after some period of time to sustain the accuracy of the model. We hope our efforts will help to predict the price of a house for the buyer and seller.

Please share your feedback and doubt regarding this ML project, so we can update it.

I hope you enjoy the Machine Learning End to End project. Thank you….. -:)

Click here to learn more Machine learning end to end projects.

2 thoughts on “ML Project: House Prices Prediction Advanced Regression Techniques | Kaggle Competition”

Leave a Reply