In this end-to-end machine learning project, we work with data from a financial application and predict which customers will subscribe to the premium version of the app. The company can then decide which customers should receive an offer. The data contains customer behavior, and our job is to find insights in it. To complete this project, we use Python and its libraries NumPy, Pandas, Matplotlib, and Seaborn. So let’s start.
- Business Problem
- Machine Learning End to End Project
Business Problem
A financial technology (FinTech) company has launched a mobile app. The app brings financial services such as bank loans and savings together in one place. It has two versions: free and premium. The free version contains the basic features; customers who want the premium features have to pay to unlock them.
The main goal of the company is to sell the premium version with a low advertising cost, but it does not know how to do that. For this reason, it made the premium features available in the free version for 24 hours to collect data on customer behavior. After that, the company hired a machine learning engineer to find insights in the collected data.
The job of the ML engineer is to predict whether a new customer is likely to buy the product. Customers who will buy the product anyway do not need an offer; giving them one only costs the business revenue. Offers should go only to customers who are interested in the premium version but are unlikely to pay the full price. That way the company targets its offers and earns more money.
So before starting the ML project, we need to import some required libraries.
Follow the “Directing Customers to Subscription Through Financial App Behavior Analysis Machine Learning End to End Project” step by step to get four bonus resources:
1. Raw Dataset
2. Ready to use Clean Dataset for ML project
3. Full Project in Jupyter Notebook File
4. Full Project in Python File
Machine Learning End to End Project
Import essential libraries
import numpy as np # for numeric calculation
import pandas as pd # for data analysis and manipulation
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
from dateutil import parser # convert time in date time data type
Import dataset & explore
It is time to import the business data and see what it looks like. Because the dataset is in CSV format, we use the pandas pd.read_csv() function to load it into a pandas DataFrame. The dataset file name is “FineTech_appData.csv”.
fineTech_appData = pd.read_csv("Dataset/FineTech appData/FineTech_appData.csv")
fineTech_appData.shape # get shape of dataset
Output >>> (50000, 12)
The business dataset contains information on 50,000 customers, described by 12 features.
To see what the business data looks like, we use the DataFrame.head() method to get the first rows of the fineTech_appData DataFrame and the DataFrame.tail() method to get the last rows.
fineTech_appData.head(6) # show first 6 rows of fineTech_appData DataFrame
Output >>>
fineTech_appData.tail(6) # show last 6 rows of fineTech_appData DataFrame
Output >>>
The full content of the 6th column (screen_list) is not visible, so we use the Python snippet below. It prints screen_list for the five rows with index 1 to 5.
for i in [1,2,3,4,5]:
    print(fineTech_appData.loc[i,'screen_list'],'\n')
Output >>>
joinscreen,product_review,product_review2,ScanPreview,VerifyDateOfBirth,location,VerifyCountry,VerifyPhone,VerifyToken,Institutions,Loan2
Splash,Cycle,Loan
product_review,Home,product_review,Loan3,Finances,Credit3,ReferralContainer,Leaderboard,Rewards,RewardDetail,ScanPreview,location,VerifyDateOfBirth,VerifyPhone,VerifySSN,Credit1,Credit2
idscreen,joinscreen,Cycle,Credit3Container,ScanPreview,VerifyPhone,VerifySSN,Credit1,Loan2,Home,Institutions,SelectInstitution,BankVerification,ReferralContainer,product_review,product_review2,VerifyCountry,VerifyToken,product_review
idscreen,Cycle,Home,ScanPreview,VerifyPhone,VerifySSN,Credit1,Credit3Dashboard,Loan2,Institutions,product_review,product_review,product_review3
Know about dataset
As you can see, the fineTech_appData DataFrame holds data on 50,000 users with 12 different features. Let’s look at each feature briefly.
1. user: Unique ID for each user.
2. first_open: Date (yyyy-mm-dd) and time (Hour:Minute:Seconds.Milliseconds) when the user first opened the app.
3. dayofweek: The day of the week on which the user first opened the app.
- 0: Sunday
- 1: Monday
- 2: Tuesday
- 3: Wednesday
- 4: Thursday
- 5: Friday
- 6: Saturday
4. hour: The hour of the day (24-hour format) at which the user first opened the app. It is related to the dayofweek column.
5. age: The age of the registered user.
6. screen_list: The names of the screens seen by the customer, separated by commas.
7. numscreens: The total number of screens seen by the customer.
8. minigame: The app contains small finance-related games. 1 if the customer played a mini-game, otherwise 0.
9. used_premium_feature: 1 if the customer used a premium feature of the app, otherwise 0.
10. enrolled: 1 if the user bought the premium version of the app, otherwise 0.
11. enrolled_date: Date (yyyy-mm-dd) and time (Hour:Minute:Seconds.Milliseconds) at which the user bought the premium version.
12. liked: Each screen of the app has a like button. 1 if the customer liked a screen, otherwise 0.
Find the null values in the DataFrame using the DataFrame.isnull() method and sum them per column with the sum() method.
fineTech_appData.isnull().sum() # take summation of null values
Output >>>
user 0
first_open 0
dayofweek 0
hour 0
age 0
screen_list 0
numscreens 0
minigame 0
used_premium_feature 0
enrolled 0
enrolled_date 18926
liked 0
dtype: int64
No column contains null values except enrolled_date, which has 18,926 null values.
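The 18,926 missing values match the number of not enrolled users counted later in this project, so the nulls most likely mark users who never enrolled. A minimal sketch of how that could be checked, using a tiny made-up DataFrame (toy is hypothetical; on the real data you would use fineTech_appData instead):

```python
import pandas as pd

# Hypothetical toy data standing in for fineTech_appData
toy = pd.DataFrame({
    "enrolled": [1, 0, 1, 0],
    "enrolled_date": ["2013-07-05 16:11:49.513", None,
                      "2013-02-26 18:56:37.841", None],
})

# True when a row has a missing enrolled_date exactly when enrolled == 0
nulls_match = (toy["enrolled_date"].isnull() == (toy["enrolled"] == 0)).all()
print(nulls_match)
```

If the check prints False on the real data, some rows would need a closer look before modeling.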
Get brief information about the dataset using the DataFrame.info() method.
fineTech_appData.info() # brief information about Dataset
Output >>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
user 50000 non-null int64
first_open 50000 non-null object
dayofweek 50000 non-null int64
hour 50000 non-null object
age 50000 non-null int64
screen_list 50000 non-null object
numscreens 50000 non-null int64
minigame 50000 non-null int64
used_premium_feature 50000 non-null int64
enrolled 50000 non-null int64
enrolled_date 31074 non-null object
liked 50000 non-null int64
dtypes: int64(8), object(4)
memory usage: 4.6+ MB
We can see in the output provided by DataFrame.info() method, there are 50,000 entries (rows) from 0 to 49999 and a total of 12 columns.
All columns have 50,000 non-null values except enrolled_date. It has 31,074 non-null. There is a total of 8 columns that contain integer 64 bit (int64) values and the remaining 4 are object type.
The fineTech_appData DataFrame uses a little over 4.6 MB of memory.
To see how the numeric variables are distributed, we use the DataFrame.describe() method. It gives the count, mean, std (standard deviation), min and max, and the 25%, 50%, and 75% percentiles of each column.
fineTech_appData.describe() # give the distribution of numerical variables
Output >>>
From the output, we learn more about the dataset. The mean age of the customers is 31.72. Only 10.7% of customers played the minigame, 17.2% used premium features of the app, and 16.5% liked a screen. 62.1% of the customers enrolled in the premium app.
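The percentages above come straight from the mean row of describe(): for a column that only holds 0 and 1, the mean is the share of rows equal to 1. A small sketch with made-up flags (the Series below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical 0/1 flags for ten users
minigame = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

share_played = minigame.mean()          # fraction of users with minigame == 1
print("{:.1%}".format(share_played))    # formatted as a percentage
```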
The description of the ‘dayofweek’ column does not tell us much, because the column is categorical. To get a better picture, we print the unique values of each column and how many there are.
# Get the unique values of each column and their count
features = fineTech_appData.columns
for i in features:
    print("""Unique value of {}\n{}\nlen is {} \n........................\n
    """.format(i, fineTech_appData[i].unique(), len(fineTech_appData[i].unique())))
Output >>>
Unique value of user
[235136 333588 254414 ... 302367 324905 27047]
len is 49874
........................
Unique value of first_open
['2012-12-27 02:14:51.273' '2012-12-02 01:16:00.905'
'2013-03-19 19:19:09.157' ... '2013-02-20 22:41:51.165'
'2013-04-28 12:33:04.288' '2012-12-14 01:22:44.638']
len is 49747
........................
Unique value of dayofweek
[3 6 1 4 2 0 5]
len is 7
........................
Unique value of hour
[' 02:00:00' ' 01:00:00' ' 19:00:00' ' 16:00:00' ' 18:00:00' ' 09:00:00'
' 03:00:00' ' 14:00:00' ' 04:00:00' ' 11:00:00' ' 06:00:00' ' 21:00:00'
' 05:00:00' ' 17:00:00' ' 20:00:00' ' 00:00:00' ' 22:00:00' ' 10:00:00'
' 08:00:00' ' 15:00:00' ' 13:00:00' ' 23:00:00' ' 12:00:00' ' 07:00:00']
len is 24
........................
Unique value of age
[ 23 24 28 31 20 35 26 29 39 32 25 17 21 55 38 27 48 37
22 36 30 58 40 33 57 19 45 34 46 56 42 43 41 47 18 53
44 49 60 50 52 62 63 16 54 70 51 69 68 59 76 75 66 61
72 65 90 64 67 73 77 71 74 89 78 86 80 82 79 87 81 85
101 88 83 100 84 98]
len is 78
........................
Unique value of screen_list
['idscreen,joinscreen,Cycle,product_review,ScanPreview,VerifyDateOfBirth,VerifyPhone,VerifyToken,ProfileVerifySSN,Loan2,Settings,ForgotPassword,Login'
'joinscreen,product_review,product_review2,ScanPreview,VerifyDateOfBirth,location,VerifyCountry,VerifyPhone,VerifyToken,Institutions,Loan2'
'Splash,Cycle,Loan' ...
'joinscreen,product_review,product_review2,ScanPreview,VerifyCountry,VerifyPhone,VerifyToken,VerifyDateOfBirth,location,Home'
'Cycle,Home,product_review,product_review,product_review3,ScanPreview,VerifyDateOfBirth,location,VerifyCountry,VerifyPhone,VerifyToken,product_review,product_review,VerifySSN,product_review,SelectInstitution,BankVerification,product_review,product_review'
'product_review,ScanPreview,VerifyDateOfBirth,VerifyCountry,ProfileVerifySSN,ProfilePage,ProfileEducation,ProfileEducationMajor,Saving2Amount,Saving8,ProfileMaritalStatus,ProfileChildren,Saving2,Saving9,Saving7,Saving6,Saving5,Home,Loan2']
len is 38799
........................
Unique value of numscreens
[ 15 13 3 40 32 14 41 33 19 25 11 4 9 26 6 20 5 8
42 1 38 49 35 10 52 50 76 37 16 47 90 24 45 31 39 17
28 27 57 23 21 12 7 18 48 29 136 34 59 89 22 43 36 56
30 2 44 92 51 70 58 66 46 55 61 75 71 78 85 62 53 54
73 68 69 63 64 88 106 80 127 74 72 137 83 77 65 104 60 67
94 81 110 91 82 96 165 79 86 116 99 98 187 84 111 109 107 162
97 100 95 87 122 216 115 102 128 234 112 108 114 125 119 93 185 192
189 153 243 103 101 118 325 141 129 133 126 120 123 134 121 105 113 117
200 247 179 132 144 130 148]
len is 151
........................
Unique value of minigame
[0 1]
len is 2
........................
Unique value of used_premium_feature
[0 1]
len is 2
........................
Unique value of enrolled
[0 1]
len is 2
........................
Unique value of enrolled_date
[nan '2013-07-05 16:11:49.513' '2013-02-26 18:56:37.841' ...
'2013-02-25 19:36:56.082' '2013-05-09 13:47:52.875'
'2013-04-28 12:35:38.709']
len is 31002
........................
Unique value of liked
[0 1]
len is 2
........................
The output above tells us more about the ‘dayofweek’ and ‘hour’ columns: customers register on the app on every day of the week and in every hour of the day.
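One way to check how ‘dayofweek’ is encoded is to recompute the weekday from ‘first_open’. The sketch below uses the first three first_open values printed above; pandas counts weekdays from Monday = 0 to Sunday = 6. The codes it returns are the same as the first three dayofweek values in the output, which suggests the column may follow the Monday = 0 convention, so the day mapping is worth verifying on the full column.

```python
import pandas as pd

# First three first_open values from the output above
first_open = pd.to_datetime([
    "2012-12-27 02:14:51.273",
    "2012-12-02 01:16:00.905",
    "2013-03-19 19:19:09.157",
])

# pandas convention: Monday = 0 ... Sunday = 6
weekday_codes = [int(c) for c in first_open.dayofweek]
weekday_names = list(first_open.day_name())
print(weekday_codes, weekday_names)
```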
The ‘hour’ column has the object data type, so we convert it to the integer data type.
# convert hour data from string to int
fineTech_appData['hour'] = fineTech_appData.hour.str.slice(1,3).astype(int)
# get data type of each columns
fineTech_appData.dtypes
Output >>>
user int64
first_open object
dayofweek int64
hour int32
age int64
screen_list object
numscreens int64
minigame int64
used_premium_feature int64
enrolled int64
enrolled_date object
liked int64
dtype: object
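The slice(1,3) call relies on every value starting with exactly one space. A slightly more defensive sketch, assuming the same ' HH:MM:SS' format (hour_raw is a hypothetical sample, not the full column):

```python
import pandas as pd

# Sample values in the same format as the 'hour' column (note the leading space)
hour_raw = pd.Series([" 02:00:00", " 19:00:00", " 00:00:00"])

# Strip whitespace, split on ':' and keep the first field as an integer
hour_int = hour_raw.str.strip().str.split(":").str[0].astype(int)
print(hour_int.tolist())
```

This variant keeps working if the leading space is missing, at the cost of one extra string operation.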
To visualize the data we need numeric values, so we drop the columns with the object data type, along with the ‘user’ ID column.
# drop object dtype columns and the user ID
fineTech_appData2 = fineTech_appData.drop(['user', 'first_open', 'screen_list', 'enrolled_date'], axis = 1)
fineTech_appData2.head(6) # head of numeric dataFrame
Output >>>
Data visualization
Heatmap using the correlation matrix
A heatmap of the correlation matrix shows the correlation between every pair of features.
# Heatmap
plt.figure(figsize=(16,9)) # heatmap size in ratio 16:9
sns.heatmap(fineTech_appData2.corr(), annot = True, cmap ='coolwarm') # show heatmap
plt.title("Heatmap using correlation matrix of fineTech_appData2", fontsize = 25) # title of heatmap
Output >>>
In the fineTech_appData2 dataset, there is no strong correlation between any pair of features. There is a small correlation between ‘numscreens’ and ‘enrolled’, meaning that customers who saw more screens are more likely to take the premium app. There is a slight correlation of ‘minigame’ with ‘enrolled’ and ‘used_premium_feature’. There is a slightly negative correlation of ‘age’ with ‘enrolled’ and ‘numscreens’, meaning that older customers are less likely to take the premium app and tend to see fewer screens.
Pair plot of fineTech_appData2
The pair plot shows the distribution of each feature and a scatter plot for each pair of features.
# Pairplot of fineTech_appData2 Dataset
#%matplotlib qt5 # to show the graph in a separate window
sns.pairplot(fineTech_appData2, hue = 'enrolled')
Output >>>
In the pair plot we can see that most features have only two values, 0 and 1, and the orange dots show the enrolled customers. So next we look at a count plot of the enrolled data.
Countplot of enrolled
# Show countplot of 'enrolled' feature
sns.countplot(x = 'enrolled', data = fineTech_appData)
Output >>>
Here you can see the exact number of enrolled and not enrolled customers.
# number of enrolled and not enrolled customers
print("Not enrolled user = ", (fineTech_appData.enrolled < 1).sum(), "out of 50000")
print("Enrolled user = ",50000-(fineTech_appData.enrolled < 1).sum(), "out of 50000")
Output >>>
Not enrolled user = 18926 out of 50000
Enrolled user = 31074 out of 50000
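The same counts can be read off without hard-coding the 50000. A sketch using value_counts() on a made-up target column (enrolled below is hypothetical; on the real data it would be fineTech_appData.enrolled):

```python
import pandas as pd

# Hypothetical target column standing in for fineTech_appData.enrolled
enrolled = pd.Series([1, 0, 1, 1, 0, 1, 1, 0])

counts = enrolled.value_counts()                 # absolute count per class
shares = enrolled.value_counts(normalize=True)   # relative frequency per class
print(counts)
print(shares)
```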
Histogram of each feature of fineTech_appData2
In the pair plot we saw the distribution of each feature, but here we plot histograms to make them easier to read.
# plot histogram
plt.figure(figsize = (16,9)) # figure size in ratio 16:9
features = fineTech_appData2.columns # list of columns name
for i,j in enumerate(features):
    plt.subplot(3,3,i+1) # create subplot for histogram
    plt.title("Histogram of {}".format(j), fontsize = 15) # title of histogram
    bins = len(fineTech_appData2[j].unique()) # bins for histogram
    plt.hist(fineTech_appData2[j], bins = bins, rwidth = 0.8, edgecolor = "y", linewidth = 2, ) # plot histogram
plt.subplots_adjust(hspace=0.5) # vertical space between subplot rows
Output >>>
In the histograms above, we can see that minigame, used_premium_feature, enrolled, and liked have only two values, and how those values are distributed.
The histogram of ‘dayofweek’ shows that slightly fewer customers registered on the app on Tuesday and Wednesday.
The histogram of ‘hour’ shows that the fewest customers register on the app around 10 AM.
The ‘age’ histogram shows that most customers are young.
The ‘numscreens’ histogram shows that few customers saw more than 40 screens.
Correlation barplot with ‘enrolled’ feature
Now we use a bar plot to see which features are most strongly correlated, positively or negatively, with the ‘enrolled’ feature.
# show correlation barplot
sns.set() # set background dark grid
plt.figure(figsize = (14,5))
plt.title("Correlation all features with 'enrolled' ", fontsize = 20)
fineTech_appData3 = fineTech_appData2.drop(['enrolled'], axis = 1) # drop 'enrolled' feature
ax = sns.barplot(x = fineTech_appData3.columns, y = fineTech_appData3.corrwith(fineTech_appData2.enrolled)) # plot barplot
ax.tick_params(labelsize=15, labelrotation = 20, color ="k") # decorate x & y ticks font
Output >>>
The heatmap of the correlation matrix did not show these correlations clearly, but the bar plot above makes it easy to see how much each feature is correlated with ‘enrolled’.
‘numscreens’ and ‘minigame’ are more positively correlated with ‘enrolled’ than the other features.
‘hour’, ‘age’ and ‘used_premium_feature’ are negatively correlated with the ‘enrolled’ feature.
Now we parse the ‘first_open’ and ‘enrolled_date’ object data into date-time format.
# parsing object data into date time format
fineTech_appData['first_open'] =[parser.parse(i) for i in fineTech_appData['first_open']]
fineTech_appData['enrolled_date'] =[parser.parse(i) if isinstance(i, str) else i for i in fineTech_appData['enrolled_date']]
fineTech_appData.dtypes
Output >>>
user int64
first_open datetime64[ns]
dayofweek int64
hour int32
age int64
screen_list object
numscreens int64
minigame int64
used_premium_feature int64
enrolled int64
enrolled_date datetime64[ns]
liked int64
dtype: object
Next we find how long each customer takes to enroll in the premium app after registration. For that we subtract ‘fineTech_appData.first_open’ from ‘fineTech_appData.enrolled_date’ and express the difference in whole hours.
# whole hours between first open and enrollment (NaN for users who never enrolled);
# .astype('timedelta64[h]') is not supported by recent pandas versions
fineTech_appData['time_to_enrolled'] = (fineTech_appData.enrolled_date - fineTech_appData.first_open).dt.total_seconds() // 3600
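To see what the subtraction produces, here is a sketch on three made-up users (the timestamps are hypothetical, not rows from the dataset). It converts the time difference to fractional hours with .dt.total_seconds(), which is one way to express a Timedelta column in hours:

```python
import pandas as pd

# Hypothetical timestamps: enrolled after 1.5 h, after 3 days, and never
toy = pd.DataFrame({
    "first_open": pd.to_datetime(
        ["2013-01-01 08:00:00", "2013-01-01 08:00:00", "2013-01-01 08:00:00"]),
    "enrolled_date": pd.to_datetime(
        ["2013-01-01 09:30:00", "2013-01-04 08:00:00", None]),
})

delta = toy["enrolled_date"] - toy["first_open"]   # Timedelta column, NaT if never enrolled
hours = delta.dt.total_seconds() / 3600            # fractional hours, NaN if never enrolled
print(hours.tolist())
```

Users without an enrollment date end up as NaN, which is why the histograms use dropna().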
Show the distribution of the time taken to enroll in the app.
# Plot histogram
plt.hist(fineTech_appData['time_to_enrolled'].dropna())
Output >>>
Let’s show the distribution in the range 0 to 100 hours.
# Plot histogram
plt.hist(fineTech_appData['time_to_enrolled'].dropna(), range = (0,100))
Output >>>
The histogram above shows that most customers enrolled within 10 hours of registration.
Feature selection
We treat customers who enrolled more than 48 hours after registration as not enrolled (0).
# Customers who enrolled after more than 48 hours are set to 0
fineTech_appData.loc[fineTech_appData.time_to_enrolled > 48, 'enrolled'] = 0
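A small sketch of what this relabeling does, on three made-up users (toy is hypothetical). Comparing NaN with 48 gives False, so users who never enrolled are left as they are:

```python
import numpy as np
import pandas as pd

# Hypothetical users: enrolled after 2 h, after 72 h, and never
toy = pd.DataFrame({
    "enrolled": [1, 1, 0],
    "time_to_enrolled": [2.0, 72.0, np.nan],
})

# Late enrollments are treated like non-enrollments
toy.loc[toy["time_to_enrolled"] > 48, "enrolled"] = 0
print(toy["enrolled"].tolist())
```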
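Drop the ‘time_to_enrolled’, ‘enrolled_date’, and ‘first_open’ features. The first two are known only after a customer has enrolled, so they cannot be used to predict enrollment, and the weekday and hour of ‘first_open’ are already available in ‘dayofweek’ and ‘hour’.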
fineTech_appData.drop(columns = ['time_to_enrolled', 'enrolled_date', 'first_open'], inplace=True)
Read another CSV file that contains the names of the top screens.
# read csv file and convert it into numpy array
fineTech_app_screen_Data = pd.read_csv("Dataset/FineTech appData/top_screens.csv").top_screens.values
fineTech_app_screen_Data
Output >>>
array(['Loan2', 'location', 'Institutions', 'Credit3Container',
'VerifyPhone', 'BankVerification', 'VerifyDateOfBirth',
'ProfilePage', 'VerifyCountry', 'Cycle', 'idscreen',
'Credit3Dashboard', 'Loan3', 'CC1Category', 'Splash', 'Loan',
'CC1', 'RewardsContainer', 'Credit3', 'Credit1', 'EditProfile',
'Credit2', 'Finances', 'CC3', 'Saving9', 'Saving1', 'Alerts',
'Saving8', 'Saving10', 'Leaderboard', 'Saving4', 'VerifyMobile',
'VerifyHousing', 'RewardDetail', 'VerifyHousingAmount',
'ProfileMaritalStatus', 'ProfileChildren ', 'ProfileEducation',
'Saving7', 'ProfileEducationMajor', 'Rewards', 'AccountView',
'VerifyAnnualIncome', 'VerifyIncomeType', 'Saving2', 'Saving6',
'Saving2Amount', 'Saving5', 'ProfileJobTitle', 'Login',
'ProfileEmploymentLength', 'WebView', 'SecurityModal', 'Loan4',
'ResendToken', 'TransactionList', 'NetworkFailure', 'ListPicker'],
dtype=object)
Append ‘,’ to the end of each ‘screen_list’ string for the next operation.
fineTech_appData['screen_list'] = fineTech_appData.screen_list.astype(str) + ','
‘screen_list’ contains string values, which we can’t use directly. To solve this, we take each screen name from ‘fineTech_app_screen_Data’ and add a column with the same name to ‘fineTech_appData’. The column holds 1 if the screen name occurs in ‘screen_list’ and 0 otherwise. The matched screen name is then removed from ‘screen_list’.
# convert strings to numbers
for screen_name in fineTech_app_screen_Data:
    fineTech_appData[screen_name] = fineTech_appData.screen_list.str.contains(screen_name).astype(int)
    fineTech_appData['screen_list'] = fineTech_appData.screen_list.str.replace(screen_name+",", "")
# get shape
fineTech_appData.shape
Output >>> (50000, 68)
You can see that the dataset now has 68 columns instead of the original 12.
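The loop is easier to follow on a tiny example. The sketch below uses two made-up users and three screen names from the list above. It is a variation, not the exact code of this project: it matches the name followed by a comma and passes regex=False, which is an assumption made here to keep the matching literal.

```python
import pandas as pd

# Hypothetical two-user example; every screen name ends with a comma
toy = pd.DataFrame({"screen_list": ["Splash,Cycle,Loan,", "idscreen,Loan2,Home,"]})
top_screens = ["Loan2", "Cycle", "Loan"]   # 'Loan2' is handled before 'Loan'

for name in top_screens:
    # 1 if the screen (followed by a comma) occurs in the list, else 0
    toy[name] = toy["screen_list"].str.contains(name + ",", regex=False).astype(int)
    # remove the matched screen so that only unlisted screens remain
    toy["screen_list"] = toy["screen_list"].str.replace(name + ",", "", regex=False)

print(toy)
```

After the loop, screen_list only holds the screens that are not in the top-screens list, which is what the next step counts.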
# head of DataFrame
fineTech_appData.head(6)
Output >>>
Screens that are not in ‘fineTech_app_screen_Data’ are counted, and the count is stored in a new column named ‘remain_screen_list’.
# remaining screens in 'screen_list'
fineTech_appData.loc[0,'screen_list']
Output >>>
'joinscreen,product_review,ScanPreview,VerifyToken,ProfileVerifySSN,Settings,ForgotPassword,'
# count the remaining screens and store the count in 'remain_screen_list'
fineTech_appData['remain_screen_list'] = fineTech_appData.screen_list.str.count(",")
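Counting commas works because every screen name in the string ends with a comma. A quick sketch using the remaining screens of row 0 shown above:

```python
import pandas as pd

# Remaining screens of row 0, copied from the output above
remaining = pd.Series(
    ["joinscreen,product_review,ScanPreview,VerifyToken,ProfileVerifySSN,Settings,ForgotPassword,"]
)

# One comma per screen name, so the comma count is the screen count
remain_count = remaining.str.count(",")
print(remain_count.iloc[0])
```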
Dropping the ‘screen_list’ column.
# Drop the 'screen_list'
fineTech_appData.drop(columns = ['screen_list'], inplace=True)
We now have 68 columns in total.
# total columns
fineTech_appData.columns
Output >>>
Index(['user', 'dayofweek', 'hour', 'age', 'numscreens', 'minigame',
'used_premium_feature', 'enrolled', 'liked', 'Loan2', 'location',
'Institutions', 'Credit3Container', 'VerifyPhone', 'BankVerification',
'VerifyDateOfBirth', 'ProfilePage', 'VerifyCountry', 'Cycle',
'idscreen', 'Credit3Dashboard', 'Loan3', 'CC1Category', 'Splash',
'Loan', 'CC1', 'RewardsContainer', 'Credit3', 'Credit1', 'EditProfile',
'Credit2', 'Finances', 'CC3', 'Saving9', 'Saving1', 'Alerts', 'Saving8',
'Saving10', 'Leaderboard', 'Saving4', 'VerifyMobile', 'VerifyHousing',
'RewardDetail', 'VerifyHousingAmount', 'ProfileMaritalStatus',
'ProfileChildren ', 'ProfileEducation', 'Saving7',
'ProfileEducationMajor', 'Rewards', 'AccountView', 'VerifyAnnualIncome',
'VerifyIncomeType', 'Saving2', 'Saving6', 'Saving2Amount', 'Saving5',
'ProfileJobTitle', 'Login', 'ProfileEmploymentLength', 'WebView',
'SecurityModal', 'Loan4', 'ResendToken', 'TransactionList',
'NetworkFailure', 'ListPicker', 'remain_screen_list'],
dtype='object')
All the saving screens are correlated with each other, so we take the sum of all saving screens in each row and store it in a single column for all customers.
# sum all saving screens into one column
saving_screens = ['Saving1',
'Saving2',
'Saving2Amount',
'Saving4',
'Saving5',
'Saving6',
'Saving7',
'Saving8',
'Saving9',
'Saving10',
]
fineTech_appData['saving_screens_count'] = fineTech_appData[saving_screens].sum(axis = 1)
fineTech_appData.drop(columns = saving_screens, inplace = True)
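On a small made-up table the effect of this step looks as follows (toy and its values are hypothetical). Each row-wise sum tells how many different saving screens a user visited:

```python
import pandas as pd

# Hypothetical 0/1 screen flags for three users
toy = pd.DataFrame({
    "Saving1": [1, 0, 0],
    "Saving2": [1, 1, 0],
    "Saving4": [0, 1, 0],
    "Home":    [1, 1, 1],
})
saving_cols = ["Saving1", "Saving2", "Saving4"]

# Row-wise sum of the saving flags, then drop the individual columns
toy["saving_screens_count"] = toy[saving_cols].sum(axis=1)
toy = toy.drop(columns=saving_cols)
print(toy)
```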
Similarly for the credit, CC, and loan screens.
credit_screens = ['Credit1',
'Credit2',
'Credit3',
'Credit3Container',
'Credit3Dashboard',
]
fineTech_appData['credit_screens_count'] = fineTech_appData[credit_screens].sum(axis = 1)
fineTech_appData.drop(columns = credit_screens, inplace = True)
cc_screens = ['CC1',
'CC1Category',
'CC3',
]
fineTech_appData['cc_screens_count'] = fineTech_appData[cc_screens].sum(axis = 1)
fineTech_appData.drop(columns = cc_screens, inplace = True)
loan_screens = ['Loan',
'Loan2',
'Loan3',
'Loan4',
]
fineTech_appData['loan_screens_count'] = fineTech_appData[loan_screens].sum(axis = 1)
fineTech_appData.drop(columns = loan_screens, inplace = True)
Now you can see that the number of columns in the DataFrame is reduced.
fineTech_appData.shape
Output >>> (50000, 50)
See information of fineTech_appData
fineTech_appData.info()
Output >>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 50 columns):
user 50000 non-null int64
dayofweek 50000 non-null int64
hour 50000 non-null int32
age 50000 non-null int64
numscreens 50000 non-null int64
minigame 50000 non-null int64
used_premium_feature 50000 non-null int64
enrolled 50000 non-null int64
liked 50000 non-null int64
location 50000 non-null int32
Institutions 50000 non-null int32
VerifyPhone 50000 non-null int32
BankVerification 50000 non-null int32
VerifyDateOfBirth 50000 non-null int32
ProfilePage 50000 non-null int32
VerifyCountry 50000 non-null int32
Cycle 50000 non-null int32
idscreen 50000 non-null int32
Splash 50000 non-null int32
RewardsContainer 50000 non-null int32
EditProfile 50000 non-null int32
Finances 50000 non-null int32
Alerts 50000 non-null int32
Leaderboard 50000 non-null int32
VerifyMobile 50000 non-null int32
VerifyHousing 50000 non-null int32
RewardDetail 50000 non-null int32
VerifyHousingAmount 50000 non-null int32
ProfileMaritalStatus 50000 non-null int32
ProfileChildren 50000 non-null int32
ProfileEducation 50000 non-null int32
ProfileEducationMajor 50000 non-null int32
Rewards 50000 non-null int32
AccountView 50000 non-null int32
VerifyAnnualIncome 50000 non-null int32
VerifyIncomeType 50000 non-null int32
ProfileJobTitle 50000 non-null int32
Login 50000 non-null int32
ProfileEmploymentLength 50000 non-null int32
WebView 50000 non-null int32
SecurityModal 50000 non-null int32
ResendToken 50000 non-null int32
TransactionList 50000 non-null int32
NetworkFailure 50000 non-null int32
ListPicker 50000 non-null int32
remain_screen_list 50000 non-null int64
saving_screens_count 50000 non-null int64
credit_screens_count 50000 non-null int64
cc_screens_count 50000 non-null int64
loan_screens_count 50000 non-null int64
dtypes: int32(37), int64(13)
memory usage: 12.0 MB
# Numerical distribution of fineTech_appData
fineTech_appData.describe()
Output >>>
Heatmap with the correlation matrix
# Heatmap with correlation matrix of new fineTech_appData
plt.figure(figsize = (25,16))
sns.heatmap(fineTech_appData.corr(), annot = True, linewidth =2)
Output >>>
Data preprocessing
Split dataset in Train and Test
# keep a copy of the clean dataset (including the target) before dropping 'enrolled'
clean_fineTech_appData = fineTech_appData.copy()
target = fineTech_appData['enrolled']
fineTech_appData.drop(columns = 'enrolled', inplace = True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(fineTech_appData, target, test_size = 0.2, random_state = 0)
print('Shape of X_train = ', X_train.shape)
print('Shape of X_test = ', X_test.shape)
print('Shape of y_train = ', y_train.shape)
print('Shape of y_test = ', y_test.shape)
Output >>>
Shape of X_train = (40000, 49)
Shape of X_test = (10000, 49)
Shape of y_train = (40000,)
Shape of y_test = (10000,)
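With test_size = 0.2, 80% of the rows go to training and 20% to testing. The sketch below shows the same call on made-up data and adds the stratify argument, which is not used in this project; it is an optional variation that keeps the class ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 3 features, 60% positives
X = np.arange(150).reshape(50, 3)
y = np.array([0] * 20 + [1] * 30)

# stratify=y keeps the share of each class the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```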
# take User ID in another variable
train_userID = X_train['user']
X_train = X_train.drop(columns = 'user')
test_userID = X_test['user']
X_test = X_test.drop(columns = 'user')
print('Shape of X_train = ', X_train.shape)
print('Shape of X_test = ', X_test.shape)
print('Shape of train_userID = ', train_userID.shape)
print('Shape of test_userID = ', test_userID.shape)
Output >>>
Shape of X_train = (40000, 48)
Shape of X_test = (10000, 48)
Shape of train_userID = (40000,)
Shape of test_userID = (10000,)
Feature Scaling
The features are measured on different scales, so for the best accuracy we bring them to a common scale with standardization.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
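What the scaler does can be written out with NumPy: subtract the mean and divide by the standard deviation of each column, both computed on the training data only. A sketch on made-up numbers (the arrays are hypothetical):

```python
import numpy as np

# Hypothetical train/test matrices with two features on different scales
X_train_toy = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]])
X_test_toy = np.array([[30.0, 9.0]])

# Statistics are learned from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)                # population std, as StandardScaler uses

X_train_toy_sc = (X_train_toy - mean) / std
X_test_toy_sc = (X_test_toy - mean) / std    # reuse the training statistics
print(X_train_toy_sc.mean(axis=0), X_train_toy_sc.std(axis=0))
print(X_test_toy_sc)
```

This is why the code above calls fit_transform on X_train but only transform on X_test.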
Machine Learning Model Building
The target variable is categorical (0 or 1), so we use supervised classification algorithms.
To build the best model, we train and test multiple machine learning algorithms on the dataset and then pick the best one. So let’s try.
First, we import the required packages.
# import required packages
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
Decision Tree Classifier
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion= 'entropy', random_state=0)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
accuracy_score(y_test, y_pred_dt)
Output >>> 0.6936
# train with the standard-scaled dataset
dt_model2 = DecisionTreeClassifier(criterion= 'entropy', random_state=0)
dt_model2.fit(X_train_sc, y_train)
y_pred_dt_sc = dt_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_dt_sc)
Output >>> 0.6932
K – Nearest Neighbor Classifier
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2,)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
accuracy_score(y_test, y_pred_knn)
Output >>> 0.6994
# train with the standard-scaled dataset
knn_model2 = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2,)
knn_model2.fit(X_train_sc, y_train)
y_pred_knn_sc = knn_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_knn_sc)
Output >>> 0.7314
Naive Bayes Classifier
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)
accuracy_score(y_test, y_pred_nb)
Output >>> 0.7114
# train with the standard-scaled dataset
nb_model2 = GaussianNB()
nb_model2.fit(X_train_sc, y_train)
y_pred_nb_sc = nb_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_nb_sc)
Output >>> 0.7114
Random Forest Classifier
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
accuracy_score(y_test, y_pred_rf)
Output >>> 0.7621
# train with the standard-scaled dataset
rf_model2 = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rf_model2.fit(X_train_sc, y_train)
y_pred_rf_sc = rf_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_rf_sc)
Output >>> 0.7616
Logistic Regression
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear') # liblinear supports the l1 penalty
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
accuracy_score(y_test, y_pred_lr)
Output >>> 0.7684
# train with the standard-scaled dataset
lr_model2 = LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear') # liblinear supports the l1 penalty
lr_model2.fit(X_train_sc, y_train)
y_pred_lr_sc = lr_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_lr_sc)
Output >>> 0.7681
Support Vector Classifier
# Support Vector Machine
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)
y_pred_svc = svc_model.predict(X_test)
accuracy_score(y_test, y_pred_svc)
Output >>> 0.7616
# train with the standard-scaled dataset
svc_model2 = SVC()
svc_model2.fit(X_train_sc, y_train)
y_pred_svc_sc = svc_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_svc_sc)
Output >>> 0.779
XGBoost Classifier
# XGBoost Classifier
from xgboost import XGBClassifier
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
accuracy_score(y_test, y_pred_xgb)
Output >>> 0.7748
# train with the standard-scaled dataset
xgb_model2 = XGBClassifier()
xgb_model2.fit(X_train_sc, y_train)
y_pred_xgb_sc = xgb_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_xgb_sc)
Output >>> 0.7748
# XGB classifier with parameter tuning
xgb_model_pt1 = XGBClassifier(
learning_rate =0.01,
n_estimators=5000,
max_depth=4,
min_child_weight=6,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.005,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
xgb_model_pt1.fit(X_train, y_train)
y_pred_xgb_pt1 = xgb_model_pt1.predict(X_test)
accuracy_score(y_test, y_pred_xgb_pt1)
Output >>> 0.7887
# XGB classifier with parameter tuning
# train with the standard-scaled dataset
xgb_model_pt2 = XGBClassifier(
learning_rate =0.01,
n_estimators=5000,
max_depth=4,
min_child_weight=6,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.005,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
xgb_model_pt2.fit(X_train_sc, y_train)
y_pred_xgb_sc_pt2 = xgb_model_pt2.predict(X_test_sc)
accuracy_score(y_test, y_pred_xgb_sc_pt2)
Output >>> 0.7887
We observe that the Support Vector Classifier and the XGBoost Classifier give better accuracy than the other ML algorithms. We continue with the tuned XGBoost classifier because its accuracy (0.7887) is slightly higher than that of the SVC (0.779).
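To compare the models at a glance, the accuracies can be collected in a dictionary and sorted. The sketch below simply copies the scaled-data scores from the outputs above; the dictionary keys are labels chosen here for readability.

```python
# Test-set accuracies on the scaled data, copied from the outputs above
accuracy = {
    "Decision Tree": 0.6932,
    "KNN": 0.7314,
    "Naive Bayes": 0.7114,
    "Random Forest": 0.7616,
    "Logistic Regression": 0.7681,
    "SVC": 0.779,
    "XGBoost": 0.7748,
    "XGBoost (tuned)": 0.7887,
}

# Sort from best to worst accuracy
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print("{:<20} {:.4f}".format(name, score))
```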
Confusion Matrix
# confusion matrix
cm_xgb_pt2 = confusion_matrix(y_test, y_pred_xgb_sc_pt2)
sns.heatmap(cm_xgb_pt2, annot = True, fmt = 'g')
plt.title("Confusion Matrix", fontsize = 20)
Output >>>
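The model makes more type II errors (false negatives: 1197 enrolled users predicted as not enrolled) than type I errors (false positives: 916 non-enrolled users predicted as enrolled).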
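The two error types can be read directly off the confusion matrix. The snippet below is a minimal sketch: it assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and hard-codes the matrix that this project prints later for the same predictions. The variable names are illustrative only.

```python
# Confusion matrix of the tuned XGBoost model, copied from the project output.
# Assumed layout: rows = actual class, columns = predicted class.
cm = [[4156, 916],
      [1197, 3731]]

# Unpack the four cells: true negatives, false positives,
# false negatives, true positives.
(tn, fp), (fn, tp) = cm

type_1_errors = fp  # predicted "enrolled" but the user did not enroll
type_2_errors = fn  # predicted "not enrolled" but the user did enroll

print("Type I errors (false positives):", type_1_errors)
print("Type II errors (false negatives):", type_2_errors)
print("Type II higher than type I:", type_2_errors > type_1_errors)
```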
Classification report of ML model
# Classification report
cr_xgb_pt2 = classification_report(y_test, y_pred_xgb_sc_pt2)
print("Classification report >>> \n", cr_xgb_pt2)
Output >>>
Classification report >>>
               precision    recall  f1-score   support
           0       0.78      0.82      0.80      5072
           1       0.80      0.76      0.78      4928
   micro avg       0.79      0.79      0.79     10000
   macro avg       0.79      0.79      0.79     10000
weighted avg       0.79      0.79      0.79     10000
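To see where the numbers in the report come from, the per-class precision, recall, and F1 score can be recomputed by hand from the confusion matrix. This is only a sketch: it hard-codes the confusion matrix printed later in this project for the same predictions and assumes the usual scikit-learn layout (rows actual, columns predicted).

```python
# Confusion matrix of the tuned XGBoost model, copied from the project output.
cm = [[4156, 916],
      [1197, 3731]]
(tn, fp), (fn, tp) = cm

# Class 1 (enrolled)
precision_1 = tp / (tp + fp)   # of all predicted 1, how many were really 1
recall_1 = tp / (tp + fn)      # of all actual 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (not enrolled): the roles of the cells are mirrored
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print("class 0:", round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
print("class 1:", round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print("accuracy:", accuracy)
```

Rounded to two decimals, these values agree with the report above.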
Cross-validation of the ML model
To check whether the ML model is overfitted, underfitted, or generalized, we perform cross-validation.
# Cross validation
from sklearn.model_selection import cross_val_score
cross_validation = cross_val_score(estimator = xgb_model_pt2, X = X_train_sc, y = y_train, cv = 10)
print("Cross validation of XGBoost model = ",cross_validation)
print("Cross validation of XGBoost model (in mean) = ",cross_validation.mean())
Output >>>
Cross validation of XGBoost model =
[0.79255186 0.77855536 0.78875 0.78625 0.7795 0.78575
0.79 0.7815 0.78944736 0.77844461]
Cross validation of XGBoost model (in mean) = 0.7850749196187449
The mean cross-validation score (about 78.5%) is close to the XGBoost model's test accuracy (78.87%). That suggests our XGBoost model generalizes well.
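The comparison behind that statement can be made explicit. The sketch below hard-codes the ten fold scores and the test accuracy printed above, then looks at the mean, the spread across folds, and the gap to the test accuracy. There is no fixed threshold here; a small gap and a small spread are only a rule-of-thumb sign that the model is neither overfitted nor underfitted.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above.
cv_scores = [0.79255186, 0.77855536, 0.78875, 0.78625, 0.7795,
             0.78575, 0.79, 0.7815, 0.78944736, 0.77844461]

# Accuracy of the tuned XGBoost model on the test set.
test_accuracy = 0.7887

cv_mean = mean(cv_scores)
cv_std = pstdev(cv_scores)           # spread of the scores across folds
gap = abs(test_accuracy - cv_mean)   # difference between test and CV accuracy

print("CV mean:", round(cv_mean, 4))
print("CV standard deviation:", round(cv_std, 4))
print("Gap to test accuracy:", round(gap, 4))
```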
Mapping predicted output to the target
In the output below, you can compare the output predicted by the model with the actual target output.
final_result = pd.concat([test_userID, y_test], axis = 1)
final_result['predicted result'] = y_pred_xgb_sc_pt2
print(final_result)
Output >>>
user enrolled predicted result
11841 239786 1 1
19602 279644 1 1
45519 98290 0 0
25747 170150 1 1
42642 237568 1 0
31902 65042 1 0
30346 207226 1 1
12363 363062 0 0
32490 152296 1 1
26128 64484 0 0
14227 38108 1 1
26376 359940 0 0
44173 136089 0 0
12968 14231 1 1
32104 216038 0 0
17844 18918 1 1
43460 316730 1 1
8369 28308 1 0
15055 228387 1 1
6338 69640 1 1
15301 358264 0 0
46250 348059 0 0
45580 178743 1 1
24647 167556 0 0
46712 294101 0 0
4150 192801 0 0
42460 163983 1 1
29079 298830 0 0
19412 151790 1 1
34839 20200 1 1
... ... ... ...
3380 348989 0 0
37623 248593 1 0
24852 316086 1 1
29372 192540 1 1
49639 256833 0 0
2930 273991 1 1
1210 365937 0 0
22652 295129 0 0
32360 255715 1 0
9171 37332 0 1
49037 164886 1 0
17793 309967 0 0
28887 14907 0 0
567 244737 1 1
662 284862 0 0
46038 60719 1 1
16778 262103 1 0
3075 243679 1 1
34793 280000 1 1
6557 255074 0 0
19150 347521 0 0
40096 335029 1 0
7869 37271 1 1
49546 240006 1 1
45202 279449 0 1
25091 143036 1 1
27853 91158 1 1
47278 248318 0 0
37020 142418 1 1
2217 279355 1 0
[10000 rows x 3 columns]
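Once predictions sit next to the actual target, the misclassified users can be filtered out for inspection. The sketch below rebuilds only the first six rows of the output above by hand so that it runs on its own; the `correct` column and the `misclassified` variable are hypothetical names added for illustration and are not part of the original project.

```python
import pandas as pd

# First six rows of the output above, re-typed by hand for illustration.
final_result = pd.DataFrame(
    {
        "user": [239786, 279644, 98290, 170150, 237568, 65042],
        "enrolled": [1, 1, 0, 1, 1, 1],
        "predicted result": [1, 1, 0, 1, 0, 0],
    },
    index=[11841, 19602, 45519, 25747, 42642, 31902],
)

# Hypothetical helper column: True where the prediction matches the target.
final_result["correct"] = (
    final_result["enrolled"] == final_result["predicted result"]
)

# Keep only the rows where the model was wrong.
misclassified = final_result[~final_result["correct"]]

print(misclassified)
print("Share correct in this sample:", final_result["correct"].mean())
```

On the full `final_result` DataFrame, the same two lines would list every user the model got wrong.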
Save the Machine Learning model
After building the ML model, we need to deploy it in an application, and to deploy it we first have to save it. To save the model, we can use the pickle or joblib package.
Here we will see both ways. Use whichever works better for you.
Save the ML model with Pickle
## Pickle
import pickle
# save model
with open('FineTech_app_ML_model.pickle', 'wb') as f:
    pickle.dump(xgb_model_pt2, f)
# load model
with open('FineTech_app_ML_model.pickle', 'rb') as f:
    ml_model_pl = pickle.load(f)
# predict the output with the loaded model
y_pred_pl = ml_model_pl.predict(X_test_sc)
# confusion matrix
cm_pl = confusion_matrix(y_test, y_pred_pl)
print('Confusion matrix = \n', cm_pl)
# show the accuracy
print("Accuracy of model = ", accuracy_score(y_test, y_pred_pl))
Output >>>
Confusion matrix =
[[4156 916]
[1197 3731]]
Accuracy of model = 0.7887
Save the ML model with Joblib
## Joblib
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
# save model
joblib.dump(xgb_model_pt2, 'FineTech_app_ML_model.joblib')
# load model
ml_model_jl = joblib.load('FineTech_app_ML_model.joblib')
# predict the output with the loaded model
y_pred_jl = ml_model_jl.predict(X_test_sc)
# confusion matrix
cm_jl = confusion_matrix(y_test, y_pred_jl)
print('Confusion matrix = \n', cm_jl)
# show the accuracy
print("Accuracy of model = ", accuracy_score(y_test, y_pred_jl))
Output >>>
Confusion matrix =
[[4156 916]
[1197 3731]]
Accuracy of model = 0.7887
Note: When we dump the model, the model file is stored on disk in the same folder as the project file. We can change the location by passing a full path instead of just a file name.
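As a rough illustration of the note above, the sketch below saves to an explicit folder with pickle. So that it runs on its own, a plain dictionary stands in for the trained model (in the project this would be `xgb_model_pt2`), and the `models` folder inside a temporary directory is a hypothetical location chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the trained model so the sketch runs on its own;
# in the project this would be xgb_model_pt2.
model = {"name": "xgb_model_pt2", "learning_rate": 0.01, "max_depth": 4}

# Hypothetical target folder. A temporary directory is used here so the
# sketch does not write into the project folder; replace it with any path.
model_dir = Path(tempfile.mkdtemp()) / "models"
model_dir.mkdir(parents=True, exist_ok=True)
model_path = model_dir / "FineTech_app_ML_model.pickle"

# Save the model to the chosen path.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back from the same path.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print("Saved to:", model_path)
print("Round trip OK:", restored == model)
```

`joblib.dump` and `joblib.load` likewise take the file path as an argument, so the same path can be passed to them.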
Congratulations!
We have completed the Machine Learning project successfully with 78.87% accuracy, which is a good result for the ‘Directing Customers to Subscription Through Financial App Behavior Analysis’ project. Now we are ready to deploy our ML model in the Fin-Tech company's project.
Conclusion
To get more accuracy, we trained all the supervised classification algorithms, but you can try just a few of the popular ones. After training all the algorithms, we found that the SVC and XGBoost classifiers give higher accuracy than the rest, and we chose XGBoost.
As ML engineers, we should retrain the deployed model periodically to maintain its accuracy. We hope our efforts will bring more profit to the fin-tech company.
Please share your feedback and questions about this ML project so we can update it.
I hope you enjoyed the Machine Learning End to End project. Thank you.