ML Project: Directing Customers to Subscription Through Financial App Behavior Analysis

In this end-to-end machine learning project, we work with data from a financial application and predict which customers will subscribe to the premium version of the app. The company can then decide which customers should receive offers. The data records customer behavior, and our job is to extract insights from it. To complete this project, we use Python and its libraries NumPy, Pandas, Matplotlib, and Seaborn. So let’s start.

Business Problem

A financial technology (FinTech) company has launched a mobile app that brings financial services such as bank loans and savings together in one place. The app has two versions: free and premium. The free version contains basic features; customers who want the premium features have to pay to unlock them.


The main goal of the company is to sell the premium version of the app with low advertising cost, but it doesn’t know how to do that. For this reason, it made the premium features available in the free version for 24 hours in order to collect data on customer behavior. After that, the company hired a machine learning engineer to find insights in the collected data.

The job of the ML engineer is to predict which new customers are interested in buying the product. If a customer will buy the product anyway, there is no need to give that customer an offer and lose revenue. Offers should go only to customers who are interested in the premium version but can’t afford its cost. By giving offers to those customers, the company earns more money.

Before starting the ML project, we need to import the required libraries.

Follow the “Directing Customers to Subscription Through Financial App Behavior Analysis Machine Learning End to End Project” step by step to get 4 bonuses:
1. Raw Dataset
2. Ready to use Clean Dataset for ML project
3. Full Project in Jupyter Notebook File
4. Full Project in Python File


Machine Learning End to End Project

Import essential libraries

import numpy as np # for numeric calculation
import pandas as pd # for data analysis and manipulation
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
from dateutil import parser # parse date-time strings into datetime objects

Import dataset & explore

It is time to import the business data and see what it looks like. Because the dataset is in CSV format, we use the pandas pd.read_csv() function to load it into a pandas DataFrame. The dataset file name is “FineTech_appData.csv”.


fineTech_appData = pd.read_csv("Dataset/FineTech appData/FineTech_appData.csv")
fineTech_appData.shape # get shape of dataset

Output >>> (50000, 12)

The business dataset contains information on 50,000 customers with 12 features.

To see what the business data looks like, we use the DataFrame.head() method to get the head of the fineTech_appData DataFrame and the DataFrame.tail() method to get its tail.

fineTech_appData.head(6) # show first 6 rows of fineTech_appData DataFrame

Output >>>

Figure 1: head of fineTech_appData

fineTech_appData.tail(6) # show last 6 rows of fineTech_appData DataFrame

Output >>>

Figure 2: tail of fineTech_appData

The full content of the sixth column (screen_list) is not visible, so we use the Python snippet below to print the screen_list value of the five rows with index 1 to 5.

for i in [1,2,3,4,5]:
    print(fineTech_appData.loc[i,'screen_list'],'\n')

Output >>>

joinscreen,product_review,product_review2,ScanPreview,VerifyDateOfBirth,location,VerifyCountry,VerifyPhone,VerifyToken,Institutions,Loan2 

Splash,Cycle,Loan 

product_review,Home,product_review,Loan3,Finances,Credit3,ReferralContainer,Leaderboard,Rewards,RewardDetail,ScanPreview,location,VerifyDateOfBirth,VerifyPhone,VerifySSN,Credit1,Credit2 

idscreen,joinscreen,Cycle,Credit3Container,ScanPreview,VerifyPhone,VerifySSN,Credit1,Loan2,Home,Institutions,SelectInstitution,BankVerification,ReferralContainer,product_review,product_review2,VerifyCountry,VerifyToken,product_review 

idscreen,Cycle,Home,ScanPreview,VerifyPhone,VerifySSN,Credit1,Credit3Dashboard,Loan2,Institutions,product_review,product_review,product_review3 

Know about dataset

As you can see, the fineTech_appData DataFrame holds data on 50,000 users with 12 different features. Let’s look at each feature briefly.

1. user: Unique ID for each user.

2. first_open: Date (yyyy-mm-dd) and time (Hour:Minute:Second.Millisecond) at which the user first opened the app.

3. dayofweek: The day of the week on which the user first opened the app.

  • 0: Sunday
  • 1: Monday
  • 2: Tuesday
  • 3: Wednesday
  • 4: Thursday
  • 5: Friday
  • 6: Saturday

4. hour: Hour of the day (24-hour format) at which the customer first opened the app. It is correlated with the dayofweek column.

5. age: The age of the registered user.

6. screen_list: The name of multiple screens seen by customers, which are separated by a comma.

7. numscreens: The total number of screens seen by customers.

8. minigame: The app contains small finance-related games. 1 if the customer played a mini-game, otherwise 0.

9. used_premium_feature: If the customer used the premium feature of the app then 1 otherwise 0.

10. enrolled: 1 if the user bought the premium version of the app, otherwise 0.

11. enrolled_date: Date (yyyy-mm-dd) and time (Hour:Minute:Second.Millisecond) at which the user bought the premium version of the app.

12. liked: Each screen of the app has a like button. 1 if the customer liked a screen, otherwise 0.

Find the null values in the DataFrame using the DataFrame.isnull() method and sum them per column with the sum() method.

fineTech_appData.isnull().sum() # take summation of null values

Output >>>

user                        0
first_open                  0
dayofweek                   0
hour                        0
age                         0
screen_list                 0
numscreens                  0
minigame                    0
used_premium_feature        0
enrolled                    0
enrolled_date           18926
liked                       0
dtype: int64

No column contains null values except enrolled_date, which has 18,926 of them.
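The number of missing enrolled_date values (18,926) equals the number of not-enrolled users reported further down, which suggests the nulls simply mark customers who never enrolled. A minimal sketch of how that could be checked, using a small made-up frame (the `toy` data is hypothetical, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two relevant columns
toy = pd.DataFrame({
    "enrolled": [1, 0, 1, 0, 0],
    "enrolled_date": ["2013-01-01 10:00:00.000", np.nan,
                      "2013-02-03 08:30:00.000", np.nan, np.nan],
})

n_null = toy["enrolled_date"].isnull().sum()   # missing enrollment dates
n_not_enrolled = (toy["enrolled"] == 0).sum()  # users who never enrolled

# True when every missing date belongs to a non-enrolled user and vice versa
nulls_match = (toy["enrolled_date"].isnull() == (toy["enrolled"] == 0)).all()
print(n_null, n_not_enrolled, nulls_match)
```

Running the same comparison on fineTech_appData would confirm whether the nulls need any imputation at all.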

Take brief information about the dataset using DataFrame.info() method.

fineTech_appData.info() # brief information about Dataset

Output >>>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
user                    50000 non-null int64
first_open              50000 non-null object
dayofweek               50000 non-null int64
hour                    50000 non-null object
age                     50000 non-null int64
screen_list             50000 non-null object
numscreens              50000 non-null int64
minigame                50000 non-null int64
used_premium_feature    50000 non-null int64
enrolled                50000 non-null int64
enrolled_date           31074 non-null object
liked                   50000 non-null int64
dtypes: int64(8), object(4)
memory usage: 4.6+ MB

We can see in the output provided by DataFrame.info() method, there are 50,000 entries (rows) from 0 to 49999 and a total of 12 columns.

All columns have 50,000 non-null values except enrolled_date. It has 31,074 non-null. There is a total of 8 columns that contain integer 64 bit (int64) values and the remaining 4 are object type.

The fineTech_appData DataFrame occupies at least 4.6 MB of memory; the “+” in the output means the object columns are not fully counted.
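The figure reported by info() does not include the string payloads of object columns. As a rough sketch on a made-up frame (the `toy` data is an assumption), DataFrame.memory_usage() with deep=True shows the difference:

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column
toy = pd.DataFrame({
    "user": [1, 2, 3],
    "screen_list": ["Splash,Cycle,Loan", "Home", "idscreen,joinscreen"],
})

shallow = toy.memory_usage(deep=False).sum()  # quick estimate, string contents may be skipped
deep = toy.memory_usage(deep=True).sum()      # also inspects the string contents
print(shallow, deep)
```

On the real dataset, `fineTech_appData.info(memory_usage="deep")` would report the full figure in the same format as above.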

To see how the numeric variables are distributed, we use the DataFrame.describe() method. For each numeric column it gives the count, mean, std (standard deviation), min and max values, and the 25%, 50%, and 75% percentiles.

fineTech_appData.describe() # give the distribution of numerical variables

Output >>>

Figure 3: fineTech_appData.describe() output

From the output, we learn more about the dataset. The mean customer age is 31.72. Only 10.7% of customers played the minigame, 17.2% used premium features of the app, and 16.5% liked a screen. 62.1% of customers enrolled in the premium app.
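The percentages quoted here come from the mean row of describe(): for a column that only holds 0 and 1, the mean is the share of rows equal to 1. A small illustration with made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical 0/1 columns for four customers
toy = pd.DataFrame({"minigame": [0, 0, 1, 0], "enrolled": [1, 1, 0, 1]})

share = toy.mean() * 100  # mean of a 0/1 column, expressed as a percentage
print(share)
```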

The summary statistics of the ‘dayofweek’ column are not very informative because its values are category codes. To get a clearer picture, we print the unique values of each column and how many there are.

# Get the unique values of each column and their count
features = fineTech_appData.columns
for i in features:
    print("""Unique value of {}\n{}\nlen is {} \n........................\n
          """.format(i, fineTech_appData[i].unique(), len(fineTech_appData[i].unique())))

Output >>>

Unique value of user
[235136 333588 254414 ... 302367 324905  27047]
len is 49874 
........................
          
Unique value of first_open
['2012-12-27 02:14:51.273' '2012-12-02 01:16:00.905'
 '2013-03-19 19:19:09.157' ... '2013-02-20 22:41:51.165'
 '2013-04-28 12:33:04.288' '2012-12-14 01:22:44.638']
len is 49747 
........................
          
Unique value of dayofweek
[3 6 1 4 2 0 5]
len is 7 
........................
          
Unique value of hour
[' 02:00:00' ' 01:00:00' ' 19:00:00' ' 16:00:00' ' 18:00:00' ' 09:00:00'
 ' 03:00:00' ' 14:00:00' ' 04:00:00' ' 11:00:00' ' 06:00:00' ' 21:00:00'
 ' 05:00:00' ' 17:00:00' ' 20:00:00' ' 00:00:00' ' 22:00:00' ' 10:00:00'
 ' 08:00:00' ' 15:00:00' ' 13:00:00' ' 23:00:00' ' 12:00:00' ' 07:00:00']
len is 24 
........................
          
Unique value of age
[ 23  24  28  31  20  35  26  29  39  32  25  17  21  55  38  27  48  37
  22  36  30  58  40  33  57  19  45  34  46  56  42  43  41  47  18  53
  44  49  60  50  52  62  63  16  54  70  51  69  68  59  76  75  66  61
  72  65  90  64  67  73  77  71  74  89  78  86  80  82  79  87  81  85
 101  88  83 100  84  98]
len is 78 
........................
       
Unique value of screen_list
['idscreen,joinscreen,Cycle,product_review,ScanPreview,VerifyDateOfBirth,VerifyPhone,VerifyToken,ProfileVerifySSN,Loan2,Settings,ForgotPassword,Login'
 'joinscreen,product_review,product_review2,ScanPreview,VerifyDateOfBirth,location,VerifyCountry,VerifyPhone,VerifyToken,Institutions,Loan2'
 'Splash,Cycle,Loan' ...
 'joinscreen,product_review,product_review2,ScanPreview,VerifyCountry,VerifyPhone,VerifyToken,VerifyDateOfBirth,location,Home'
 'Cycle,Home,product_review,product_review,product_review3,ScanPreview,VerifyDateOfBirth,location,VerifyCountry,VerifyPhone,VerifyToken,product_review,product_review,VerifySSN,product_review,SelectInstitution,BankVerification,product_review,product_review'
 'product_review,ScanPreview,VerifyDateOfBirth,VerifyCountry,ProfileVerifySSN,ProfilePage,ProfileEducation,ProfileEducationMajor,Saving2Amount,Saving8,ProfileMaritalStatus,ProfileChildren,Saving2,Saving9,Saving7,Saving6,Saving5,Home,Loan2']
len is 38799 
........................
         
Unique value of numscreens
[ 15  13   3  40  32  14  41  33  19  25  11   4   9  26   6  20   5   8
  42   1  38  49  35  10  52  50  76  37  16  47  90  24  45  31  39  17
  28  27  57  23  21  12   7  18  48  29 136  34  59  89  22  43  36  56
  30   2  44  92  51  70  58  66  46  55  61  75  71  78  85  62  53  54
  73  68  69  63  64  88 106  80 127  74  72 137  83  77  65 104  60  67
  94  81 110  91  82  96 165  79  86 116  99  98 187  84 111 109 107 162
  97 100  95  87 122 216 115 102 128 234 112 108 114 125 119  93 185 192
 189 153 243 103 101 118 325 141 129 133 126 120 123 134 121 105 113 117
 200 247 179 132 144 130 148]
len is 151 
........................
         
Unique value of minigame
[0 1]
len is 2 
........................
          
Unique value of used_premium_feature
[0 1]
len is 2 
........................
         
Unique value of enrolled
[0 1]
len is 2 
........................
          
Unique value of enrolled_date
[nan '2013-07-05 16:11:49.513' '2013-02-26 18:56:37.841' ...
 '2013-02-25 19:36:56.082' '2013-05-09 13:47:52.875'
 '2013-04-28 12:35:38.709']
len is 31002 
........................
          
Unique value of liked
[0 1]
len is 2 
........................

The output above tells us more about the ‘dayofweek’ and ‘hour’ columns: customers registered on the app on every day of the week and in every hour of the day.
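When only the number of distinct values per column is needed, DataFrame.nunique() gives a compact summary. A brief sketch on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

# Hypothetical sample of two columns
toy = pd.DataFrame({
    "dayofweek": [3, 6, 1, 3],
    "hour": [" 02:00:00", " 01:00:00", " 02:00:00", " 19:00:00"],
})

unique_counts = toy.nunique()  # number of distinct values in each column
print(unique_counts)
```

Note that nunique() ignores NaN by default, so for a column with missing values it can report one less than `len(column.unique())`.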

The ‘hour’ column has the object data type, so we convert it to an integer data type.

# convert hour from string to int
fineTech_appData['hour'] = fineTech_appData.hour.str.slice(1,3).astype(int) 

# get data type of each columns
fineTech_appData.dtypes

Output >>>

user                     int64
first_open              object
dayofweek                int64
hour                     int32
age                      int64
screen_list             object
numscreens               int64
minigame                 int64
used_premium_feature     int64
enrolled                 int64
enrolled_date           object
liked                    int64
dtype: object
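The slice(1, 3) call relies on every value having exactly one leading space, as in ' 02:00:00'. A sketch on made-up values compares it with a variant that strips whitespace first and would therefore tolerate different padding (the `hours` series is hypothetical):

```python
import pandas as pd

# Hypothetical values in the same format as the 'hour' column
hours = pd.Series([" 02:00:00", " 19:00:00", " 00:00:00"])

# Approach used in this walkthrough: characters 1 and 2 hold the hour
sliced = hours.str.slice(1, 3).astype(int)

# Alternative: strip padding, then take the part before the first ':'
stripped = hours.str.strip().str.split(":").str[0].astype(int)

print(sliced.tolist(), stripped.tolist())
```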

To visualize the data we need numeric values, so we drop the columns whose data type is object, along with the ‘user’ ID column.

# drop the user ID and the object dtype columns
fineTech_appData2 = fineTech_appData.drop(['user', 'first_open', 'screen_list', 'enrolled_date'], axis = 1)
fineTech_appData2.head(6) # head of numeric dataFrame

Output >>>

Figure 4: head of fineTech_appData2



Data visualization

Heatmap using the correlation matrix

A heatmap of the correlation matrix shows the correlation between every pair of features.

# Heatmap
plt.figure(figsize=(16,9)) # heatmap size in ratio 16:9

sns.heatmap(fineTech_appData2.corr(), annot = True, cmap ='coolwarm') # show heatmap

plt.title("Heatmap using correlation matrix of fineTech_appData2", fontsize = 25) # title of heatmap

Output >>>

Figure 5: heatmap of fineTech_appData2

In the fineTech_appData2 dataset, there is no strong correlation between any features. There is a weak correlation between ‘numscreens’ and ‘enrolled’, which means that customers who saw more screens were more likely to take the premium app. There is a slight correlation of ‘minigame’ with ‘enrolled’ and ‘used_premium_feature’. ‘age’ has a slightly negative correlation with ‘enrolled’ and ‘numscreens’, which means that older customers were less likely to use the premium app and viewed fewer screens.

Pair plot of fineTech_appData2

The pair plot shows the distribution of each feature and a scatter plot for each pair of features.

# Pairplot of fineTech_appData2 Dataset

#%matplotlib qt5 # to show the graph in a separate window
sns.pairplot(fineTech_appData2, hue  = 'enrolled')

Output >>>

Figure 6: pair plot of fineTech_appData2

In the pair plot we can see that most features take only two values, 0 and 1, and that the orange dots mark the enrolled customers. Next, we visualize a count plot of the enrolled data.

Countplot of enrolled

# Show countplot of 'enrolled' feature
sns.countplot(x='enrolled', data=fineTech_appData) # keyword arguments, as required by recent seaborn versions

Output >>>

Figure 7: count plot of enrolled

Here you can see the exact numbers of enrolled and not enrolled customers.

# number of enrolled and not enrolled customers
print("Not enrolled user = ", (fineTech_appData.enrolled < 1).sum(), "out of 50000")
print("Enrolled user = ",50000-(fineTech_appData.enrolled < 1).sum(),  "out of 50000")

Output >>>

Not enrolled user =  18926 out of 50000
Enrolled user =  31074 out of 50000
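The same counts, and the corresponding shares, can also be read off with Series.value_counts(). A short sketch on a made-up label column (the `enrolled` series is hypothetical):

```python
import pandas as pd

# Hypothetical 0/1 labels for eight customers
enrolled = pd.Series([1, 0, 1, 1, 0, 1, 1, 0])

counts = enrolled.value_counts()                # absolute count per class
shares = enrolled.value_counts(normalize=True)  # fraction per class
print(counts)
print(shares)
```

For the real data, 31074 out of 50000 corresponds to a share of about 0.62, which matches the 62.1% seen in the describe() output.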

Histogram of each feature of fineTech_appData2

The pair plot already showed the distribution of each feature; here we plot histograms to make them easier to read.

# plot histogram 

plt.figure(figsize = (16,9)) # figure size in ratio 16:9
features = fineTech_appData2.columns # list of columns name
for i,j in enumerate(features): 
    plt.subplot(3,3,i+1) # create subplot for histogram
    plt.title("Histogram of {}".format(j), fontsize = 15) # title of histogram
    
    bins = len(fineTech_appData2[j].unique()) # bins for histogram
    plt.hist(fineTech_appData2[j], bins = bins, rwidth = 0.8, edgecolor = "y", linewidth = 2, ) # plot histogram
    
plt.subplots_adjust(hspace=0.5) # vertical space between subplot rows

Output >>>

Figure 8: histograms of the fineTech_appData2 features

The histograms above show that minigame, used_premium_feature, enrolled, and liked take only two values, and how those values are distributed.

The histogram of ‘dayofweek’ shows that slightly fewer customers registered on the app on Tuesday and Wednesday.

The histogram of ‘hour’ shows that the fewest customers registered on the app around 10 AM.

The ‘age’ histogram shows that most customers are young.

The ‘numscreens’ histogram shows that few customers saw more than 40 screens.

Correlation barplot with ‘enrolled’ feature

Now we use a bar plot to see how strongly, and in which direction, each feature is correlated with the ‘enrolled’ feature.

# show correlation barplot

sns.set() # set background dark grid
plt.figure(figsize = (14,5))
plt.title("Correlation all features with 'enrolled' ", fontsize = 20)
fineTech_appData3 = fineTech_appData2.drop(['enrolled'], axis = 1) # drop 'enrolled' feature
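corr_with_enrolled = fineTech_appData3.corrwith(fineTech_appData2.enrolled) # correlation of each feature with 'enrolled'
ax = sns.barplot(x=corr_with_enrolled.index, y=corr_with_enrolled.values) # plot barplot (keyword arguments for recent seaborn versions)
ax.tick_params(labelsize=15, labelrotation = 20, color ="k") # decorate x & y ticks font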

Output >>>

Figure 9: bar plot of correlation with ‘enrolled’

The heatmap of the correlation matrix did not show these correlations clearly, but the bar plot above makes it easy to see how each feature is correlated with the ‘enrolled’ feature.

‘numscreens’ and ‘minigame’ are more positively correlated with the ‘enrolled’ feature than the other features.

‘hour’, ‘age’ and ‘used_premium_feature’ are negatively correlated with the ‘enrolled’ feature.
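To read the exact values behind such a bar plot, the correlations can be sorted and printed. The sketch below uses synthetic data (all values and the relationship between the columns are assumptions made up for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
numscreens = rng.integers(1, 60, size=n)
age = rng.integers(18, 70, size=n)
# Synthetic target loosely tied to numscreens so that a correlation exists
enrolled = (numscreens + rng.normal(0, 15, size=n) > 30).astype(int)
toy = pd.DataFrame({"numscreens": numscreens, "age": age, "enrolled": enrolled})

# Pearson correlation of every feature with the target, largest first
corr_with_target = (toy.drop(columns="enrolled")
                       .corrwith(toy["enrolled"])
                       .sort_values(ascending=False))
print(corr_with_target)
```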


Now we parse the ‘first_open’ and ‘enrolled_date’ object data into date and time format.

# parse object data into datetime format

fineTech_appData['first_open'] =[parser.parse(i) for i in fineTech_appData['first_open']]

fineTech_appData['enrolled_date'] =[parser.parse(i) if isinstance(i, str) else i for i in fineTech_appData['enrolled_date']]

fineTech_appData.dtypes

Output >>>

user                             int64
first_open              datetime64[ns]
dayofweek                        int64
hour                             int32
age                              int64
screen_list                     object
numscreens                       int64
minigame                         int64
used_premium_feature             int64
enrolled                         int64
enrolled_date           datetime64[ns]
liked                            int64
dtype: object
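As an alternative to looping with dateutil, pandas offers the vectorized pd.to_datetime(), which also turns missing values into NaT. A minimal sketch on made-up strings in the same format (the `raw` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps, including one missing value
raw = pd.Series(["2012-12-27 02:14:51.273", np.nan, "2013-07-05 16:11:49.513"])

parsed = pd.to_datetime(raw)  # vectorized parsing, NaN becomes NaT
print(parsed)
```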

Next, we find out how long each customer took to enroll in the premium app after registration. To do so, we subtract ‘fineTech_appData.first_open’ from ‘fineTech_appData.enrolled_date’ and express the difference in whole hours.

# whole hours between first open and enrollment (NaN for customers who never enrolled);
# astype('timedelta64[h]') is rejected by pandas 2.x, so convert through total seconds
fineTech_appData['time_to_enrolled'] = (fineTech_appData.enrolled_date - fineTech_appData.first_open).dt.total_seconds() // 3600
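A worked example of this subtraction on two made-up customers (all timestamps are hypothetical). The sketch converts the difference through `.dt.total_seconds()`, which is one way to obtain whole hours:

```python
import pandas as pd

# Hypothetical customers: one enrolled 2.5 hours after first open, one never enrolled
first_open = pd.to_datetime(pd.Series(["2013-01-01 08:00:00", "2013-01-01 08:00:00"]))
enrolled_date = pd.to_datetime(pd.Series(["2013-01-01 10:30:00", None]))

delta = enrolled_date - first_open       # Timedelta, NaT where no enrollment
hours = delta.dt.total_seconds() // 3600 # whole hours, NaN where no enrollment
print(hours)
```

The first customer gets 2.0 hours, and the second stays NaN, which is why dropna() is needed before plotting.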

Let’s show the distribution of the time taken to enroll.

# Plot histogram
plt.hist(fineTech_appData['time_to_enrolled'].dropna())

Output >>>

Figure 10: histogram of time_to_enrolled

Let’s show the distribution in the range 0 to 100 hours.

# Plot histogram
plt.hist(fineTech_appData['time_to_enrolled'].dropna(), range = (0,100)) 

Output >>>

Figure 11: histogram of time_to_enrolled, 0 to 100 hours

The histogram above shows that most customers who enrolled did so within 10 hours of registration.


Feature selection

We treat customers who enrolled more than 48 hours after registration as not enrolled (0).

# Customers who enrolled after more than 48 hours are set to 0
fineTech_appData.loc[fineTech_appData.time_to_enrolled > 48, 'enrolled'] = 0
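To make the effect of this relabeling concrete, here is the same .loc assignment applied to four made-up customers (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical customers: quick enroller, late enroller, never enrolled, exactly 48 hours
toy = pd.DataFrame({
    "enrolled": [1, 1, 0, 1],
    "time_to_enrolled": [3.0, 120.0, np.nan, 48.0],
})

# Only rows strictly above 48 hours are relabeled; NaN compares as False and is untouched
toy.loc[toy["time_to_enrolled"] > 48, "enrolled"] = 0
print(toy)
```

Only the second customer changes from 1 to 0; the customer at exactly 48 hours keeps the label 1.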

Drop the ‘time_to_enrolled’, ‘enrolled_date’, and ‘first_open’ features, since they are not strongly correlated with the result.

fineTech_appData.drop(columns = ['time_to_enrolled', 'enrolled_date', 'first_open'], inplace=True)

Read another CSV file that contains the names of the top screens.


# read csv file and convert it into numpy array
fineTech_app_screen_Data = pd.read_csv("Dataset/FineTech appData/top_screens.csv").top_screens.values

fineTech_app_screen_Data

Output >>>

array(['Loan2', 'location', 'Institutions', 'Credit3Container',
       'VerifyPhone', 'BankVerification', 'VerifyDateOfBirth',
       'ProfilePage', 'VerifyCountry', 'Cycle', 'idscreen',
       'Credit3Dashboard', 'Loan3', 'CC1Category', 'Splash', 'Loan',
       'CC1', 'RewardsContainer', 'Credit3', 'Credit1', 'EditProfile',
       'Credit2', 'Finances', 'CC3', 'Saving9', 'Saving1', 'Alerts',
       'Saving8', 'Saving10', 'Leaderboard', 'Saving4', 'VerifyMobile',
       'VerifyHousing', 'RewardDetail', 'VerifyHousingAmount',
       'ProfileMaritalStatus', 'ProfileChildren ', 'ProfileEducation',
       'Saving7', 'ProfileEducationMajor', 'Rewards', 'AccountView',
       'VerifyAnnualIncome', 'VerifyIncomeType', 'Saving2', 'Saving6',
       'Saving2Amount', 'Saving5', 'ProfileJobTitle', 'Login',
       'ProfileEmploymentLength', 'WebView', 'SecurityModal', 'Loan4',
       'ResendToken', 'TransactionList', 'NetworkFailure', 'ListPicker'],
      dtype=object)

Add ‘,’ at the end of each string of ‘screen_list’ for further operation.

fineTech_appData['screen_list'] = fineTech_appData.screen_list.astype(str) + ','

The ‘screen_list’ column contains string values that we can’t use directly. To solve this, we take each screen name from ‘fineTech_app_screen_Data’ and append a column with the same name to ‘fineTech_appData’. The new column holds 1 if that screen name appears in ‘screen_list’ and 0 otherwise, and the screen name is then removed from ‘screen_list’.

# convert screen names into 0/1 indicator columns

for screen_name in fineTech_app_screen_Data:
    # plain substring match, so a short name such as 'Loan' also matches 'Loan4'
    fineTech_appData[screen_name] = fineTech_appData.screen_list.str.contains(screen_name, regex=False).astype(int)
    # remove the matched screen so that only the non-top screens remain
    fineTech_appData['screen_list'] = fineTech_appData.screen_list.str.replace(screen_name+",", "", regex=False)
# get shape
fineTech_appData.shape

Output >>> (50000, 68)

You can see that the dataset has grown to 68 columns: the 10 columns that remained after the drops above plus 58 screen columns.
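Because str.contains() matches substrings, a screen name that is a prefix of another (for example Loan and Loan4, or Saving1 and Saving10) can be flagged even when only the longer screen was visited. The sketch below contrasts this with exact-token matching via Series.str.get_dummies(); it is an alternative shown for comparison, not what this walkthrough does, and switching to it would change the feature values and the results reported later. The `screens` data is made up:

```python
import pandas as pd

# Hypothetical screen lists for three customers
screens = pd.Series(["Loan4,Splash", "Loan,Saving10", "Saving1"])

# Substring match: the first customer is flagged for 'Loan' because of 'Loan4'
substring_flag = screens.str.contains("Loan", regex=False).astype(int)

# Exact-token match: one 0/1 column per distinct comma-separated screen name
exact = screens.str.get_dummies(sep=",")

print(substring_flag.tolist())
print(exact)
```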

# head of DataFrame
fineTech_appData.head(6)

Output >>>

Figure 12: head of fineTech_appData with the screen columns

The screens that are not listed in ‘fineTech_app_screen_Data’ are counted, and the count is stored in a new column named ‘remain_screen_list’.

# remain screen in 'screen_list'
fineTech_appData.loc[0,'screen_list']

Output >>>

'joinscreen,product_review,ScanPreview,VerifyToken,ProfileVerifySSN,Settings,ForgotPassword,'

# count remain screen list and store counted number in 'remain_screen_list'

fineTech_appData['remain_screen_list'] = fineTech_appData.screen_list.str.count(",")
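Counting commas works as a screen count only because a trailing comma was appended to every ‘screen_list’ string earlier. A quick check in plain Python on the row-0 string shown above:

```python
# Remaining screens of row 0, copied from the output above
remaining = "joinscreen,product_review,ScanPreview,VerifyToken,ProfileVerifySSN,Settings,ForgotPassword,"

n_commas = remaining.count(",")                          # what str.count(",") computes
n_tokens = len([s for s in remaining.split(",") if s])   # number of non-empty screen names

print(n_commas, n_tokens)
```

Both numbers agree; without the trailing comma the comma count would be one lower than the number of screens.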

Drop the ‘screen_list’ column.

# Drop the 'screen_list'
fineTech_appData.drop(columns = ['screen_list'], inplace=True)

We now have 68 columns in total.

# total columns
fineTech_appData.columns

Output >>>

Index(['user', 'dayofweek', 'hour', 'age', 'numscreens', 'minigame',
       'used_premium_feature', 'enrolled', 'liked', 'Loan2', 'location',
       'Institutions', 'Credit3Container', 'VerifyPhone', 'BankVerification',
       'VerifyDateOfBirth', 'ProfilePage', 'VerifyCountry', 'Cycle',
       'idscreen', 'Credit3Dashboard', 'Loan3', 'CC1Category', 'Splash',
       'Loan', 'CC1', 'RewardsContainer', 'Credit3', 'Credit1', 'EditProfile',
       'Credit2', 'Finances', 'CC3', 'Saving9', 'Saving1', 'Alerts', 'Saving8',
       'Saving10', 'Leaderboard', 'Saving4', 'VerifyMobile', 'VerifyHousing',
       'RewardDetail', 'VerifyHousingAmount', 'ProfileMaritalStatus',
       'ProfileChildren ', 'ProfileEducation', 'Saving7',
       'ProfileEducationMajor', 'Rewards', 'AccountView', 'VerifyAnnualIncome',
       'VerifyIncomeType', 'Saving2', 'Saving6', 'Saving2Amount', 'Saving5',
       'ProfileJobTitle', 'Login', 'ProfileEmploymentLength', 'WebView',
       'SecurityModal', 'Loan4', 'ResendToken', 'TransactionList',
       'NetworkFailure', 'ListPicker', 'remain_screen_list'],
      dtype='object')

All the saving screens are correlated with each other, so for each customer we take the sum of all saving screen indicators and store it in a single column.

# take sum of all saving screen in one place
saving_screens = ['Saving1',
                  'Saving2',
                  'Saving2Amount',
                  'Saving4',
                  'Saving5',
                  'Saving6',
                  'Saving7',
                  'Saving8',
                  'Saving9',
                  'Saving10',
                 ]
fineTech_appData['saving_screens_count'] = fineTech_appData[saving_screens].sum(axis = 1)
fineTech_appData.drop(columns = saving_screens, inplace = True)

Similarly for the credit, CC, and loan screens.

credit_screens = ['Credit1',
                  'Credit2',
                  'Credit3',
                  'Credit3Container',
                  'Credit3Dashboard',
                 ]
fineTech_appData['credit_screens_count'] = fineTech_appData[credit_screens].sum(axis = 1)
fineTech_appData.drop(columns = credit_screens, inplace = True)

cc_screens = ['CC1',
              'CC1Category',
              'CC3',
             ]
fineTech_appData['cc_screens_count'] = fineTech_appData[cc_screens].sum(axis = 1)
fineTech_appData.drop(columns = cc_screens, inplace = True)

loan_screens = ['Loan',
                'Loan2',
                'Loan3',
                'Loan4',
               ]
fineTech_appData['loan_screens_count'] = fineTech_appData[loan_screens].sum(axis = 1)
fineTech_appData.drop(columns = loan_screens, inplace = True)
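The four blocks above repeat the same pattern, so they could also be written as a loop over a mapping from new column name to screen group. A sketch on a made-up frame (the `toy` data and the shortened groups are assumptions for illustration):

```python
import pandas as pd

# Hypothetical indicator columns for two customers
toy = pd.DataFrame({
    "Loan": [1, 0], "Loan2": [1, 1],
    "CC1": [0, 1], "CC3": [0, 1],
    "age": [23, 31],
})

# New column name -> indicator columns it replaces
groups = {
    "loan_screens_count": ["Loan", "Loan2"],
    "cc_screens_count": ["CC1", "CC3"],
}

for new_col, cols in groups.items():
    toy[new_col] = toy[cols].sum(axis=1)  # number of screens visited in the group
    toy = toy.drop(columns=cols)          # remove the individual indicators

print(toy)
```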

Now, you can see the shape of DataFrame is reduced.

fineTech_appData.shape

Output >>> (50000, 50)

See information of fineTech_appData

fineTech_appData.info()

Output >>>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 50 columns):
user                       50000 non-null int64
dayofweek                  50000 non-null int64
hour                       50000 non-null int32
age                        50000 non-null int64
numscreens                 50000 non-null int64
minigame                   50000 non-null int64
used_premium_feature       50000 non-null int64
enrolled                   50000 non-null int64
liked                      50000 non-null int64
location                   50000 non-null int32
Institutions               50000 non-null int32
VerifyPhone                50000 non-null int32
BankVerification           50000 non-null int32
VerifyDateOfBirth          50000 non-null int32
ProfilePage                50000 non-null int32
VerifyCountry              50000 non-null int32
Cycle                      50000 non-null int32
idscreen                   50000 non-null int32
Splash                     50000 non-null int32
RewardsContainer           50000 non-null int32
EditProfile                50000 non-null int32
Finances                   50000 non-null int32
Alerts                     50000 non-null int32
Leaderboard                50000 non-null int32
VerifyMobile               50000 non-null int32
VerifyHousing              50000 non-null int32
RewardDetail               50000 non-null int32
VerifyHousingAmount        50000 non-null int32
ProfileMaritalStatus       50000 non-null int32
ProfileChildren            50000 non-null int32
ProfileEducation           50000 non-null int32
ProfileEducationMajor      50000 non-null int32
Rewards                    50000 non-null int32
AccountView                50000 non-null int32
VerifyAnnualIncome         50000 non-null int32
VerifyIncomeType           50000 non-null int32
ProfileJobTitle            50000 non-null int32
Login                      50000 non-null int32
ProfileEmploymentLength    50000 non-null int32
WebView                    50000 non-null int32
SecurityModal              50000 non-null int32
ResendToken                50000 non-null int32
TransactionList            50000 non-null int32
NetworkFailure             50000 non-null int32
ListPicker                 50000 non-null int32
remain_screen_list         50000 non-null int64
saving_screens_count       50000 non-null int64
credit_screens_count       50000 non-null int64
cc_screens_count           50000 non-null int64
loan_screens_count         50000 non-null int64
dtypes: int32(37), int64(13)
memory usage: 12.0 MB

# Numerical distribution of fineTech_appData
fineTech_appData.describe()

Output >>>

Figure 12-1: fineTech_appData.describe() output



Heatmap with the correlation matrix

# Heatmap with correlation matrix of new fineTech_appData

plt.figure(figsize = (25,16)) 
sns.heatmap(fineTech_appData.corr(), annot = True, linewidths = 2) # linewidths is the heatmap parameter for cell borders

Output >>>

Figure 13: heatmap of the new fineTech_appData dataset

Data preprocessing

Split dataset in Train and Test

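clean_fineTech_appData = fineTech_appData.copy() # independent copy; a plain assignment would only create an alias
target = fineTech_appData['enrolled'] # target variable
fineTech_appData.drop(columns = 'enrolled', inplace = True) # keep only the features

from sklearn.model_selection import train_test_split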
X_train, X_test, y_train, y_test = train_test_split(fineTech_appData, target, test_size = 0.2, random_state = 0)
print('Shape of X_train = ', X_train.shape)
print('Shape of X_test = ', X_test.shape)
print('Shape of y_train = ', y_train.shape)
print('Shape of y_test = ', y_test.shape)

Output >>>

Shape of X_train =  (40000, 49)
Shape of X_test =  (10000, 49)
Shape of y_train =  (40000,)
Shape of y_test =  (10000,)

# take User ID in another variable
train_userID = X_train['user']
X_train = X_train.drop(columns= 'user') # reassign rather than drop in place on a split result
test_userID = X_test['user']
X_test = X_test.drop(columns= 'user')
print('Shape of X_train = ', X_train.shape)
print('Shape of X_test = ', X_test.shape)
print('Shape of train_userID = ', train_userID.shape)
print('Shape of test_userID = ', test_userID.shape)

Output >>>

Shape of X_train =  (40000, 48)
Shape of X_test =  (10000, 48)
Shape of train_userID =  (40000,)
Shape of test_userID =  (10000,)
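The split above is purely random. Since the two classes are not equally frequent, one option (not used in this walkthrough, shown only as a sketch) is the `stratify` argument of train_test_split, which keeps the class ratio the same in both parts. The data below is made up:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and a 60/40 target
X = pd.DataFrame({"user": range(100), "age": np.arange(100) % 50 + 18})
y = pd.Series([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Keep the IDs aside and remove them from the features
ids_te = X_te["user"]
X_te = X_te.drop(columns="user")

print(X_tr.shape, X_te.shape, y_te.mean())
```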

Feature Scaling

The features are measured on different scales, so for the best accuracy we standardize them all to a common scale.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
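StandardScaler subtracts each column's mean and divides by its standard deviation, both learned from the training data only, and then applies those same numbers to the test data. The NumPy sketch below reproduces that arithmetic on made-up values (the arrays are hypothetical):

```python
import numpy as np

# Hypothetical training and test rows: columns could be age and numscreens
train = np.array([[20.0, 3.0], [30.0, 15.0], [40.0, 40.0], [50.0, 2.0]])
test = np.array([[35.0, 10.0]])

mean = train.mean(axis=0)  # per-column mean of the training data
std = train.std(axis=0)    # population standard deviation (ddof=0), as StandardScaler uses

train_sc = (train - mean) / std  # equivalent of fit_transform on the training data
test_sc = (test - mean) / std    # equivalent of transform on the test data

print(train_sc.mean(axis=0), train_sc.std(axis=0), test_sc)
```

Fitting on the training set only is what keeps information about the test set from leaking into the model.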



Machine Learning Model Building

The target variable is categorical (0 or 1), so we use supervised classification algorithms.

To find the best model, we train and test several machine learning algorithms on the dataset and compare them. So let’s try.
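The compare-several-models idea can also be expressed as a single loop over a dictionary of estimators. The sketch below runs on synthetic data from make_classification (an assumption made for illustration, not the FineTech dataset), so its scores say nothing about the real problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale with statistics learned from the training part only
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

models = {
    "decision_tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr_sc, y_tr)  # train on the scaled training data
    scores[name] = accuracy_score(y_te, model.predict(X_te_sc))

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(acc, 4))
```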

First, we import the required packages.

# import required packages
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
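Accuracy alone hides which kind of mistake a model makes, which matters here because offers are targeted at specific customers. As a sketch of what the three imported functions return, here they are on made-up labels (`y_true` and `y_pred` are hypothetical):

```python
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Hypothetical true labels and predictions for five customers
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)             # share of correct predictions
report = classification_report(y_true, y_pred)   # precision, recall, f1 per class

print(cm)
print(acc)
print(report)
```

In the confusion matrix, the single entry below the diagonal is the enrolled customer that was predicted as not enrolled.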

Decision Tree Classifier

# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion= 'entropy', random_state=0)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
accuracy_score(y_test, y_pred_dt)

Output >>> 0.6936

# train with standard-scaled dataset
dt_model2 = DecisionTreeClassifier(criterion= 'entropy', random_state=0)
dt_model2.fit(X_train_sc, y_train)
y_pred_dt_sc = dt_model2.predict(X_test_sc)
accuracy_score(y_test, y_pred_dt_sc)

Output >>> 0.6932

K – Nearest Neighbor Classifier

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2,)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)

accuracy_score(y_test, y_pred_knn)

Output >>> 0.6994

# train with standard-scaled dataset
knn_model2 = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2,)
knn_model2.fit(X_train_sc, y_train)
y_pred_knn_sc = knn_model2.predict(X_test_sc)

accuracy_score(y_test, y_pred_knn_sc)

Output >>> 0.7314

Naive Bayes Classifier

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

accuracy_score(y_test, y_pred_nb)

Output >>> 0.7114

# train with the standard-scaled dataset
nb_model2 = GaussianNB()
nb_model2.fit(X_train_sc, y_train)
y_pred_nb_sc = nb_model2.predict(X_test_sc)

accuracy_score(y_test, y_pred_nb_sc)

Output >>> 0.7114

Random Forest Classifier

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

accuracy_score(y_test, y_pred_rf)

Output >>> 0.7621

# train with the standard-scaled dataset
rf_model2 = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
rf_model2.fit(X_train_sc, y_train)
y_pred_rf_sc = rf_model2.predict(X_test_sc)

accuracy_score(y_test, y_pred_rf_sc)

Output >>> 0.7616

Logistic Regression

# Logistic Regression
from sklearn.linear_model import LogisticRegression
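# solver='liblinear' supports the L1 penalty (the default solver in recent scikit-learn versions does not)
lr_model = LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear')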
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

accuracy_score(y_test, y_pred_lr)

Output >>> 0.7684

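# train with the standard-scaled dataset
# solver='liblinear' supports the L1 penalty (the default solver in recent scikit-learn versions does not)
lr_model2 = LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear')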
lr_model2.fit(X_train_sc, y_train)
y_pred_lr_sc = lr_model2.predict(X_test_sc)

accuracy_score(y_test, y_pred_lr_sc)

Output >>> 0.7681

Support Vector Classifier

# Support Vector Machine
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)
y_pred_svc = svc_model.predict(X_test)

accuracy_score(y_test, y_pred_svc)

Output >>> 0.7616

# train with the standard-scaled dataset
svc_model2 = SVC()
svc_model2.fit(X_train_sc, y_train)
y_pred_svc_sc = svc_model2.predict(X_test_sc)

accuracy_score(y_test, y_pred_svc_sc)

Output >>> 0.779

XGBoost Classifier

# XGBoost Classifier
from xgboost import XGBClassifier
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
accuracy_score(y_test, y_pred_xgb)

Output >>> 0.7748

# train with the standard-scaled dataset
xgb_model2 = XGBClassifier()
xgb_model2.fit(X_train_sc, y_train)
y_pred_xgb_sc = xgb_model2.predict(X_test_sc)

accuracy_score(y_test, y_pred_xgb_sc)

Output >>> 0.7748

# XGB classifier with parameter tuning
xgb_model_pt1 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=5000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

xgb_model_pt1.fit(X_train, y_train)
y_pred_xgb_pt1 = xgb_model_pt1.predict(X_test)

accuracy_score(y_test, y_pred_xgb_pt1)

Output >>> 0.7887

# XGB classifier with parameter tuning
# train with the standard-scaled dataset
xgb_model_pt2 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=5000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

xgb_model_pt2.fit(X_train_sc, y_train)
y_pred_xgb_sc_pt2 = xgb_model_pt2.predict(X_test_sc)

accuracy_score(y_test, y_pred_xgb_sc_pt2)

Output >>> 0.7887

We observe that the Support Vector Classifier and the XGBoost Classifier give better accuracy than the other ML algorithms. We will continue with the XGBoost classifier because its accuracy after parameter tuning (0.7887) is slightly higher than that of SVC (0.779).
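
The model-by-model comparison above can also be written as one loop. The following is a minimal sketch, assuming synthetic stand-in data from `make_classification` and a reduced set of scikit-learn models; the names `candidates`, `results`, and the `Xd_*`/`yd_*` arrays are hypothetical and would be replaced by the project's own `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); use the project's own split in practice
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_sc = StandardScaler()
Xd_train_sc = demo_sc.fit_transform(Xd_train)
Xd_test_sc = demo_sc.transform(Xd_test)

candidates = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=0),
}

results = {}
for name, model in candidates.items():
    # clone() gives a fresh, unfitted copy for each run
    raw_acc = accuracy_score(yd_test, clone(model).fit(Xd_train, yd_train).predict(Xd_test))
    sc_acc = accuracy_score(yd_test, clone(model).fit(Xd_train_sc, yd_train).predict(Xd_test_sc))
    results[name] = (raw_acc, sc_acc)

# Print the models from best to worst by their better score
for name, (raw_acc, sc_acc) in sorted(results.items(), key=lambda kv: max(kv[1]), reverse=True):
    print(f"{name:20s} raw = {raw_acc:.4f}   scaled = {sc_acc:.4f}")
```

Collecting the scores in one dictionary makes it easier to add or remove algorithms without copying the train/predict/score lines each time.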

Confusion Matrix

# confusion matrix
cm_xgb_pt2 = confusion_matrix(y_test, y_pred_xgb_sc_pt2)
sns.heatmap(cm_xgb_pt2, annot = True, fmt = 'g')
plt.title("Confusion Matrix", fontsize = 20)

Output >>>

[Figure: confusion matrix heatmap of the XGBoost model]

The model makes more type II errors (false negatives) than type I errors (false positives).
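
As a small illustration of how the two error types are read from a confusion matrix, the sketch below uses the matrix values that this article reports later for the same model. The variable names are hypothetical; the row/column layout (rows are actual classes, columns are predicted classes) is the one scikit-learn's `confusion_matrix` uses.

```python
import numpy as np

# Confusion matrix values reported for this model further down in the article
cm = np.array([[4156, 916],
               [1197, 3731]])

# scikit-learn layout: rows = actual class, columns = predicted class
tn, fp, fn, tp = cm.ravel()

type_1_rate = fp / (fp + tn)  # share of actual 0s wrongly predicted as 1
type_2_rate = fn / (fn + tp)  # share of actual 1s wrongly predicted as 0

print("Type I errors (false positives)  =", fp)
print("Type II errors (false negatives) =", fn)
print(f"Type I rate = {type_1_rate:.4f}, Type II rate = {type_2_rate:.4f}")
```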

Classification report of ML model

# Classification Report
cr_xgb_pt2 = classification_report(y_test, y_pred_xgb_sc_pt2)

print("Classification report >>> \n", cr_xgb_pt2)

Output >>>

Classification report >>> 
               precision    recall  f1-score   support

           0       0.78      0.82      0.80      5072
           1       0.80      0.76      0.78      4928

   micro avg       0.79      0.79      0.79     10000
   macro avg       0.79      0.79      0.79     10000
weighted avg       0.79      0.79      0.79     10000
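
To show where the numbers in this report come from, here is a minimal sketch that recomputes precision, recall, and F1-score per class from the confusion matrix values reported later in this article. It is plain NumPy arithmetic for illustration, not a replacement for `classification_report`, and the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix values reported for this model (rows = actual, columns = predicted)
cm = np.array([[4156, 916],
               [1197, 3731]])

correct = np.diag(cm)                # correctly predicted samples per class
precision = correct / cm.sum(axis=0) # correct / all samples predicted as that class
recall = correct / cm.sum(axis=1)    # correct / all samples that truly are that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for label in (0, 1):
    print(f"class {label}: precision = {precision[label]:.2f}, "
          f"recall = {recall[label]:.2f}, f1-score = {f1[label]:.2f}")
print(f"accuracy = {accuracy:.4f}")
```

Rounded to two decimals, these values agree with the report above.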

Cross-validation of the ML model

We use cross-validation to check whether the ML model is overfitted, underfitted, or generalized.

# Cross validation
from sklearn.model_selection import cross_val_score
cross_validation = cross_val_score(estimator = xgb_model_pt2, X = X_train_sc, y = y_train, cv = 10)
print("Cross validation of XGBoost model = ",cross_validation)
print("Cross validation of XGBoost model (in mean) = ",cross_validation.mean())

Output >>>

Cross validation of XGBoost model =  
[0.79255186 0.77855536 0.78875    0.78625    0.7795     0.78575
 0.79       0.7815     0.78944736 0.77844461]

Cross validation of XGBoost model (in mean) =  0.7850749196187449

The mean cross-validation accuracy (about 78.5%) is close to the XGBoost model's test accuracy (78.87%). That means our XGBoost model is a generalized model.
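
Besides the mean, the spread of the fold scores is also worth a look: a small spread suggests that the model behaves similarly on every fold. The sketch below simply reuses the ten scores printed above; the variable names and the comparison with the test accuracy are illustrative assumptions.

```python
import numpy as np

# The ten fold scores copied from the cross-validation output above
cv_scores = np.array([0.79255186, 0.77855536, 0.78875, 0.78625, 0.7795,
                      0.78575, 0.79, 0.7815, 0.78944736, 0.77844461])
test_accuracy = 0.7887  # accuracy of the tuned XGBoost model on the test set

cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
gap = abs(test_accuracy - cv_mean)

print(f"CV mean = {cv_mean:.4f}, CV std = {cv_std:.4f}")
print(f"Gap between test accuracy and CV mean = {gap:.4f}")
```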


Mapping predicted output to the target

The output below shows, for each user, the actual target value next to the value predicted by the model.

final_result = pd.concat([test_userID, y_test], axis = 1)
final_result['predicted result'] = y_pred_xgb_sc_pt2

print(final_result)

Output >>>

         user  enrolled  predicted result
11841  239786         1                 1
19602  279644         1                 1
45519   98290         0                 0
25747  170150         1                 1
42642  237568         1                 0
31902   65042         1                 0
30346  207226         1                 1
12363  363062         0                 0
32490  152296         1                 1
26128   64484         0                 0
14227   38108         1                 1
26376  359940         0                 0
44173  136089         0                 0
12968   14231         1                 1
32104  216038         0                 0
17844   18918         1                 1
43460  316730         1                 1
8369    28308         1                 0
15055  228387         1                 1
6338    69640         1                 1
15301  358264         0                 0
46250  348059         0                 0
45580  178743         1                 1
24647  167556         0                 0
46712  294101         0                 0
4150   192801         0                 0
42460  163983         1                 1
29079  298830         0                 0
19412  151790         1                 1
34839   20200         1                 1
...       ...       ...               ...
3380   348989         0                 0
37623  248593         1                 0
24852  316086         1                 1
29372  192540         1                 1
49639  256833         0                 0
2930   273991         1                 1
1210   365937         0                 0
22652  295129         0                 0
32360  255715         1                 0
9171    37332         0                 1
49037  164886         1                 0
17793  309967         0                 0
28887   14907         0                 0
567    244737         1                 1
662    284862         0                 0
46038   60719         1                 1
16778  262103         1                 0
3075   243679         1                 1
34793  280000         1                 1
6557   255074         0                 0
19150  347521         0                 0
40096  335029         1                 0
7869    37271         1                 1
49546  240006         1                 1
45202  279449         0                 1
25091  143036         1                 1
27853   91158         1                 1
47278  248318         0                 0
37020  142418         1                 1
2217   279355         1                 0

[10000 rows x 3 columns]
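
A table like this can be filtered to get the users the company may want to act on. The following is a minimal sketch, assuming a five-row sample copied from the output above and assuming that users predicted as 0 (not expected to enroll) are the ones considered for an offer; the names `final_result_sample`, `offer_candidates`, and `misclassified` are hypothetical.

```python
import pandas as pd

# A few rows copied from the output above
final_result_sample = pd.DataFrame(
    {
        "user": [239786, 98290, 237568, 65042, 37332],
        "enrolled": [1, 0, 1, 1, 0],
        "predicted result": [1, 0, 0, 0, 1],
    },
    index=[11841, 45519, 42642, 31902, 9171],
)

# Users the model expects not to enroll (assumed offer candidates)
offer_candidates = final_result_sample[final_result_sample["predicted result"] == 0]

# Rows where the prediction differs from the actual target
misclassified = final_result_sample[
    final_result_sample["enrolled"] != final_result_sample["predicted result"]
]

print("Offer candidates:", offer_candidates["user"].tolist())
print("Misclassified users:", misclassified["user"].tolist())
```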

Save the Machine Learning model

After building the ML model, we need to deploy it in an application. To deploy the ML model, we need to save it first. To save the model, we can use the pickle or joblib package.

Here we will see both ways. Use whichever works better for you.

Save the ML model with Pickle

## Pickle
import pickle

# save model
with open('FineTech_app_ML_model.pickle', 'wb') as f:
    pickle.dump(xgb_model_pt2, f)

# load model
with open('FineTech_app_ML_model.pickle', 'rb') as f:
    ml_model_pl = pickle.load(f)

# predict the output with the loaded model
y_pred_pl = ml_model_pl.predict(X_test_sc)

# confusion matrix
cm_pl = confusion_matrix(y_test, y_pred_pl)
print('Confusion matrix = \n', cm_pl)

# show the accuracy
print("Accuracy of model = ",accuracy_score(y_test, y_pred_pl))

Output >>>

Confusion matrix = 
 [[4156  916]
 [1197 3731]]

Accuracy of model =  0.7887

Save the ML model with Joblib

## Joblib
# sklearn.externals.joblib was removed from recent scikit-learn versions; import joblib directly
import joblib

# save model
joblib.dump(xgb_model_pt2, 'FineTech_app_ML_model.joblib')

# load model
ml_model_jl = joblib.load('FineTech_app_ML_model.joblib')

# predict the output with the loaded model
y_pred_jl = ml_model_jl.predict(X_test_sc)

# confusion matrix
cm_jl = confusion_matrix(y_test, y_pred_jl)
print('Confusion matrix = \n', cm_jl)

# show the accuracy
print("Accuracy of model = ", accuracy_score(y_test, y_pred_jl))

Output >>>

Confusion matrix = 
 [[4156  916]
 [1197 3731]]

Accuracy of model =  0.7887

Note: When we dump the model, the model file is stored on disk in the same folder as the project file, but we can change the location by passing a full path instead of just a file name.
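
As a minimal sketch of saving to a chosen location, the example below writes a model into a separate folder and loads it back. Everything in it is an assumption for illustration: a small `LogisticRegression` on synthetic data stands in for the XGBoost model, a temporary directory stands in for the real target folder, and the names `model_dir`, `model_path`, and `demo_model` are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model (assumption); in the project this would be xgb_model_pt2
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical target folder; replace with any path on your disk
    model_dir = Path(tmp_dir) / "models"
    model_dir.mkdir(parents=True, exist_ok=True)
    model_path = model_dir / "demo_model.joblib"

    joblib.dump(demo_model, model_path)  # save to the chosen path
    reloaded = joblib.load(model_path)   # load from the same path

    # The reloaded model should predict exactly like the original one
    same_predictions = np.array_equal(demo_model.predict(X_demo), reloaded.predict(X_demo))

print("Reloaded model gives identical predictions:", same_predictions)
```

When the model was trained on scaled data, it is usually worth saving the fitted scaler the same way, so that new data can be transformed identically before prediction.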

Congratulations!

We have successfully completed the Machine Learning project with 78.87% accuracy, which is a good result for the ‘Directing Customers to Subscription Through Financial App Behavior Analysis’ project. Now we are ready to deploy our ML model in the fin-tech company's project.


Conclusion

To get the best accuracy, we trained several supervised classification algorithms, but you can try out just a few of the popular ones. After training all of them, we found that the SVC and XGBoost classifiers give higher accuracy than the rest, and we chose XGBoost.

As ML Engineers, we should retrain the deployed model from time to time to sustain its accuracy. We hope our efforts will bring more profit to the fin-tech company.

Please share your feedback and questions about this ML project so we can update it.

I hope you enjoyed this Machine Learning end-to-end project. Thank you. :-)
