
Startup Profit Prediction ML App - Step By Step Guide



Building An End To End Startup Profit Prediction ML Web App Using Streamlit

Table Of Contents

● Introduction

● Key Highlights

● Choosing The Dataset

● Loading Dependencies and Dataset

● EDA -Exploratory Data Analysis

● Preprocessing

● Model Development

● Model Deployment

Introduction

Let's say you want to start a company, and you are planning to invest a good chunk of money in it. In return, you want to be sure it's a good bet.

So you open Google and search for what the market demands, and you get a reasonable estimate of what you should invest. However, as the market is unpredictable, it is challenging to guess the profit you may earn.

Many startups face this, but fret not; this is where we can automate the process using machine learning and get pretty good predictions.

Key Highlights

You must have guessed this article is all about solving a real-world problem. However, here are a few caveats to keep in mind while reading and learning:

● End to End- The term End to End refers to the full process: getting a suitable dataset, loading it, EDA (Exploratory Data Analysis) and preprocessing, model development & evaluation, and finally, deployment and inference.

● Learn the key points- The key goal here is to understand how to approach a problem with a data scientist's mindset: apply domain knowledge first and follow the process through to the end.

● Practice is a must- It's suggested to run the code provided, instead of just reading the article, to get the most out of it.

● Separating projects- I highly encourage creating a separate folder and environment, which will help deploy our model later. Also, it's good practice to keep the CSV file/dataset in the created folder for easy access.

So with all this set, let's move forward.

Choosing The Dataset

Before anything, first, we are required to choose a suitable dataset. So let's discuss what we need.

Our dataset should have profit as the target column and the rest as contributors to that target. In simple terms, profit is our dependent variable, and the remaining columns are explanatory variables.

After browsing Kaggle, one relevant dataset was the 50 Startups dataset, which contains the profit of 50 startups across three US states (NY, CA, FL). So we are going to use this one for our use case.

Here is a breakdown of the dataset:

● Features- The features are the spending done on different company sectors and are R&D Spend, Administration, Marketing Spend, State.

● Target- Our required variable/aim — Profit.

Loading Dependencies & Dataset

Having learned about data, let's load it along with all the dependencies listed:

● Numpy: for performing mathematical calculations behind ML algorithms.

● Matplotib & Seaborn: for data visualization.

● Sklearn: for model development and evaluation.

● Pandas: for handling and cleaning the dataset


A simple import would do:

import numpy as np # for performing mathematical calculations behind ML algorithms
import matplotlib.pyplot as plt # for visualization
import pandas as pd # for handling and cleaning the dataset
import seaborn as sns # for visualization
import sklearn # for model evaluation and development

To load a dataset, we can use

pd.read_csv("path/to/file")

as:

# loading dataset
dataset = pd.read_csv(r'/content/drive/My Drive/Colab Notebooks/50_Startups.csv')
dataset.head()# displays first 5 rows of dataset
First 5 rows of the dataset

EDA -Exploratory Data Analysis

Loading data isn't enough on its own and doesn't convey meaningful insights. As the data science saying goes, "Torture the data, and it will confess." I am glad you will do the same under the name EDA.

In the simplest terms, EDA means exploring data: finding summary statistics and forming hypotheses that may or may not be accurate, generally through statistical graphics and data visualization methods (Data Analyst work!).

In this section, we rely on pandas, seaborn (matplotlib), and numpy for the work. Each subsection below comprises a single step of the process.

1. Getting Summary Statistics

Often it's excellent to understand the measure of central tendency (where the centre of data lies) & measure of dispersion (how broad spread data is). These help us understand the quality and distribution of data.

The above can be achieved using df.describe() method of pandas, which returns the following:

dataset.describe()
Descriptive statistics of Dataset

Describing

● Count- Number of observations/samples in each column; if not consistent across columns, something is wrong with the dataset (usually missing values).

● Mean- Average of each column of the dataset.

● Std/Standard Deviation- How widespread data points/observation values are around the data's mean.

● Min, 25%, 50%, 75%, Max- The 0th, 25th, 50th, 75th, and 100th percentiles of each variable, often called the five-number summary.
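To make the five-number summary concrete, here is a small standard-library sketch on hypothetical profit values (not the actual dataset); `method='inclusive'` matches the linear interpolation pandas uses in describe():

```python
import statistics

# hypothetical "Profit" values for illustration
profit = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# quartile cut points (Q1, median, Q3), interpolated like pandas' describe()
q1, median, q3 = statistics.quantiles(profit, n=4, method='inclusive')
five_number = (min(profit), q1, median, q3, max(profit))
print(five_number)  # (10, 32.5, 55.0, 77.5, 100)
```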

2. Checking Duplicates

The data may have some duplicates present, which can bias the model. Hence, we must check for them, which can be quickly done by chaining the df.duplicated() and sum() methods.

print('There are',dataset.duplicated().sum(),'duplicate values in the dateset.') #using duplicated() pre-defined function
No Duplicates

Note: This chaining pattern is pretty standard, and you will see it a lot in upcoming sections. Also, df refers to the dataset/data frame (native to pandas).

3. Check For Nulls

Next, we will check for null values, as they may hamper the performance of the model trained on this data later. The function for this is df.isnull().sum(), which returns the number of null values in each column.

dataset.isnull().sum()

No Nulls

Here, no null values are present. However, if any were present, you could drop them using the df.dropna() method.

4. Dtype-Evaluation

Dtype stands for data type and restricts which operations can be performed on a column, so let's evaluate the types using the df.info() method.

dataset.info()
Data types of all columns in the dataset

As per the results, State is of the string/object data type in pandas; such columns cannot be used directly as numeric contributors to the target variable. Let's visualize the numeric variables.

5. Correlation Matrix With Heat Map

One can think of a correlation matrix as a table that summarises the correlation between each pair of features using a metric called the correlation coefficient; a heatmap is a visual representation where the strength of each correlation is displayed using colors.

The standard way to create the plot :

c = dataset.corr() # corr inbuilt fn
sns.heatmap(c,annot=True,cmap='Blues')
plt.show()
Correlation Heatmap for the Dataset

The above is achieved by combining df.corr() and sns.heatmap(). The further arguments passed are annot=True to enable annotations & cmap='Blues' to set the color map/palette.

Note: State is a categorical variable, so it doesn't get displayed; that's expected, as we haven't encoded it yet.
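For intuition, each heatmap cell is a Pearson correlation coefficient. Here is a plain-Python sketch of the formula behind df.corr(), on toy numbers rather than the dataset:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient - the value shown in each heatmap cell."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# perfectly linear toy data -> r = 1.0
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear association.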

6. Outlier Detection

An outlier is a data point/value that differs significantly from the rest of the data. Outliers can do serious damage when training the model, so it is good practice to detect and treat them (by removing or imputing with the mean).

Outliers can be detected using a box plot provided by seaborn.


outliers = ['Profit']
plt.rcParams['figure.figsize'] = [8,8]
sns.boxplot(data=dataset[outliers], orient="v", palette="Set2" , width=0.7) 
plt.title("Outliers Variable Distribution")
plt.ylabel("Profit Range")
plt.xlabel("Continuous Variable")

plt.show() 

To understand, let's break down the above code:

● At the start, we define a list called outliers containing the column name Profit.

● Next, we change the figure size to 8 * 8 using plt.rcParams['figure.figsize'].

● Then, we draw the box plot by calling sns.boxplot() with orient='v' (vertical; 'h' for horizontal), palette='Set2' (the color palette), and width=0.7 (box width, between 0 and 1).

● Lastly, we set the title and axis labels and show the plot.

Profit Vs. Profit Range Outlier Detection

There are some outliers, but it's not clear which state they come from, so let's plot the box plot per state. To do so, we only need to pass the State column, and seaborn is smart enough to figure out the categories it has!

sns.boxplot(x = 'State', y = 'Profit', data = dataset)
plt.show()
State vs. Profit Outlier Detection

Now we have the whole picture. The outlier comes from the New York category, and it needs to be treated.

Note: State is a categorical variable. It’s good to plot each category (NY, CA, FL) respective to profit.
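The box plot flags points outside the whiskers. Underneath, that is the 1.5 * IQR rule; here is a standard-library sketch of it on toy profit values (illustrative numbers only):

```python
import statistics

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

profits = [90, 95, 100, 105, 110, 400]  # toy data with one extreme value
print(iqr_outliers(profits))  # [400]
```

A value flagged this way would then be removed or imputed, as discussed above.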

7. Understanding Distribution

Distribution refers to the overall shape of the data. It is a great way to understand many properties of the data. We can use sns.distplot to visualize the distribution (deprecated in recent seaborn versions in favor of sns.histplot with kde=True).

Under the hood, it creates a histogram and uses KDE & RUG plots to carve a curve representing the datasets.

sns.distplot(dataset['Profit'],bins=5,kde=True)
plt.show()
Distribution Plot

Great! We have a Gaussian distribution (a bell-shaped curve) here, which is excellent. The dataset is symmetric around the mean, with approximately equal mean, median, and mode.
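A quick standard-library check of that symmetry property on toy numbers: for a symmetric, Gaussian-like sample the mean and median coincide, while a skewed sample pulls the mean away from the median:

```python
import statistics

symmetric = [70, 85, 100, 115, 130]  # bell-like, symmetric around 100
skewed = [70, 85, 100, 115, 230]     # one large value drags the mean up

print(statistics.mean(symmetric), statistics.median(symmetric))  # 100 100
print(statistics.mean(skewed), statistics.median(skewed))        # 120 100
```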

Section Summary

According to the analysis, our data is well distributed and has no null or duplicate values. Also, there are a few outliers that need to be treated.

Preprocessing

Though we have a reasonably good dataset, it isn't ready for model training yet. As we all know, "computers only understand numbers," and that is not yet the case here.

So this section provides insight on how to create the dataset(features and labels), convert it into computer understood format & split the dataset for training and testing purposes.

1. Creating Features And Labels

If you recall the dataset selection section, the data divides into two parts: features and a target. First, we need to split our dataset accordingly.

Looking closely, the features are all columns except Profit, and we can split the data using df.iloc, which returns a sliced array according to the column index (similar to list slicing).

# splitting dataset into independent (features) & dependent (target) variables
X = dataset.iloc[:, :-1].values # all columns except Profit
y = dataset.iloc[:, 4].values # the Profit column
print(X)
Values of Split

Note: .values ensures we get the raw values of each feature as a NumPy array instead of a DataFrame.

2. Encoding Categories/Labels

Fantastic! We have the required split. As we have seen before, our State column is a categorical variable (string/object). We must encode it into numbers (integers). One simple way is to encode the categories as numbered references.

We can use the LabelEncoder from sklearn for the same.

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
X1 = pd.DataFrame(X)
X1.head()
Encoded Class

Note: 0,1,2 are numeric encoded labels for state category.
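For intuition, LabelEncoder essentially sorts the unique categories and maps each to its index. A standard-library approximation (not sklearn's actual implementation):

```python
def label_encode(values):
    """Map each category to the index of its sorted unique value,
    mimicking what sklearn's LabelEncoder produces."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values]

states = ['New York', 'California', 'Florida', 'New York']
print(label_encode(states))  # [2, 0, 1, 2]: California->0, Florida->1, New York->2
```

This also explains why the encoded labels are 0, 1, 2: the states are indexed in alphabetical order.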

3. Train Test Split

So we have our data with the format as numbers, but still, there is one piece missing, our data separation for train and test.

To those wondering why this is necessary as a procedure, if we train the model with the entire dataset, there will be no way to evaluate the model's performance.

As a general step, we train on some portion of the data and evaluate on the held-out portion to keep our evaluation unbiased. To perform such a split, we use the train_test_split() function.

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=0) # performs the split
print(x_train)

To readers, here is the breakdown of the split:

X,y — actual data(features and labels)

train_size — defines the training size here 70% of data.

(x_train,x_test),(y_train,y_test) — Train and Test data features, Train and Test data Labels

X_Train Result
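Conceptually, the split just shuffles row indices reproducibly and slices them. A simplified standard-library sketch (sklearn's internal shuffling differs, so the exact rows selected won't match):

```python
import random

def simple_train_test_split(X, y, train_size=0.7, random_state=0):
    """Shuffle indices with a fixed seed, then slice - the essence of
    sklearn's train_test_split."""
    idx = list(range(len(X)))
    random.Random(random_state).shuffle(idx)
    cut = int(len(X) * train_size)
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])

X = [[i] for i in range(10)]
y = list(range(10))
x_tr, x_te, y_tr, y_te = simple_train_test_split(X, y)
print(len(x_tr), len(x_te))  # 7 3
```

Fixing random_state makes the split reproducible, which is why the article passes random_state=0.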

Section Summary

At the end of this section, we have a well-split train and test set, comprising numeric features and labels.

Model Development

Let's shift to model development: the process of finding the right model architecture, creating and training it, and evaluating the results. If required, we re-iterate the entire process.

1. Selecting Model Architecture & Training

As per the EDA done earlier, our target follows a Gaussian distribution, and the features correlate roughly linearly with profit. Hence, a straightforward technique called linear regression is appropriate here.

Linear Regression In a Nutshell

In a nutshell, the technique tries to fit a straight line through the data points such that the error between the actual and predicted values is minimized.
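For intuition, here is the closed-form least-squares fit for a single feature, in plain Python on toy numbers (sklearn's LinearRegression solves the multi-feature analogue of this):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept so the line passes through the means
    return a, b

# toy spend -> profit data lying exactly on profit = 1 + 2*spend
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```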

Luckily for us, the algorithm is provided by sklearn.linear_model, and we are going to use it by first defining the model as LinearRegression and then calling the model.fit() method.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train,y_train)
print('Model has been trained successfully')
>> Model has been trained successfully

2. Predicting Results

So as our model is ready, let's now check the predictions. We perform it by passing only x_test to the model.predict() function; why not x_train? (answer me in the comments)

y_pred = model.predict(x_test)
print(y_pred)
Predicted Data

3. Evaluating Performance Based On Model Score

Whenever we perform some calculation, we cross-check the results against the actuals. The same goes for data science, in a process called model evaluation.

One of the simplest ways to evaluate the performance is by getting a score between 0 & 1, for which model.score() functionality is pretty helpful.

testing_data_model_score = model.score(x_test, y_test)
print("Model Score/Performance on Testing data",testing_data_model_score)

training_data_model_score = model.score(x_train, y_train)
print("Model Score/Performance on Training data",training_data_model_score)
Model Score

Note: we pass both features and labels this time, because score() computes predictions internally and compares them with the labels.

Our model is 93% accurate on test data and 95% on train data. A test score close to the train score is generally considered a good sign.

4. Evaluating Performance Based On Metrics

Another way is to use evaluation metrics such as MAE, MSE, RMSE, and R²:

● MAE- Mean Absolute Error, the average absolute difference between predictions and actual values; it does not penalize significant errors heavily.

● MSE — Mean Squared Error is a statistical measure that captures the error rate of the regression line fit. It penalizes the high errors but results in square units.

● RMSE — Square root of MSE, which restores the same measurement units as the target for better interpretability. (I consider it the best!)

● R²- The coefficient of determination, a statistical measure that describes the strength of association between two numerical variables (here, predictions and actual values).
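Before reaching for sklearn, here are the four metrics written out in plain Python on toy numbers. Note that R², unlike MAE and MSE, is not symmetric in its arguments, so the order (actual, predicted) matters:

```python
import math

def mae(actual, pred):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mse(actual, pred):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root Mean Squared Error - back in the target's units."""
    return math.sqrt(mse(actual, pred))

def r2(actual, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100, 200, 300]
pred = [110, 190, 310]  # each prediction off by 10
print(mae(actual, pred), rmse(actual, pred))  # 10.0 10.0
```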

Now off to implementation! All the metrics are provided in the sklearn.metrics module, and all scores can be obtained following the same pattern as the previous section.

a. MAE

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error is :", mae)
MAE Score

The MAE of about 6603 means predictions deviate from the actual profits by roughly that amount on average, which is expected.

b. MSE

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error is :", mse)
MSE Score

Don't panic. MSE is expected to be much larger than MAE, since we squared the errors.

c. RMSE

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error is : ", rmse)
RMSE Score

Intuitively, the values returned by RMSE are smaller than MSE, as it takes the square root. Still, these raw error magnitudes are hard to interpret on their own, so we need our last metric, which produces a single normalized value.

d.R²

from sklearn.metrics import r2_score

r2Score = r2_score(y_test, y_pred) # note: argument order matters for R²
print("R2 score of model is :", r2Score*100)
# multiplying by 100 ensures we get a result between -100 and +100, instead of -1 to +1
R²/r² Score

Pretty close to 1, which means a strong positive linear relationship, or in simple words, a good line fit (good model performance!).

5. Confirming Hypothesis

Often the scores are not self-explanatory, so visualization is required. We can confirm our hypothesis by plotting the actual vs. predicted values together with the fitted regression line using a regression plot.

df = pd.DataFrame(data={'Predicted value': y_pred.flatten(), 'Actual Value': y_test.flatten()})
print(df.head())

This step flattens the actual and predicted arrays into one-dimensional form so they can be passed to the x and y parameters of sns.regplot.

plt.title('Actual vs Predicted')
plt.xlabel('Total cost')
plt.ylabel('Profit')
sns.regplot(x=y_test, y=y_pred, data=df)
Regplot — Total Cost Vs. Profit

As can be seen, the line fit is performing well. Congrats on making it through the first half of the article.

Section Summary

Finally, we have a good linear regression model with a decent line fit, as evident from scores, metrics, and graphs.

Model Deployment

Now that we have our final model up and running, users need a way to use it. One of the best ways is to deploy the model in the cloud as a web app, which involves two parts: writing the backend and adding UI elements.

To make our life simpler, we will be using Streamlit, a go-to library for both backend and frontend development, thanks to its extensive support for Python and HTML-embedded widgets.

So let's get started.

1. Creating The Frontend

For the frontend, the library offers embedded HTML and widget support that we can leverage to create buttons, fields, image anchors, titles, and more.

So let's set the stage for the user in main.py:

Title

import streamlit as st

string = "Startup's Profit Prediction"
# set up page config - dynamic web page
st.set_page_config(page_title=string, page_icon="✅", layout="centered", initial_sidebar_state="auto", menu_items=None)
# st.title is a widget element
st.title(string, anchor=None)

Image

from PIL import Image 
image = Image.open('startup.png') #load image
st.image(image) # st.image - image widget/placeholder

Input Fields

# st.sidebar.number_input - creates a number input field in the sidebar
rnd_cost = st.sidebar.number_input('Insert R&D Spend')
st.write('The current number is ', rnd_cost) # main page display
Administration_cost = st.sidebar.number_input('Insert Administration cost Spend')
st.write('The current number is ', Administration_cost)
Marketing_cost_Spend = st.sidebar.number_input('Insert Marketing cost Spend')
st.write('The current number is ', Marketing_cost_Spend)
# gives a dropdown menu in the sidebar
option = st.sidebar.selectbox(
    'Select the region',
    ('Delhi', 'Banglore', 'Pune'))
st.write('You selected:', option)

Adding Plot For Better Understanding — optional

One can add a bar chart to visualize the results.

fig = plt.figure()
# defining our values
X = ['Total cost spend']
x_value = [rnd_cost + Administration_cost + Marketing_cost_Spend]
# creating evenly spaced integers between 0 and the length of X
X_axis = np.arange(len(X))
# configuring the plot by adding bar-chart data, axis ticks, and labels
plt.bar(X_axis - 0.2, x_value, 0.4, label='cost')
plt.bar(X_axis + 0.2, y_pred, 0.4, label='profit')
plt.xticks(X_axis, X)
plt.xlabel("RS")
plt.title("Profit vs Total cost spend")
plt.legend()
# display figure as a figure widget
st.pyplot(fig)

So our UI will look something like this.

Result Of UI Code

We will add the backend as our next step to make our app fully functional.

2. Writing The Backend

Here the backend refers to building the model and handling the preprocessing (label-encoding) step. We load the data, train the model, and validate the inputs here. This way, we have more granular control over versioning the model. — main.py continued

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns 
import sklearn 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
dataset = pd.read_csv("50_Startups.csv")
# spliting Dataset in Dependent & Independent Variables
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
x_train,x_test,y_train,y_test=train_test_split(X,y,train_size=0.7,random_state=0) 
model = LinearRegression()
model.fit(x_train,y_train)

Yes, you thought right! It's the model development code again (we could have loaded a saved model, but this approach gives more granular control).

Now let's add some preprocessing functionality using simple if-else conditions. Since our UI offers Indian cities, let's map each of them to an encoded value.

if option == "Pune":
    optn = 0
if option == "Banglore":
    optn = 1
if option == "Delhi":
    optn = 2
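The if-else chain above can also be written as a dict lookup, an equivalent and more compact pattern (same mapping; option here is hard-coded for illustration, whereas in the app it comes from the selectbox):

```python
state_codes = {"Pune": 0, "Banglore": 1, "Delhi": 2}

option = "Delhi"  # in the app, this comes from st.sidebar.selectbox
optn = state_codes[option]
print(optn)  # 2
```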

Finally, add the prediction and data validation part.

y_pred = model.predict([[rnd_cost, Administration_cost, Marketing_cost_Spend, optn]])
if st.button('Predict'):
    st.success('The Profit must be {} '.format(y_pred))
else:
    st.write('Please fill all the important details')

In the above code, we pass the inputs as an array in the same column order the model was trained on. When the Predict button is clicked, we return the profit; otherwise, we ask the user to fill in all values (data validation to avoid errors).

To view the complete code, kindly visit this link or this one.

3. Hosting On Cloud- Additional Requirements

Streamlit also allows hosting ML apps in the cloud for free. Of course, you need to sign up. However, it requires a few extra pieces, so let's set them up.

Requirements.txt

This file contains all the libraries essential for the model function- remember the environment creation stated at the start of the article. We can use the below line of code to generate the file.

pip freeze > requirements.txt

You will get a new file something like this:

requirements.txt

Repo Creation

To let Streamlit fetch the files, one needs to upload them to a version control system, here GitHub. Creating a repo is pretty simple; look it up on YouTube if needed.

Now upload the following files:

Files to be uploaded

Note:- startup.png can be changed with other images too!

Setting Up The App

After signing up, visit https://share.streamlit.io/, click on the new app button and fill in the details:

● Repo Name: the name of the GitHub repo the file is hosted on.

● Branch: main/master, or another branch hosting the deployed file.

● Main File Path: Name of the web app file, here main.py

● Click Deploy Button

After a few minutes, your app is ready to be tested.

The web app

Let's test: for an R&D Spend of 2000, Administration cost of 5000, and Marketing Spend of 1000, we get a profit of approx 43512.

Predicted output

Conclusion

This concludes our article on a step-by-step approach for developing and deploying ML apps from scratch. I hope you found your read interesting and implemented your app version.


By: Purnendu Shukla
