Productizing Data Science Projects: Part 1 — API for Random Forest Regressor

Anshuman Lall · Published in Predmatic · Apr 23, 2018 · 6 min read


As data scientists, we write a lot of code to work with data. However, we do not necessarily write code to productize (software-ize) the output, and often rely on our organization’s IT or engineering team instead. That can be challenging because of competing priorities within the company. On the other hand, there are many generic (and apparently promising) tools on the market, but they are often expensive, have a steep learning curve, or require a long series of approvals from company executives. So we end up delivering results in a Jupyter notebook, a PPT deck, or some other visualization tool, which often falls short on ease of use for a business user. The data science project then becomes “interesting” but of little or no business value. If this scenario sounds familiar, you might find this article useful.

If data scientists were to deliver the value on their own (not through enterprise software or an engineering collaboration), there are a few options. One way to make your code available to the enterprise is to place it on a server (an EC2 instance, for example) and share it across users. However, this presents another challenge: not all users want to (or have time to) log in, handle data, and download results. To make that easier, one can build a web application on top of it, but that requires more engineering and, perhaps more importantly, maintenance effort.

I started exploring this topic (an easy and generic way to productize data science code for delivery) more deeply, with emphasis on a solution that:

  • Is Python based, because the majority of data scientists are probably familiar with Python
  • Requires little or no routine engineering/IT maintenance
  • Is fast and easy to work with (about 10–20 lines of additional code, excluding the front-end code)

The AWS serverless concept seems promising. I am summarizing my findings in this series of articles, along with relevant links, so readers can save some time searching the internet for this information.

In what follows, we will explore how to build a working web application in Python using AWS serverless capabilities. Before we begin, I would like to credit another post that discusses a similar topic (I highly recommend reading it): “Serverless Approximate Nearest Neighbors on AWS Lambda with Annoy and Chalice”. If you are interested in more details on serverless web applications, you can also go directly to the AWS page.

This series of articles will cover the following topics:

  • Getting Started with Credentials
  • Create a Basic API with Chalice
  • Creating a Machine Learning Algorithm API with Zappa
  • Front-End and Database, if you need one
  • User Management

Getting Started with Credentials

To use AWS, you will need an AWS account (there is a free tier) and your AWS credentials, which you can create here. Detailed information can be found here and here. In short, you will need an access key ID and a secret access key, which you can place in the credentials and config files as follows.

mkdir ~/.aws
nano ~/.aws/config
nano ~/.aws/credentials

Nano is the editor I am familiar with; Vim and Sublime Text are alternatives.
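For reference, the two files follow the standard AWS CLI format; with placeholder values they look roughly like this:

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-east-1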

Create a Basic API using Chalice

Let’s install Chalice and Boto3:

sudo pip install chalice
sudo pip install boto3

The following steps will create a new project and get you into the app directory:

chalice new-project myApp
cd myApp/
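chalice new-project generates a starter app.py with a single root route; it looks roughly like this (a sketch of the default template):

from chalice import Chalice

app = Chalice(app_name='myApp')

@app.route('/')
def index():
    # Default placeholder route created by chalice new-project
    return {'hello': 'world'}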

The following command deploys the app and gives you an endpoint. All the details can be found here; in the interest of brevity, I will not repeat them.

chalice deploy

Within the myApp directory, create a directory called chalicelib (Chalice packages code placed in a directory with this exact name) and place your code (myDataScienceCode.py) in it.

mkdir chalicelib
cd chalicelib
nano myDataScienceCode.py

myDataScienceCode.py will contain a function that takes parameters and returns the output. This is the main Python function whose output will be delivered through the API. The details can be found here.
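As a sketch of the wiring (run_model and the /predict route below are illustrative names, not part of Chalice), app.py can import the module from chalicelib and expose its function as an endpoint:

from chalice import Chalice
from chalicelib import myDataScienceCode

app = Chalice(app_name='myApp')

@app.route('/predict/{param}')
def predict(param):
    # run_model is a hypothetical function defined in myDataScienceCode.py
    result = myDataScienceCode.run_model(param)
    return {'result': result}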

However, a big drawback of this approach is that packages like scikit-learn currently cannot be readily used because of the 50 MB deployment size limit (a request to AWS to increase the limit to 100 MB, or an upgrade to Chalice’s capabilities, would make this much easier for everyone). There are workarounds worth exploring that can be used with Chalice, or with Chalice alternatives such as the Serverless framework and Zappa.

Creating a Machine Learning Algorithm API with Zappa

Let’s switch to Zappa, which is equally easy and follows the same approach (initialize, add your code with a connection to the data on S3, and deploy).

Step 1: Pre-requisites

Zappa requires a virtual environment.

# Create Virtual Environment
sudo pip install virtualenv
virtualenv venv
source venv/bin/activate
# Install libraries again
pip install pip==9.0.3 # Latest version gave me some error
pip install zappa numpy scipy pandas flask scikit-learn

Step 2: Data and ML Code

Data

To run an ML algorithm we need data. In a later part of the series, we will show how users can upload their own data, but for now let’s put the data in an S3 bucket. One way is to use the AWS CLI (this step doesn’t have to be done within the virtual environment).

conda install -c conda-forge awscli

Please note that S3 bucket names need to be globally unique.

aws s3 mb s3://yourBucketName
aws s3 cp myData.csv s3://yourBucketName

Detailed information on using S3 buckets through boto3 (i.e., programmatically from Python, which you may find more useful in the long run) can be found here.
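As a minimal boto3 sketch of the two CLI commands above (the bucket and file names are the same placeholders):

import boto3

s3 = boto3.client('s3')
# Equivalent of: aws s3 mb s3://yourBucketName
# (outside us-east-1, also pass CreateBucketConfiguration={'LocationConstraint': region})
s3.create_bucket(Bucket='yourBucketName')
# Equivalent of: aws s3 cp myData.csv s3://yourBucketName
s3.upload_file('myData.csv', 'yourBucketName', 'myData.csv')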

ML Code

A simple example of ML code is below; it doesn’t do much and is only used to demonstrate the concept (that is, use scikit-learn and return the results). The function takes the data file name (a CSV file on S3), the number of points in the testing period, the predictor column name, and the label column name as input, and returns the prediction accuracy in terms of MSE.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
from io import StringIO
import boto3

def MLAccuracy(dataFile, n, x, y):
    # Step 0: Read the data from S3
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='yourBucketName', Key=dataFile)
    body = obj['Body']
    csv_string = body.read().decode('utf-8')
    data = pd.read_csv(StringIO(csv_string))
    # Step 1: Prepare the data sets for regression modeling
    Y = data[y]
    X = data[x]
    Xtrain, Ytrain = pd.DataFrame(X[:-n]), pd.DataFrame(Y[:-n])
    Xtest, Ytest = pd.DataFrame(X[-n:]), pd.DataFrame(Y[-n:])
    Ytest = Ytest.reset_index().drop(['index'], axis=1)
    # Step 2: Call the ML algorithm
    clf = RandomForestRegressor(n_estimators=100)
    clf.fit(Xtrain, Ytrain[y])
    # Step 3: Use the fitted model to predict on the test set
    YPred = pd.DataFrame(clf.predict(Xtest), columns=['predicted'])
    # Step 4: Test accuracy on the test dataset (from Step 1)
    Ynew = pd.concat([Ytest, YPred], axis=1)
    MSE = mean_squared_error(Ynew[y], Ynew['predicted'],
                             multioutput='uniform_average')
    return MSE
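Before wiring this into an API, it can help to sanity-check the function locally (assuming your AWS credentials are configured and the myData.csv uploaded earlier has columns named x and y):

# Local sanity check against the file uploaded to S3 earlier
mse = MLAccuracy('myData.csv', 3, 'x', 'y')
print(mse)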

Step 3: Initiate Zappa and Deploy

# Initialize Zappa
zappa init

# Update your app.py file as below
from flask import Flask
import json
from MLAccuracy import MLAccuracy

app = Flask(__name__)

@app.route('/randomforest/<dataFilename>')
def rf(dataFilename):
    k = MLAccuracy(dataFilename, 3, 'x', 'y')
    a = {"My ML Accuracy is": k}
    return json.dumps(a)

if __name__ == '__main__':
    app.run()

# Add the following line to zappa_settings.json (very important)
"slim_handler": true

# Deploy
zappa deploy dev
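For context, zappa init generates a zappa_settings.json with one block per stage; a rough sketch (the values below are placeholders) with the slim_handler flag added to the dev stage looks like this:

{
    "dev": {
        "app_function": "app.app",
        "aws_region": "us-east-1",
        "project_name": "myApp",
        "runtime": "python3.6",
        "s3_bucket": "zappa-yourdeploymentbucket",
        "slim_handler": true
    }
}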

If this step is successful, you will get an endpoint you can call to run a machine learning algorithm on the data already located on S3. You can use the http command (from HTTPie) to get the results:

http https://yourAPI.execute-api.us-east-1.amazonaws.com/dev/randomforest/yourfilename
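If you prefer calling the endpoint from Python instead of the command line, a small sketch with the requests library (the URL and file name are the same placeholders as above) looks like this:

import requests

# Placeholder endpoint from the Zappa deployment above
url = "https://yourAPI.execute-api.us-east-1.amazonaws.com/dev/randomforest/yourfilename"
response = requests.get(url)
print(response.json())  # the JSON returned by the Flask route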

In summary, this API lets you run your DS/ML code with your desired input (yourfilename, in this case) and access it from anywhere. But you may want to add access control and make it more configurable before you start sharing results with your organization or users. We will explore these aspects in the next parts of this series:

  • Front-End and Database
  • User Management

Links and References

  1. https://aws.amazon.com/getting-started/serverless-web-app/
  2. https://medium.com/@kevin_yang/serverless-approximate-nearest-neighbors-on-aws-lambda-with-annoy-and-chalice-f50f6af59ed9
  3. https://docs.aws.amazon.com/cli/latest/userguide/cli-config-files.html
  4. https://github.com/aws/chalice
  5. https://aws.amazon.com/cli/
  6. https://anaconda.org/conda-forge/awscli
  7. http://boto3.readthedocs.io/en/latest/index.html
  8. https://www.zappa.io/
  9. https://serverless.com/blog/serverless-python-packaging/
