Create, Train and Deploy Multi Layer Perceptron (MLP) models using Amazon Redshift ML
Redshift ML uses
Amazon Redshift ML supports supervised learning, including regression, binary classification, multi-class classification, and unsupervised learning using K-Means. You can optionally specify XGBoost, MLP, and linear learner model types, which are supervised learning algorithms used for solving either classification or regression problems, and provide a significant increase in speed over traditional hyperparameter optimization techniques. Amazon Redshift ML also supports
In this blog post, we show you how to use Redshift ML to solve a binary classification problem using the Multi Layer Perceptron (MLP) algorithm, which explores different training objectives and chooses the best solution from the validation set.
A multilayer perceptron (MLP) is a deep learning method which deals with training multi-layer artificial neural networks, also called Deep Neural Networks. It is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. MLP is widely used for solving problems that require supervised learning as well as research into computational neuroscience and parallel distributed processing. It is also used for speech recognition, image recognition and machine translation.
With Redshift ML (powered by Amazon SageMaker Autopilot), MLP currently supports tabular data.
Solution Overview
To use the MLP algorithm, you need to provide inputs or columns representing dimensional values and also the label or target, which is the value you’re trying to predict.
With Redshift ML, you can use MLP on tabular data for regression, binary classification, or multiclass classification problems. What distinguishes MLP is that its output function can be linear or any other continuous function. It need not be a straight line, as it is in a general regression model.
In this solution, we use binary classification to detect fraud based on credit card transaction data. The difference between logistic regression and a perceptron is that logistic regression uses a logistic function, while a perceptron uses a step function. Using the multilayer perceptron model, machines can learn weight coefficients that help them classify inputs. The perceptron, a linear binary classifier, is effective at arranging and categorizing input data into different classes, which allows probability-based predictions and classification of items into multiple categories. Multilayer perceptrons have the advantage of learning non-linear models and the ability to train models in real time.
For this solution, we first ingest the data into Amazon Redshift and split it for model training and validation. We then use Redshift ML queries to create the model, and finally use the generated SQL function to predict fraudulent transactions.
Prerequisites
To get started, we need an Amazon Redshift cluster or an
For an introduction to Redshift ML and instructions on setting it up, see
To create a simple cluster with a default IAM role, see
Data Set Used
In this post, we use the
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
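The imbalance can be checked with simple arithmetic. The following Python sketch uses only the two counts quoted above; the variable names are illustrative.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

# Share of the positive (fraud) class, as a percentage of all transactions.
fraud_rate_pct = fraud_transactions / total_transactions * 100
legit_rate_pct = 100 - fraud_rate_pct

print(f"fraud share: {fraud_rate_pct:.4f}%")
print(f"legitimate share: {legit_rate_pct:.4f}%")
```

With so few positive rows, a model could be right on almost every row while still missing most frauds, so the share of correct predictions alone says little about fraud detection quality.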
It contains only numerical input variables which are the result of a
Here are sample records:
Prepare the data
Load the credit card dataset into Amazon Redshift using the following SQL. You can use the
Alternatively, we have provided a notebook that you can use to run all the SQL commands, which can be downloaded
To create the table, use the following command:
Load the data
To load data into Amazon Redshift, use the following COPY command:
Before creating the model, we want to divide our data into two sets by splitting 80% of the dataset for training and 20% for validation, which is a common practice in ML. The training data is input to the ML model to identify the best possible algorithm for the model. After the model is created, we use the validation data to validate the model accuracy.
So, in the ‘creditcardsfrauds’ table, we check the distribution of data based on the ‘txtime’ value and identify the cutoff for around 80% of the data to train the model.
With this, the highest txtime value comes to 120954 (based on the distribution of txtime's min, max, ranking by window function, and ceil(count(*)*0.80) values), so we use the transaction records with a ‘txtime’ value less than 120954 as training data. We then validate the accuracy of the model by checking whether it correctly identifies the fraudulent transactions when predicting the ‘class’ attribute on the remaining 20% of the data.
The 80% cutoff need not always be based on ordered time. Depending on the use case, the records can also be picked randomly.
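The cutoff logic can be sketched outside the database as well. The Python snippet below is a minimal illustration, not the query used in this post: it orders the time values, takes the value at rank ceil(count * 0.80), and sends rows strictly below that value to training. The helper names and the synthetic rows are assumptions made for the example.

```python
import math

def time_cutoff(txtimes, train_fraction=0.80):
    """Return the txtime value at rank ceil(n * train_fraction) when ordered by time."""
    if not txtimes:
        raise ValueError("txtimes must not be empty")
    ordered = sorted(txtimes)
    # 1-based rank of the last row that would fall inside the training share.
    rank = math.ceil(len(ordered) * train_fraction)
    return ordered[rank - 1]

def split_by_time(rows, cutoff, time_key="txtime"):
    """Rows strictly before the cutoff go to training, the rest to validation."""
    train = [r for r in rows if r[time_key] < cutoff]
    validation = [r for r in rows if r[time_key] >= cutoff]
    return train, validation

# Tiny synthetic example (not the real dataset).
rows = [{"txtime": t, "class": 0} for t in range(0, 100, 10)]
cutoff = time_cutoff([r["txtime"] for r in rows])
train, validation = split_by_time(rows, cutoff)
print(cutoff, len(train), len(validation))
```

Because the comparison is strict (less than the cutoff), the training share ends up at or slightly below 80%, which matches the "around 80%" wording above.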
Create a model in Redshift ML
Here, in the SETTINGS section of the command, you need to set an S3_BUCKET, which is used to export the data that is sent to SageMaker and to store model artifacts.
The S3_BUCKET setting is a required parameter of the command, whereas MAX_RUNTIME is an optional one that specifies the maximum amount of time to train. Its default value is 90 minutes (5400 seconds), but you can override it by specifying it explicitly in the command, as we have done here by setting it to 9600 seconds.
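The statement itself is not reproduced in this copy of the post. As a hedged sketch, the Python snippet below assembles a CREATE MODEL statement from values that do appear in this post: the model name, training query, target column, function name, IAM role, and model type from the model details, plus the two settings described above. The bucket name is a placeholder, and the helper build_create_model_sql is hypothetical, not part of any AWS library.

```python
def build_create_model_sql(bucket, max_runtime=9600):
    """Assemble a Redshift ML CREATE MODEL statement as a string (illustrative helper)."""
    if max_runtime <= 0:
        raise ValueError("max_runtime must be a positive number of seconds")
    return (
        "CREATE MODEL creditcardsfrauds_mlp\n"
        "FROM (SELECT * FROM creditcardsfrauds WHERE txtime < 120954)\n"
        "TARGET class\n"
        "FUNCTION creditcardsfrauds_mlp_fn\n"
        "IAM_ROLE default\n"
        "MODEL_TYPE MLP\n"
        "SETTINGS (\n"
        f"  S3_BUCKET '{bucket}',\n"      # required: staging area for data and artifacts
        f"  MAX_RUNTIME {int(max_runtime)}\n"  # optional: training time limit in seconds
        ");"
    )

# Placeholder bucket name; replace it with a bucket you own.
print(build_create_model_sql("<your-s3-bucket>"))
```

The printed statement can be pasted into a SQL client. Treat it as a starting point and check it against the Redshift CREATE MODEL documentation before running it.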
The preceding statement initiates an
You can use the
READY state within the max_runtime parameter you defined while creating the model.
To check the status of the model, use the following command:
We notice from the preceding table that the
To elaborate, the F1 score is the harmonic mean of precision and recall. It combines precision and recall into a single number using the following formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Here, precision means: of all positive predictions, how many are really positive?
Recall means: of all real positive cases, how many are predicted positive?
F1 scores can range from 0 to 1, with 1 representing a model that perfectly classifies each observation into the correct class and 0 representing a model that is unable to classify any observation into the correct class. So higher F1 scores are better.
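As a hedged illustration of how precision and recall combine into F1, the following Python sketch computes all three from confusion-matrix counts (true positives, false positives, false negatives). The counts are hypothetical and chosen only to show the arithmetic; they are not results from the model in this post.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Precision: of all positive predictions, how many are really positive?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real positive cases, how many are predicted positive?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Note that true negatives do not appear in the calculation, so the large number of legitimate transactions does not inflate the score.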
The following is the detailed tabular outcome for the preceding command after model training was done.
Model Name | creditcardsfrauds_mlp |
Schema Name | public |
Owner | redshiftml |
Creation Time | Sun, 25.09.2022 16:07:18 |
Model State | READY |
validation:binary_f_beta | 0.908864 |
Estimated Cost | 112.296925 |
TRAINING DATA: | . |
Query | SELECT * FROM CREDITCARDSFRAUDS WHERE TXTIME < 120954 |
Target Column | CLASS |
PARAMETERS: | . |
Model Type | mlp |
Problem Type | BinaryClassification |
Objective | F1 |
AutoML Job Name | redshiftml-20221118035728881011 |
Function Name | creditcardsfrauds_mlp_fn |
. | creditcardsfrauds_mlp_fn_prob |
Function Parameters | txtime v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 v21 v22 v23 v24 v25 v26 v27 v28 amount |
Function Parameter Types | int4 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 |
IAM Role | default |
S3 Bucket | redshift-ml-blog-mlp |
Max Runtime | 54000 |
Redshift ML now supports prediction probabilities for binary classification models. In a classification problem, each label for a given record can be associated with a probability that indicates how likely the record is to belong to that label. With probabilities available alongside the label, you can choose to use a classification result only when the model's confidence in the chosen label is above a certain threshold.
Prediction probabilities are calculated by default for binary classification models, and an additional function is created when the model is created, without affecting the performance of the ML model.
In the preceding output, you will notice that the prediction probabilities enhancement has added another function, named by appending the suffix _prob to the model function: ‘creditcardsfrauds_mlp_fn_prob’. You can use it to get prediction probabilities.
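One way to act on these probabilities is to accept the predicted label only when its probability clears a threshold. The sketch below assumes the label and its probability have already been fetched from the query result into Python. The function name, the 0.9 threshold, and the sample values are illustrative assumptions, and the exact shape of the value returned by creditcardsfrauds_mlp_fn_prob is not shown here.

```python
def accept_prediction(label, probability, threshold=0.9):
    """Return the label if the model is confident enough, else None (needs review)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return label if probability >= threshold else None

# Hypothetical (label, probability) pairs, not real model output.
samples = [(1, 0.99), (0, 0.97), (1, 0.62)]
for label, prob in samples:
    decision = accept_prediction(label, prob)
    # Compare against None explicitly, because the label 0 is falsy in Python.
    print(label, prob, "accepted" if decision is not None else "needs review")
```

The right threshold depends on the cost of a missed fraud compared with the cost of blocking a legitimate transaction, so it is a business decision rather than a fixed value.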
Additionally, you can check the
Model explainability helps to understand the cause of prediction by answering questions such as:
- Why did the model predict a negative outcome such as blocking of credit card when someone travels to a different country and withdraws a lot of money in different currency?
- How does the model make predictions? Credit card data can be put in a tabular format, and even though MLP involves a fully connected neural network of several layers, we can tell which input features contributed to the model output and by how much.
- Why did the model make an incorrect prediction? For example, why is the card blocked even though the transaction is legitimate?
- Which features have the largest influence on the behavior of the model? Is it just based upon the location where the credit card is swiped, or even the time of the day and unusual credit consumption that is influencing the prediction?
Run the following SQL command to retrieve the values from the explainability report:
In the preceding screenshot, we have selected only the column that projects Shapley values from the response returned by the
Model validation
Now let’s run the prediction query and validate the accuracy of the model on the validation dataset:
We can observe here that Redshift ML is able to identify 99.88 percent of the transactions correctly as fraudulent or non-fraudulent.
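The validation query is not reproduced in this copy of the post. The underlying check is the share of validation rows where the predicted label equals the actual class, which the Python sketch below illustrates. The function name and the eight sample labels are hypothetical and unrelated to the 99.88 percent figure above.

```python
def accuracy(actual, predicted):
    """Share of rows where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one row is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels for eight validation rows (1 = fraud, 0 = legitimate).
actual_class    = [0, 0, 0, 1, 0, 0, 1, 0]
predicted_class = [0, 0, 0, 1, 0, 0, 0, 0]
print(f"accuracy: {accuracy(actual_class, predicted_class):.2%}")
```

In this toy example one of the two frauds is missed while accuracy still looks high, which is why it helps to read accuracy together with the F1 score on a dataset this unbalanced.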
Now you can continue to use this SQL function, creditcardsfrauds_mlp_fn, for local inference in any part of a SQL query while analyzing, visualizing, or reporting on newly arriving as well as existing data.
Here the output 1 means that the newly captured transaction is fraudulent as per the inference.
Additionally, you can change the preceding query to include the prediction probabilities of the label output for this scenario and decide whether you still want to use the model's prediction.
The preceding screenshot shows that this transaction has a 100% likelihood of being fraudulent.
Clean up
To avoid incurring future charges, stop the Redshift cluster when it is not in use. If you ran the exercise in this blog post only for experimentation, you can terminate the cluster altogether. If you are using Amazon Redshift Serverless instead, you are not charged for compute while it is idle.
Conclusion
Redshift ML makes it easy for users of all levels to create, train, and tune models using a SQL interface. In this post, we walked you through how to use the MLP algorithm to create a binary classification model. You can then use such models to make predictions with simple SQL commands and gain valuable insights.
To learn more about Redshift ML, visit