Sanchit Dilip Jain/Amazon Redshift ML - Overview

Created Mon, 15 Apr 2024 12:00:00 +0000 Modified Sun, 12 May 2024 01:47:18 +0000
1669 Words 8 min

Amazon Redshift ML - Overview

Introduction

  • Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply machine learning models using familiar SQL commands in Amazon Redshift data warehouses.

  • With Redshift ML, you can take advantage of Amazon SageMaker, a fully managed machine learning service, without learning new tools or languages. Simply use SQL statements to create and train Amazon SageMaker machine learning models using your Redshift data and then use these models to make predictions.

  • Imagine using customer retention data in Redshift to train a churn detection model. Then, applying that model to your dashboards, empowering your marketing team to offer incentives to customers at risk of churning. This is just one practical example of how Redshift ML can transform your work.

  • Redshift ML makes the model available as an SQL function within your Redshift data warehouse so you can easily apply it directly in your queries and reports.

    • No prior ML experience needed
    • Use ML on your Redshift data using standard SQL
    • Predictive analytics with Amazon Redshift
    • Bring your own model (BYOM)

Demo

1. Prerequisites

  • Objective: In this demo, we will set up the prerequisites by provisioning an AWS CloudFormation stack.

  • Below are the steps to provision the AWS CloudFormation stack:

    • Download the CloudFormation template from this URL
    • Visit the AWS CloudFormation console, upload the CloudFormation template, and click Launch Stack.
    • The CloudFormation stack creates the resources required for the demo. Check the CloudFormation console and wait for the status CREATE_COMPLETE as shown below

2. Load data

  • On the Redshift Serverless dashboard, navigate to the namespace whose name starts with zero-etl-destination-*.

  • Choose Query data to open Query Editor v2.

  • Click on the namespace name to open the connection dialog

  • Connect to the Redshift Serverless data warehouse by choosing Create connection.

  • Let's create and load data into a transaction history table with the following SQL commands:

    CREATE TABLE cust_payment_tx_history
    (
        TRANSACTION_ID integer,
        TX_DATETIME timestamp,
        CUSTOMER_ID integer,
        TERMINAL_ID integer,
        TX_AMOUNT decimal(9,2),
        TX_TIME_SECONDS integer,
        TX_TIME_DAYS integer,
        TX_FRAUD integer,
        TX_FRAUD_SCENARIO integer,
        TX_DURING_WEEKEND integer,
        TX_DURING_NIGHT integer,
        CUSTOMER_ID_NB_TX_1DAY_WINDOW decimal(9,2),
        CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW decimal(9,2),
        CUSTOMER_ID_NB_TX_7DAY_WINDOW decimal(9,2),
        CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW decimal(9,2),
        CUSTOMER_ID_NB_TX_30DAY_WINDOW decimal(9,2),
        CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW decimal(9,2),
        TERMINAL_ID_NB_TX_1DAY_WINDOW decimal(9,2),
        TERMINAL_ID_RISK_1DAY_WINDOW decimal(9,2),
        TERMINAL_ID_NB_TX_7DAY_WINDOW decimal(9,2),
        TERMINAL_ID_RISK_7DAY_WINDOW decimal(9,2),
        TERMINAL_ID_NB_TX_30DAY_WINDOW decimal(9,2),
        TERMINAL_ID_RISK_30DAY_WINDOW decimal(9,2)
    );
    
    COPY cust_payment_tx_history
    FROM 's3://redshift-demos/ri2023/ant307/data/cust_payment_tx_history/'
    IAM_ROLE default
    PARQUET;
    
  • Now check how many transactions have been loaded:

    SELECT count(1) FROM cust_payment_tx_history;
    
  • Then check the monthly trend of fraud and non-fraud transactions:

    SELECT 
        to_char(tx_datetime, 'YYYYMM') AS YearMonth,
        sum(case when tx_fraud=1 then 1 else 0 end) AS fraud_tx,
        sum(case when tx_fraud=0 then 1 else 0 end) AS non_fraud_tx,
        count(*) AS total_tx
    FROM cust_payment_tx_history
    GROUP BY YearMonth;
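To make the aggregation concrete, here is a minimal Python sketch of what the query computes, using a few hypothetical sample transactions:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample transactions: (tx_datetime, tx_fraud)
transactions = [
    (datetime(2022, 6, 3), 0),
    (datetime(2022, 6, 15), 1),
    (datetime(2022, 7, 1), 0),
    (datetime(2022, 7, 9), 0),
]

# Group by YYYYMM and count fraud vs. non-fraud, like the SQL above
trend = defaultdict(lambda: {"fraud_tx": 0, "non_fraud_tx": 0, "total_tx": 0})
for tx_datetime, tx_fraud in transactions:
    month = tx_datetime.strftime("%Y%m")
    trend[month]["fraud_tx"] += 1 if tx_fraud == 1 else 0
    trend[month]["non_fraud_tx"] += 1 if tx_fraud == 0 else 0
    trend[month]["total_tx"] += 1

print(dict(trend))
```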
    
  • Amazon Redshift ML enables you to train models with a single SQL CREATE MODEL statement, which creates a model that Amazon Redshift uses to generate model-based predictions with familiar SQL constructs.

  • In this demo, we will use CREATE MODEL to train and test a model on our historical card transaction table. We split this data into training and test datasets: the WHERE clause below selects transactions from 2022-06-01 through 2022-09-30 as the training set, and later transactions are held out for testing.

    CREATE MODEL cust_cc_txn_fd
    FROM (
        SELECT 
            TX_AMOUNT ,
            TX_FRAUD ,
            TX_DURING_WEEKEND ,
            TX_DURING_NIGHT ,
            CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
            CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
            CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
            TERMINAL_ID_NB_TX_1DAY_WINDOW ,
            TERMINAL_ID_RISK_1DAY_WINDOW ,
            TERMINAL_ID_NB_TX_7DAY_WINDOW ,
            TERMINAL_ID_RISK_7DAY_WINDOW ,
            TERMINAL_ID_NB_TX_30DAY_WINDOW ,
            TERMINAL_ID_RISK_30DAY_WINDOW
        FROM cust_payment_tx_history
        WHERE cast(tx_datetime as date) BETWEEN '2022-06-01' AND '2022-09-30'
    ) 
    TARGET tx_fraud
    FUNCTION fn_customer_cc_fd
    IAM_ROLE default
    AUTO OFF
    MODEL_TYPE XGBOOST
    OBJECTIVE 'binary:logistic'
    PREPROCESSORS 'none'
    HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100')
    SETTINGS (
        S3_BUCKET '<replace this with your s3 bucket name>',
        S3_GARBAGE_COLLECT off,
        MAX_RUNTIME 1200
    );
    
  • The ML model is called cust_cc_txn_fd, and the prediction function is fn_customer_cc_fd. The FROM clause shows the input columns from the historical table public.cust_payment_tx_history.

  • The TARGET parameter is set to tx_fraud, the target variable that we're trying to predict. IAM_ROLE is set to default because the cluster is configured with that role; otherwise, you must provide your Amazon Redshift cluster's IAM role ARN.

  • During training, “Model State” is set to “TRAINING.” When “Model State” shows as “READY,” it means the model is trained and deployed.

    SHOW MODEL cust_cc_txn_fd;
    
  • Let’s test this model with the test dataset. Run the following command, which retrieves sample predictions:

    SELECT
        tx_fraud ,
        fn_customer_cc_fd(
            TX_AMOUNT ,
            TX_DURING_WEEKEND ,
            TX_DURING_NIGHT ,
            CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
            CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
            CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
            TERMINAL_ID_NB_TX_1DAY_WINDOW ,
            TERMINAL_ID_RISK_1DAY_WINDOW ,
            TERMINAL_ID_NB_TX_7DAY_WINDOW ,
            TERMINAL_ID_RISK_7DAY_WINDOW ,
            TERMINAL_ID_NB_TX_30DAY_WINDOW ,
            TERMINAL_ID_RISK_30DAY_WINDOW
        )
    FROM cust_payment_tx_history
    WHERE cast(tx_datetime as date) >= '2022-10-01'
    LIMIT 10;
    

  • Some predicted values match the actual labels and some do not. Let's compare predictions to the ground truth:

    SELECT
        tx_fraud ,
        fn_customer_cc_fd(
            TX_AMOUNT ,
            TX_DURING_WEEKEND ,
            TX_DURING_NIGHT ,
            CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
            CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
            CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
            CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
            TERMINAL_ID_NB_TX_1DAY_WINDOW ,
            TERMINAL_ID_RISK_1DAY_WINDOW ,
            TERMINAL_ID_NB_TX_7DAY_WINDOW ,
            TERMINAL_ID_RISK_7DAY_WINDOW ,
            TERMINAL_ID_NB_TX_30DAY_WINDOW ,
            TERMINAL_ID_RISK_30DAY_WINDOW
        ) AS prediction, 
        count(*) AS tx_count  -- "values" is a reserved word in Redshift
    FROM public.cust_payment_tx_history
    WHERE cast(tx_datetime as date) >= '2022-08-01'
    GROUP BY 1,2;
    

  • We validated that the model is working: the counts returned form a confusion matrix (actual label vs. prediction), from which precision, recall, and the F1 score can be computed.
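As a sketch, here is how those metrics follow from the four confusion-matrix cells that the GROUP BY query returns (the counts below are hypothetical):

```python
# Hypothetical confusion-matrix counts from the GROUP BY tx_fraud, prediction query
tp = 90   # tx_fraud=1, prediction=1
fp = 10   # tx_fraud=0, prediction=1
fn = 20   # tx_fraud=1, prediction=0
tn = 880  # tx_fraud=0, prediction=0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```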

  • Create the first view, which aggregates streaming data at the customer level:

    CREATE VIEW public.customer_transformations
    AS SELECT 
        customer_id, 
        CUSTOMER_ID_NB_TX_1DAY_WINDOW, 
        CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,
        CUSTOMER_ID_NB_TX_7DAY_WINDOW,
        CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,
        CUSTOMER_ID_NB_TX_30DAY_WINDOW, 
        CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
    FROM (
        SELECT 
            customer_id,
            SUM(CASE WHEN cast(a.TX_DATETIME AS date) = cast(getdate() AS date) THEN TX_AMOUNT  ELSE 0 end) AS CUSTOMER_ID_NB_TX_1DAY_WINDOW,
            AVG(CASE WHEN cast(a.TX_DATETIME AS date) = cast(getdate() AS date) THEN TX_AMOUNT ELSE 0 end ) AS CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,
            SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -7 AND cast(getdate() AS date) THEN TX_AMOUNT ELSE 0 end) AS CUSTOMER_ID_NB_TX_7DAY_WINDOW,
            AVG(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -7 AND cast(getdate() AS date) THEN TX_AMOUNT  ELSE 0 end ) AS CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,
            SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -30 AND cast(getdate() AS date) THEN TX_AMOUNT ELSE 0 end) AS CUSTOMER_ID_NB_TX_30DAY_WINDOW,
            AVG( CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -30 AND cast(getdate() AS date) THEN TX_AMOUNT  ELSE 0 end ) AS CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
        FROM (
            SELECT 
                CUSTOMER_ID, 
                TERMINAL_ID, 
                TX_AMOUNT, 
                cast(TX_DATETIME AS timestamp) TX_DATETIME
            FROM CUST_PAYMENT_TX_STREAM  --retrieve streaming data
            WHERE cast(TX_DATETIME AS date) BETWEEN cast(getdate() AS date) -37 AND cast(getdate() AS date)
            UNION
            SELECT 
                CUSTOMER_ID, 
                TERMINAL_ID, 
                TX_AMOUNT,
                cast(TX_DATETIME AS timestamp) TX_DATETIME
            FROM cust_payment_tx_history   -- retrieve historical data
            WHERE  cast(TX_DATETIME AS date) BETWEEN cast(getdate() AS date) -37 AND cast(getdate() AS date)
        ) a
    GROUP BY 1
    );
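The windowed SUM/AVG logic in this view can be sketched in plain Python. This is a simplified illustration with hypothetical rows; `today` stands in for getdate(), and out-of-window rows contribute 0 to both the sum and the average, mirroring the CASE expressions:

```python
from datetime import date, timedelta

today = date(2022, 9, 30)  # stands in for getdate(); hypothetical "current" date

# Hypothetical transactions for one customer: (tx_date, tx_amount)
rows = [
    (today, 50.0),
    (today - timedelta(days=3), 20.0),
    (today - timedelta(days=25), 80.0),
]

def window_aggregates(rows, today, days):
    """Mirror the view's CASE logic: out-of-window rows contribute 0."""
    vals = [amt if today - timedelta(days=days) <= d <= today else 0.0
            for d, amt in rows]
    total = sum(vals)
    avg = total / len(vals) if vals else 0.0
    return total, avg

one_day = window_aggregates(rows, today, 0)    # same-day window
seven_day = window_aggregates(rows, today, 7)
thirty_day = window_aggregates(rows, today, 30)
print(one_day, seven_day, thirty_day)
```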
    
  • Then create the second view, which aggregates streaming data at the terminal level:

    CREATE VIEW public.terminal_transformations
    AS SELECT 
        TERMINAL_ID,
        TERMINAL_ID_NB_TX_1DAY_WINDOW, 
        TERMINAL_ID_RISK_1DAY_WINDOW, 
        TERMINAL_ID_NB_TX_7DAY_WINDOW, 
        TERMINAL_ID_RISK_7DAY_WINDOW,
        TERMINAL_ID_NB_TX_30DAY_WINDOW, 
        TERMINAL_ID_RISK_30DAY_WINDOW
    FROM (
        SELECT 
        TERMINAL_ID, 
        MAX(cast(a.TX_DATETIME AS date) ) maxdt, 
        MIN(cast(a.TX_DATETIME AS date) ) mindt, 
        MAX(tx_fraud) mxtf, 
        MIN(tx_fraud) mntf, 
        SUM(CASE WHEN tx_fraud =1 THEN 1 ELSE 0 end) sumtxfraud,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -7 AND cast(getdate() AS date)  AND tx_fraud = 1 THEN tx_fraud ELSE 0 end) AS NB_FRAUD_DELAY1,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -7 AND cast(getdate() AS date) THEN tx_fraud ELSE 0 end)  AS NB_TX_DELAY1 ,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -8 AND cast(getdate() AS date)  AND tx_fraud = 1 THEN tx_fraud ELSE 0 end) AS NB_FRAUD_DELAY_WINDOW1,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -8 AND cast(getdate() AS date) THEN 1 ELSE 0 end)  AS NB_TX_DELAY_WINDOW1,
        NB_FRAUD_DELAY_WINDOW1-NB_FRAUD_DELAY1 AS NB_FRAUD_WINDOW1,
        NB_TX_DELAY_WINDOW1-NB_TX_DELAY1 AS TERMINAL_ID_NB_TX_1DAY_WINDOW,
        CASE WHEN TERMINAL_ID_NB_TX_1DAY_WINDOW = 0 
            THEN 0 
            ELSE NB_FRAUD_WINDOW1 / TERMINAL_ID_NB_TX_1DAY_WINDOW  
        end AS terminal_id_risk_1day_window ,--7 day
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -7 AND cast(getdate() AS date) AND tx_fraud = 1 THEN tx_fraud ELSE 0 end ) AS NB_FRAUD_DELAY7,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -7 AND cast(getdate() AS date) THEN 1 ELSE 0 end)  AS NB_TX_DELAY7,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -14 AND cast(getdate() AS date)  AND tx_fraud = 1 THEN tx_fraud ELSE 0 end) AS NB_FRAUD_DELAY_WINDOW7,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date) -14 AND cast(getdate() AS date) THEN 1 ELSE 0 end)  AS NB_TX_DELAY_WINDOW7,
        NB_FRAUD_DELAY_WINDOW7-NB_FRAUD_DELAY7 AS NB_FRAUD_WINDOW7,
        NB_TX_DELAY_WINDOW7-NB_TX_DELAY7 AS TERMINAL_ID_NB_TX_7DAY_WINDOW,
        CASE WHEN TERMINAL_ID_NB_TX_7DAY_WINDOW = 0 
            THEN 0 
            ELSE NB_FRAUD_WINDOW7 / TERMINAL_ID_NB_TX_7DAY_WINDOW 
        END AS terminal_id_risk_7day_window, --30 day period
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date)-7 AND cast(getdate() AS date)  AND tx_fraud = 1 THEN tx_fraud ELSE 0 end) AS NB_FRAUD_DELAY30,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date)-7 AND cast(getdate() AS date) THEN 1 ELSE 0 end)  AS NB_TX_DELAY30,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date)-37 AND cast(getdate() AS date)  AND tx_fraud = 1 THEN tx_fraud ELSE 0 end) AS NB_FRAUD_DELAY_WINDOW30,
        SUM(CASE WHEN cast(a.TX_DATETIME AS date) BETWEEN  cast(getdate() AS date)-37 AND cast(getdate() AS date) THEN 1 ELSE 0 end)  AS NB_TX_DELAY_WINDOW30,
        NB_FRAUD_DELAY_WINDOW30-NB_FRAUD_DELAY30 AS NB_FRAUD_WINDOW30,
        NB_TX_DELAY_WINDOW30-NB_TX_DELAY30 AS TERMINAL_ID_NB_TX_30DAY_WINDOW,
        CASE WHEN TERMINAL_ID_NB_TX_30DAY_WINDOW = 0 
            THEN 0 
            ELSE NB_FRAUD_WINDOW30 / TERMINAL_ID_NB_TX_30DAY_WINDOW 
        END AS TERMINAL_ID_RISK_30DAY_WINDOW
        FROM(
            SELECT 
                TERMINAL_ID, 
                TX_AMOUNT, 
                cast(TX_DATETIME AS timestamp) TX_DATETIME, 
                0 AS TX_FRAUD
            FROM cust_payment_tx_stream
            WHERE cast(TX_DATETIME AS date) BETWEEN cast(getdate() AS date) -37 AND cast(getdate() AS date)
            UNION ALL
            SELECT  
                TERMINAL_ID,
                TX_AMOUNT,
                TX_DATETIME,
                TX_FRAUD
            FROM cust_payment_tx_history  
            WHERE cast(TX_DATETIME AS date) BETWEEN cast(getdate() AS date) -37 AND cast(getdate() AS date)
        ) a
    GROUP BY 1
    );
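Every risk column in this view follows the same pattern: fraud count in a window divided by transaction count in that window, guarded against division by zero. A minimal sketch with hypothetical counts:

```python
def terminal_risk(nb_fraud_window, nb_tx_window):
    """Fraud rate for a terminal's window, mirroring the view's CASE guard."""
    if nb_tx_window == 0:
        return 0  # no transactions in the window -> risk 0
    return nb_fraud_window / nb_tx_window

print(terminal_risk(3, 120))  # 3 fraudulent out of 120 transactions
print(terminal_risk(0, 0))    # empty window
```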
    
  • Create the third view, which combines incoming transactional data with customer and terminal aggregated data and calls the prediction function all in one place:

    CREATE VIEW public.cust_payment_tx_fraud_predictions
    AS SELECT
        A.APPROXIMATE_ARRIVAL_TIMESTAMP,
        D.FULL_NAME, 
        D.EMAIL_ADDRESS, 
        D.PHONE_NUMBER,
        A.TRANSACTION_ID, 
        A.TX_DATETIME, 
        A.CUSTOMER_ID, 
        A.TERMINAL_ID,
        A.TX_AMOUNT ,
        A.TX_TIME_SECONDS ,
        A.TX_TIME_DAYS ,
        fn_customer_cc_fd(
            A.TX_AMOUNT ,
            A.TX_DURING_WEEKEND,
            A.TX_DURING_NIGHT,
            C.CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
            C.CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
            C.CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
            C.CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
            C.CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
            C.CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
            T.TERMINAL_ID_NB_TX_1DAY_WINDOW ,
            T.TERMINAL_ID_RISK_1DAY_WINDOW ,
            T.TERMINAL_ID_NB_TX_7DAY_WINDOW ,
            T.TERMINAL_ID_RISK_7DAY_WINDOW ,
            T.TERMINAL_ID_NB_TX_30DAY_WINDOW ,
            T.TERMINAL_ID_RISK_30DAY_WINDOW
        ) FRAUD_PREDICTION
    FROM(
        SELECT
            APPROXIMATE_ARRIVAL_TIMESTAMP,
            TRANSACTION_ID, 
            TX_DATETIME, 
            CUSTOMER_ID, 
            TERMINAL_ID,
            TX_AMOUNT,
            TX_TIME_SECONDS,
            TX_TIME_DAYS,
            CASE WHEN extract(dow from cast(TX_DATETIME as timestamp)) in (0,6) then 1 else 0 end as TX_DURING_WEEKEND, -- Redshift DOW: 0=Sunday .. 6=Saturday
            CASE WHEN extract(hour from cast(TX_DATETIME as timestamp)) BETWEEN 00 and 06 then 1 else 0 end as TX_DURING_NIGHT
        FROM public.cust_payment_tx_stream
    ) a
    JOIN public.terminal_transformations t ON a.terminal_id = t.terminal_id
    JOIN public.customer_transformations c ON a.customer_id = c.customer_id
    JOIN postgres.customer_info d ON a.customer_id = d.customer_id
    WITH NO SCHEMA BINDING;
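The derived weekend and night flags in the inner query can be sketched in Python. Note the day-numbering difference: Redshift's EXTRACT(DOW) runs 0=Sunday..6=Saturday, while Python's weekday() runs 0=Monday..6=Sunday, so the weekend test below uses (5, 6):

```python
from datetime import datetime

def tx_flags(tx_datetime):
    """Weekend and night flags, mirroring the view's CASE expressions."""
    # Python weekday(): Monday=0 .. Sunday=6, so weekend days are 5 and 6
    during_weekend = 1 if tx_datetime.weekday() in (5, 6) else 0
    # SQL BETWEEN 00 AND 06 is inclusive on both ends
    during_night = 1 if 0 <= tx_datetime.hour <= 6 else 0
    return during_weekend, during_night

print(tx_flags(datetime(2022, 10, 1, 3, 0)))   # Saturday, 03:00
print(tx_flags(datetime(2022, 10, 5, 14, 0)))  # Wednesday, 14:00
```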
    
  • Run a SELECT statement on the view:

    SELECT * FROM cust_payment_tx_fraud_predictions WHERE FRAUD_PREDICTION = 1;
    

As you run the SELECT statement repeatedly, the latest credit card transactions are transformed and scored by the ML model in near real time.

Resources

  • Visit this page to find the latest documentation.