Build and Train a Machine Learning Model on AWS SageMaker

Anish Mahapatra
8 min read · Nov 24, 2023

Go to the AWS Home Console

In case you have not set up AWS SageMaker or want to understand the basics, please feel free to go through the blog below.


This tutorial will teach us how to build and train an ML model with AWS SageMaker Studio. Using SageMaker Studio, we can do the following:

  • Explore Datasets
  • Prepare training data
  • Build and train models
  • Deploy trained models for inference

Exploring a sample of the dataset, iterating over multiple models, and tuning parameter configurations before scaling to the entire dataset all require experimentation. On AWS SageMaker, this can be done in “local” mode.

In this blog, we will use a synthetically generated auto-insurance claims dataset. For this, the inputs will be:

  • train data
  • test data

Each of these has a fraud column that indicates whether the claim was fraudulent. We will use XGBoost to build a simple binary classification model. I have used a reference for the code, which can be found here.
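If you want a quick look at the data before setting anything up, the sample files live in a public S3 bucket (the same paths we point the notebook at later). Below is a minimal sketch, assuming pandas can read straight from S3 (s3fs is available in the SageMaker Data Science image):

import pandas as pd

# Public sample dataset used later in this tutorial
train_uri = "s3://sagemaker-sample-files/datasets/tabular/synthetic_automobile_claims/train.csv"

# Preview the shape of the training data and the distribution of the "fraud" label
df = pd.read_csv(train_uri)
print(df.shape)
print(df["fraud"].value_counts())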

00. Aim

In this blog, we will aim to do the following:

  • Ingest training data from an Amazon S3 bucket into Amazon SageMaker
  • Build and train an XGBoost model
  • Save the trained model and artefacts to Amazon S3

01. Prerequisites

  • Make sure you have an AWS account
  • Set up AWS SageMaker Studio

Please feel free to use any guide to do the above. To help, I have also done the same with an extensive array of screenshots, which you can find here:

Let’s begin!

02. Create a Virtual Private Cloud (VPC)

Go to the link here to create your Virtual Private Cloud. We are using the North Virginia (us-east-1) region.

The console should look like this. Create the VPC.

Select the following:

  • Resources to create: VPC and more
  • Number of Availability Zones: 1
  • Number of public subnets: 1
  • Number of private subnets: 0
  • VPC endpoints: None
  • Leave everything else at the default settings.

The console will then run the VPC creation workflow.

Select View VPC. These are the details of my VPC.
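If you prefer to script this step instead of clicking through the console, the sketch below mirrors the choices above (one Availability Zone, one public subnet, no private subnets) using boto3. The CIDR blocks and the Availability Zone name are assumptions, not values taken from the wizard:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the VPC (the CIDR block is an assumed example)
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# One public subnet in a single Availability Zone (zone name is an assumption)
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/24", AvailabilityZone="us-east-1a")

# Attach an internet gateway; a route table entry pointing to it is also needed
# for the subnet to be truly public (omitted here for brevity)
igw = ec2.create_internet_gateway()
ec2.attach_internet_gateway(
    InternetGatewayId=igw["InternetGateway"]["InternetGatewayId"], VpcId=vpc_id
)

print("Created VPC:", vpc_id)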

03. Create CloudFormation Stack

Click on the link here to create the CloudFormation stack.

This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. The stack takes about 10 minutes to create. The region is still North Virginia (us-east-1).

Do not input anything in the IAM role permissions section.

Create the Stack. It will take 10 minutes or so.
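If you would rather create the stack from code, boto3's CloudFormation client can do it too. The stack name and template URL below are placeholders for illustration, not values from this tutorial; substitute the template URL carried by the quick-create link:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder stack name and template URL (hypothetical); the template creates IAM
# resources, which is why CAPABILITY_IAM is acknowledged here
cfn.create_stack(
    StackName="sagemaker-studio-demo",
    TemplateURL="https://example.com/sagemaker-studio-template.yaml",
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the stack is created (roughly 10 minutes)
cfn.get_waiter("stack_create_complete").wait(StackName="sagemaker-studio-demo")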

04. Create an S3 Bucket

Search for S3 in your Search Bar.

Select the Create Bucket Button.

Make sure your region is us-east-1 and give your bucket a name of your choice. I am going to name mine “sagemakeranish”.

Now, go ahead and create your bucket.
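The same bucket can also be created from code; a minimal sketch, assuming the name "sagemakeranish" is still free (bucket names are globally unique, so you will likely need your own):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1, create_bucket takes no LocationConstraint; other regions require one
s3.create_bucket(Bucket="sagemakeranish")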

05. Set up SageMaker Studio

Go to SageMaker Studio by opening a new tab in your AWS home console and searching for “SageMaker Studio”.

Select Create Domain.

Select “Set up for single user (Quick Setup)”.

You will now have a StudioDomain selected as shown below.

Click on it and go to user profiles, and you will see the studio-user we created previously. Select “Launch”, then select “Studio”.
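To double-check that the domain and the studio-user profile exist before launching Studio, you can query them with boto3; a small sketch:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# List the Studio domains in the region and the user profiles in the first one
domains = sm.list_domains()["Domains"]
print([d["DomainName"] for d in domains])

if domains:
    domain_id = domains[0]["DomainId"]
    profiles = sm.list_user_profiles(DomainIdEquals=domain_id)["UserProfiles"]
    print([p["UserProfileName"] for p in profiles])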

06. JupyterLab on AWS SageMaker

This will start up JupyterLab, and your screen will look like this.

06.01 Open New Notebook

Select File > New > Notebook and select “Data Science 3.0”.

We will see the options below selected by default.

The notebook kernel will start in a few minutes.

Once it is set up, you will see Python 3 in the top right corner.

07. Run Code

%pip install --upgrade -q aiobotocore
%pip install -q xgboost==1.3.1

Restart your kernel.

import pandas as pd
import boto3
import sagemaker
import json
import joblib
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Import necessary libraries for data handling, AWS services, and model evaluation

# Initialize a SageMaker session and retrieve the AWS region
sess = sagemaker.Session()
region = sess.boto_region_name

# Create an S3 client for interacting with S3 buckets in the given region
s3_client = boto3.client("s3", region_name=region)

# Obtain the SageMaker execution role
sagemaker_role = sagemaker.get_execution_role()

# Define S3 bucket names, prefixes, and file paths for data storage and retrieval

# Default bucket and prefix for storing model and output data
write_bucket = sess.default_bucket()
write_prefix = "fraud-detect-demo"

# Bucket and prefix for reading sample data from S3
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims"

# File keys for train and test data, model, and output
train_data_key = f"{read_prefix}/train.csv"
test_data_key = f"{read_prefix}/test.csv"
model_key = f"{write_prefix}/model"
output_key = f"{write_prefix}/output"

# S3 URIs for train and test data
train_data_uri = f"s3://{read_bucket}/{train_data_key}"
test_data_uri = f"s3://{read_bucket}/{test_data_key}"
# Define hyperparameters for the XGBoost model
hyperparams = {
    "max_depth": 3,
    "eta": 0.2,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3
}

num_boost_round = 100 # Number of boosting rounds
nfold = 3 # Number of folds for cross-validation
early_stopping_rounds = 10 # Early stopping rounds

# Set up data input

label_col = "fraud" # Define the label column name

# Read training data from the specified S3 URI into a Pandas DataFrame
data = pd.read_csv(train_data_uri)

# Extract features and labels from the data
train_features = data.drop(label_col, axis=1) # Features excluding the label column
train_label = pd.DataFrame(data[label_col]) # DataFrame containing only the label column

# Prepare the data in the format expected by XGBoost (DMatrix)
dtrain = xgb.DMatrix(train_features, label=train_label)

# Cross-validate on the training data using XGBoost's built-in cross-validation method
cv_results = xgb.cv(
    params=hyperparams,
    dtrain=dtrain,
    num_boost_round=num_boost_round,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    metrics=["auc"],  # Evaluation metric
    seed=10,
)

# Aggregate and organize the cross-validation metrics
metrics_data = {
    "binary_classification_metrics": {
        "validation:auc": {
            "value": cv_results.iloc[-1]["test-auc-mean"],  # Mean validation AUC
            "standard_deviation": cv_results.iloc[-1]["test-auc-std"]  # Standard deviation of validation AUC
        },
        "train:auc": {
            "value": cv_results.iloc[-1]["train-auc-mean"],  # Mean training AUC
            "standard_deviation": cv_results.iloc[-1]["train-auc-std"]  # Standard deviation of training AUC
        },
    }
}

# Print the cross-validated AUC scores
print(f"Cross-validated train-auc:{cv_results.iloc[-1]['train-auc-mean']:.2f}")
print(f"Cross-validated validation-auc:{cv_results.iloc[-1]['test-auc-mean']:.2f}")
# Read the test data from the specified S3 URI into a Pandas DataFrame
data = pd.read_csv(test_data_uri)

# Extract test features and labels from the test data
test_features = data.drop(label_col, axis=1) # Test features excluding the label column
test_label = pd.DataFrame(data[label_col]) # DataFrame containing only the label column for test data

# Prepare the test data in the format expected by XGBoost (DMatrix)
dtest = xgb.DMatrix(test_features, label=test_label)

# Train the XGBoost model using specified hyperparameters
model = xgb.train(
    params=hyperparams,
    dtrain=dtrain,
    evals=[(dtrain, 'train'), (dtest, 'eval')],  # Evaluation sets for training and test data
    num_boost_round=num_boost_round,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=0  # Suppress verbosity during training
)

# Make predictions using the trained model on test and train datasets
test_pred = model.predict(dtest) # Predictions on the test data
train_pred = model.predict(dtrain) # Predictions on the train data

# Calculate AUC scores for model evaluation
test_auc = roc_auc_score(test_label, test_pred) # AUC score for the test data
train_auc = roc_auc_score(train_label, train_pred) # AUC score for the train data

# Print the AUC scores for model evaluation
print(f"Train-auc:{train_auc:.2f}, Test-auc:{test_auc:.2f}")
# Save model and performance metrics locally

# Save the metrics data as a JSON file locally
with open("./metrics.json", "w") as f:
json.dump(metrics_data, f)

# Save the trained XGBoost model locally using joblib
with open("./xgboost-model", "wb") as f:
joblib.dump(model, f)

# Upload model and performance metrics to S3

# Define file paths for metrics and model in the S3 bucket
metrics_location = output_key + "/metrics.json" # Location for metrics JSON file in S3
model_location = model_key + "/xgboost-model" # Location for model file in S3

# Upload the metrics JSON file from local to S3 bucket
s3_client.upload_file(Filename="./metrics.json", Bucket=write_bucket, Key=metrics_location)

# Upload the trained model file from local to S3 bucket
s3_client.upload_file(Filename="./xgboost-model", Bucket=write_bucket, Key=model_location)

The model has now been saved to our S3 bucket.
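To sanity-check the saved artefact, you can pull it back from S3 and score the test set again; a minimal sketch that reuses the variables defined above:

# Download the saved model artifact from S3 and reload it with joblib
s3_client.download_file(Bucket=write_bucket, Key=model_location, Filename="./xgboost-model-restored")
restored_model = joblib.load("./xgboost-model-restored")

# Score the test set with the restored model; the AUC should match the value printed earlier
restored_pred = restored_model.predict(dtest)
print(f"Restored model test-auc:{roc_auc_score(test_label, restored_pred):.2f}")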

08. View Model Results

If we go to our buckets in S3 now, we will be able to see a new bucket created with the prefix “sagemaker-us-east-1”

In there, we can see our fraud detection demo file.
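The same check can be done from the notebook instead of the console; this short sketch lists everything written under our prefix in the default bucket:

# List the objects written under the demo prefix in the default SageMaker bucket
response = s3_client.list_objects_v2(Bucket=write_bucket, Prefix=write_prefix)
for obj in response.get("Contents", []):
    print(obj["Key"])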

Done! We successfully built, trained, and saved a model end-to-end on AWS SageMaker.

