How to Deploy a Multi-Model Endpoint for Real-Time Inference

Anish Mahapatra
6 min read · Nov 28, 2023

This tutorial will teach us how to deploy multiple models behind a single real-time inference endpoint using AWS SageMaker. SageMaker is an IDE for Machine Learning, where we are able to perform end-to-end Machine Learning, including model deployment.


SageMaker has different inference options to support a broad range of use cases:

  • Real-Time Inference for synchronous, low-latency requests
  • Serverless Inference for intermittent traffic without managing instances
  • Asynchronous Inference for large payloads and long processing times
  • Batch Transform for offline predictions over a full dataset

In this blog, we will use SageMaker Inference to deploy a set of XGBoost models that have been trained on a synthetic house price prediction dataset. The dataset has details such as the number of bedrooms, square footage, and the number of bathrooms. Each model predicts the housing price for a single location. You are a Machine Learning Engineer whose role it is to deploy these models. We will be able to do the following by the end of the blog:

  • Create multiple SageMaker models from respective trained model artefacts
  • Configure and deploy a real-time endpoint to serve these models
  • Invoke the multi-model endpoint to run sample predictions using test data


Prerequisites:

  • Make sure you have an AWS Account set up
  • Set up an AWS SageMaker Domain

In case you need help setting these up, you can use the blogs that I wrote for this specific purpose. This blog is based on this reference link.

1. Set up AWS SageMaker Studio Domain & VPC

Head to the AWS CloudFormation link here and it should auto-populate a stack for CFN-SM-IM-Lambda-Catalog.

Next, set up the VPC using the link here. I have done these steps in detail with screenshots for reference in the blog below.

2. Set up SageMaker Studio and Notebook

Set up a SageMaker Studio and a Notebook in the US East (N. Virginia) region, generally labelled as us-east-1. When making a new notebook, you can set it up with the following configurations.

Now, we will run the scripts to create the models, and we will deploy a multi-model endpoint with real-time inference.


%pip install --upgrade -q aiobotocore
import boto3
import sagemaker
import time

from sagemaker.image_uris import retrieve
from time import gmtime, strftime

sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
write_prefix = "housing-prices-prediction-mme-demo"

region = sagemaker_session.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime")
role = sagemaker.get_execution_role()

# S3 locations used for parameterizing the notebook run
read_bucket = "sagemaker-sample-files"
read_prefix = "models/house_price_prediction"
model_prefix = "models/xgb-hpp"

# S3 location of trained model artifact
model_artifacts = f"s3://{default_bucket}/{model_prefix}/"

# Location
location = ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']

test_data = [1997, 2527, 6, 2.5, 0.57, 1]
# Initialize the S3 resource
s3 = boto3.resource('s3')

# Copy the four pretrained model artefacts into the default bucket
bucket = s3.Bucket(default_bucket)
for i in range(0, 4):
    copy_source = {'Bucket': read_bucket, 'Key': f"{read_prefix}/{location[i]}.tar.gz"}
    bucket.copy(copy_source, f"{model_prefix}/{location[i]}.tar.gz")
# Retrieve the SageMaker managed XGBoost image
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")

# Specify a unique model name that does not exist
model_name = "housing-prices-prediction-mme-xgb"
primary_container = {
    "Image": training_image,
    "ModelDataUrl": model_artifacts,
    "Mode": "MultiModel"
}

# Create the model if one with the same name does not exist
model_matches = sm_client.list_models(NameContains=model_name)["Models"]
if not model_matches:
    model = sm_client.create_model(ModelName=model_name,
                                   PrimaryContainer=primary_container,
                                   ExecutionRoleArn=role)
else:
    print(f"Model with name {model_name} already exists! Change model name to create new")
# Endpoint Config name
endpoint_config_name = f"{model_name}-endpoint-config"

# Create endpoint config if one with the same name does not exist
endpoint_config_matches = sm_client.list_endpoint_configs(NameContains=endpoint_config_name)["EndpointConfigs"]
if not endpoint_config_matches:
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
            }
        ],
    )
else:
    print(f"Endpoint config with name {endpoint_config_name} already exists! Change endpoint config name to create new")

Now, you will find the housing-prices-prediction-mme-xgb model created in the Models section under

AWS SageMaker > Inference > Models

# Endpoint name
endpoint_name = f"{model_name}-endpoint"

# Create endpoint if one with the same name does not exist
endpoint_matches = sm_client.list_endpoints(NameContains=endpoint_name)["Endpoints"]
if not endpoint_matches:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
else:
    print(f"Endpoint with name {endpoint_name} already exists! Change endpoint name to create new")

# Poll until the endpoint leaves the "Creating" state
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
while status == "Creating":
    print(f"Endpoint Status: {status}...")
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
print(f"Endpoint Status: {status}")

Now, if you go to the Endpoints section, you will be able to see an active endpoint that we can query.

AWS SageMaker > Inference > Endpoints


We can click on the endpoint to see its details and usage statistics.

Now, if you remember, we made a sample list of values for which we wanted to get predictions. We will now pass it as a payload to the endpoint.

location = ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']
test_data = [1997, 2527, 6, 2.5, 0.57, 1]

# Convert the elements in test data to a string payload
payload = ' '.join([str(elem) for elem in test_data])

# Invoke each location's model on the same endpoint via TargetModel
for i in range(0, 4):
    predicted_value = sm_runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                                        TargetModel=f"{location[i]}.tar.gz",
                                                        ContentType="text/csv",
                                                        Body=payload)
    print(f"Predicted Value for {location[i]} target model:\n ${predicted_value['Body'].read().decode('utf-8')}")

We are able to hit the multi-model endpoint and get a predicted house price back from each target model. If we check the behaviour of the endpoint, we will be able to see additional information.

We have successfully implemented a multi-model endpoint for real-time inference.

If you want to see the multiple models that have been deployed in a single endpoint, you can type CloudWatch in the Search bar and go to the metrics.

CloudWatch > Metrics > All Metrics

Here, if you click on Endpoints, you will be able to see metrics for the different models served by the endpoint.
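The same endpoint metrics can also be pulled programmatically with boto3. Below is a minimal sketch that builds a GetMetricStatistics request for the endpoint's Invocations metric over the last hour; the endpoint and variant names are the ones used earlier in this tutorial, and the actual CloudWatch call is left commented out since it needs AWS credentials.

```python
from datetime import datetime, timedelta, timezone

def invocation_metric_query(endpoint_name, variant_name="AllTraffic", hours=1):
    """Build GetMetricStatistics parameters for a SageMaker endpoint's Invocations metric."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,          # 5-minute buckets
        "Statistics": ["Sum"],
    }

params = invocation_metric_query("housing-prices-prediction-mme-xgb-endpoint")
# cloudwatch = boto3.client("cloudwatch", region_name=region)
# print(cloudwatch.get_metric_statistics(**params)["Datapoints"])
```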

Resources Cleanup

Clean up by deleting the model, the endpoint configuration, and the endpoint.

# Delete model
sm_client.delete_model(ModelName=model_name)

# Delete endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

# Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

Go to S3 Buckets by typing it out in the search bar and delete the bucket that has the tar files of the models.
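If you prefer to empty the bucket from the notebook instead of the console, here is a minimal sketch. The helper builds the Delete payload that S3's delete_objects call expects from a list of object keys; the destructive boto3 call itself is commented out, and the keys assume the model_prefix and location names used earlier in this tutorial.

```python
def build_delete_request(keys):
    """Build the Delete payload for S3 delete_objects from a list of object keys."""
    return {"Objects": [{"Key": key} for key in keys], "Quiet": True}

model_keys = [f"models/xgb-hpp/{loc}.tar.gz"
              for loc in ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']]
request = build_delete_request(model_keys)

# s3_client.delete_objects(Bucket=default_bucket, Delete=request)
```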

Then, you can proceed to delete the CloudFormation stack.
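Stack deletion can also be scripted. A sketch, assuming the stack still carries the CFN-SM-IM-Lambda-Catalog name from the setup step; the calls are commented out because they tear down the whole Studio setup.

```python
stack_name = "CFN-SM-IM-Lambda-Catalog"  # name from the setup step; adjust if yours differs

# cfn = boto3.client("cloudformation", region_name=region)
# cfn.delete_stack(StackName=stack_name)
# cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```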

That’s all folks, we are done!

Find me on LinkedIn!


