How to Deploy a Multi-Model Endpoint for Real-Time Inference

Anish Mahapatra
6 min read · Nov 28, 2023


This tutorial shows how to deploy multiple models behind a single real-time inference endpoint (a multi-model endpoint) using AWS SageMaker. SageMaker is a managed Machine Learning service, and SageMaker Studio is its IDE, where we are able to perform end-to-end Machine Learning, including model deployment.


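SageMaker has different inference options to support a broad range of use cases:

  • Real-Time Inference for workloads that need low latency
  • Serverless Inference for intermittent or infrequent traffic
  • Asynchronous Inference for large payloads or long processing times
  • Batch Transform for running predictions on batches of data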


In this blog, we will use SageMaker Real-Time Inference to deploy a set of XGBoost models that have been trained on a synthetic house price prediction dataset. The dataset has details such as the number of bedrooms, the square footage, and the number of bathrooms. Each model predicts the house price for a single location. You are a Machine Learning Engineer whose role is to deploy these models. By the end of the blog, we will be able to do the following:

  • Create multiple SageMaker models from respective trained model artefacts
  • Configure and deploy a real-time endpoint to serve these models
  • Invoke the multi-model endpoint to run sample predictions using test data


Prerequisites

  • Make sure you have an AWS account set up
  • Set up an AWS SageMaker Domain

In case you need help in setting it up, you can use the following blogs that I wrote for this specific purpose. I made this blog using this reference link.

1. Set up AWS SageMaker Studio Domain & VPC

Head to the AWS CloudFormation link here and it should auto-populate a stack for CFN-SM-IM-Lambda-Catalog.

Next, set up the VPC using the link here. I have done these steps in detail with screenshots for reference in the blog below.

2. Set up SageMaker Studio and Notebook

Set up SageMaker Studio and a notebook in the US East (N. Virginia) region, whose region code is us-east-1. When creating a new notebook, you can set it up with the following configuration.

Now, we will run the scripts to create the model and deploy a multi-model endpoint with real-time inference.


%pip install --upgrade -q aiobotocore

import time

import boto3
import sagemaker
from sagemaker.image_uris import retrieve

sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
write_prefix = "housing-prices-prediction-mme-demo"

region = sagemaker_session.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime")
role = sagemaker.get_execution_role()

# S3 locations used for parameterizing the notebook run
read_bucket = "sagemaker-sample-files"
read_prefix = "models/house_price_prediction"
model_prefix = "models/xgb-hpp"

# S3 location of trained model artifact
model_artifacts = f"s3://{default_bucket}/{model_prefix}/"

# Location
location = ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']

test_data = [1997, 2527, 6, 2.5, 0.57, 1]

# Initialize the S3 resource
s3 = boto3.resource("s3")
bucket = s3.Bucket(default_bucket)

# Copy each location's model artifact from the sample bucket to our bucket
for loc in location:
    copy_source = {"Bucket": read_bucket, "Key": f"{read_prefix}/{loc}.tar.gz"}
    bucket.copy(copy_source, f"{model_prefix}/{loc}.tar.gz")

# Retrieve the SageMaker managed XGBoost image
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")

# Specify a unique model name that does not exist
model_name = "housing-prices-prediction-mme-xgb"

# In MultiModel mode, ModelDataUrl is the S3 prefix that holds all the model artifacts
primary_container = {
    "Image": training_image,
    "ModelDataUrl": model_artifacts,
    "Mode": "MultiModel",
}

# Create the model if one with the same name does not exist
model_matches = sm_client.list_models(NameContains=model_name)["Models"]
if not model_matches:
    model = sm_client.create_model(
        ModelName=model_name,
        PrimaryContainer=primary_container,
        ExecutionRoleArn=role,
    )
else:
    print(f"Model with name {model_name} already exists! Change model name to create new")

# Endpoint Config name
endpoint_config_name = f"{model_name}-endpoint-config"

# Create endpoint config if one with the same name does not exist
endpoint_config_matches = sm_client.list_endpoint_configs(NameContains=endpoint_config_name)["EndpointConfigs"]
if not endpoint_config_matches:
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
            }
        ],
    )
else:
    print(f"Endpoint config with name {endpoint_config_name} already exists! Change endpoint config name to create new")

You will now find the housing-prices-prediction-mme-xgb model in the Models section at

AWS SageMaker > Inference > Models
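
The same check can also be made from the notebook instead of the console. The sketch below is a minimal example, assuming a boto3 SageMaker client such as sm_client. The helper name model_exists and the FakeSageMakerClient stand-in are hypothetical and exist only for illustration. Because NameContains is a substring filter, the helper also compares the full model name.

```python
def model_exists(client, model_name):
    """Return True if a SageMaker model with exactly this name is listed.

    `client` is assumed to be a boto3 SageMaker client (or any object with a
    compatible `list_models` method). `model_exists` is a hypothetical helper.
    """
    # NameContains is a substring filter, so confirm the exact name ourselves.
    response = client.list_models(NameContains=model_name)
    return any(m["ModelName"] == model_name for m in response["Models"])


class FakeSageMakerClient:
    """Tiny stand-in used only to exercise the helper without AWS access."""

    def __init__(self, names):
        self._names = names

    def list_models(self, NameContains):
        matches = [n for n in self._names if NameContains in n]
        return {"Models": [{"ModelName": n} for n in matches]}


# A similarly named model matches the substring filter but not the exact name.
fake = FakeSageMakerClient(["housing-prices-prediction-mme-xgb-v2"])
print(model_exists(fake, "housing-prices-prediction-mme-xgb"))
```

In the notebook, the call would be model_exists(sm_client, model_name). Note that list_models returns results in pages, so an account with many matching models may need pagination, which this sketch leaves out.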

# Endpoint name
endpoint_name = f"{model_name}-endpoint"

endpoint_matches = sm_client.list_endpoints(NameContains=endpoint_name)["Endpoints"]
if not endpoint_matches:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
else:
    print(f"Endpoint with name {endpoint_name} already exists! Change endpoint name to create new")

# Poll once a minute until the endpoint leaves the "Creating" state
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
while status == "Creating":
    print(f"Endpoint Status: {status}...")
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
print(f"Endpoint Status: {status}")

Now, if you go to the Endpoints section, you will see an active endpoint that we can query.

AWS SageMaker > Inference > Endpoints
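
Endpoint creation can take several minutes, and it can also fail, so it may help to wait with a time limit and check the final status. The sketch below is one possible way to do that, assuming a boto3 SageMaker client. The helper wait_for_endpoint, its default intervals, and the FakeClient stand-in are hypothetical choices for illustration.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, timeout_seconds=1800,
                      sleep=time.sleep):
    """Poll describe_endpoint until the endpoint leaves "Creating" or we time out.

    `wait_for_endpoint` is a hypothetical helper; `client` is assumed to be a
    boto3 SageMaker client. The default intervals are arbitrary choices.
    """
    waited = 0
    status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    return status


class FakeClient:
    """Stand-in that reports "Creating" twice and then "InService"."""

    def __init__(self):
        self._statuses = iter(["Creating", "Creating", "InService"])

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses)}


# The no-op sleep keeps the demonstration instant.
final_status = wait_for_endpoint(FakeClient(), "demo-endpoint", sleep=lambda s: None)
print(final_status)
```

In the notebook, the call would be wait_for_endpoint(sm_client, endpoint_name), and anything other than "InService" in the return value means the endpoint is not ready to query. boto3 also ships a built-in waiter, sm_client.get_waiter("endpoint_in_service"), which may be simpler if custom status printing is not needed.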


We can click on the endpoint to see its details and usage statistics.

Earlier, we defined a sample list of feature values, test_data, for which we want predictions. We will now pass that list to each model through the endpoint.

location = ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']
test_data = [1997, 2527, 6, 2.5, 0.57, 1]

# Convert the elements in test data to a single space-separated string
payload = ' '.join([str(elem) for elem in test_data])

# TargetModel selects which artifact under the model prefix serves the request
for i in range(0, 4):
    predicted_value = sm_runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=f"{location[i]}.tar.gz",
        ContentType="text/csv",
        Body=payload,
    )
    print(f"Predicted Value for {location[i]} target model:\n ${predicted_value['Body'].read().decode('utf-8')}")
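
What makes this a multi-model call is the TargetModel argument: it names the artifact, relative to the S3 prefix given as ModelDataUrl, that should serve the request. The sketch below collects the arguments for one location in a small helper so they can be inspected before any request is sent. The helper name build_invoke_kwargs is hypothetical, and it assumes every artifact is named after its location and sits directly under the model prefix.

```python
def build_invoke_kwargs(endpoint_name, location_name, features):
    """Build keyword arguments for sagemaker-runtime `invoke_endpoint`.

    Hypothetical helper. It assumes each artifact is named "<location>.tar.gz"
    directly under the model prefix, as in the copy step of this post.
    """
    return {
        "EndpointName": endpoint_name,
        # Path of the artifact relative to the S3 prefix given as ModelDataUrl.
        "TargetModel": f"{location_name}.tar.gz",
        "ContentType": "text/csv",
        # One row of feature values, space-separated as in the post.
        "Body": " ".join(str(value) for value in features),
    }


kwargs = build_invoke_kwargs(
    "housing-prices-prediction-mme-xgb-endpoint",
    "Chicago_IL",
    [1997, 2527, 6, 2.5, 0.57, 1],
)
print(kwargs["TargetModel"])
print(kwargs["Body"])
```

The resulting dictionary can be passed straight to the runtime client with sm_runtime_client.invoke_endpoint(**kwargs).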

We are able to hit the multi-model endpoint and get the following results.
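
The response body arrives as text, so it needs to be converted before the predictions can be used as numbers. The sketch below is a minimal example, assuming the body is UTF-8 text containing one or more numbers separated by commas or whitespace. The helper name parse_predictions is hypothetical, and the sample value is made up purely for illustration.

```python
def parse_predictions(body_bytes):
    """Convert a text response body into a list of floats.

    Hypothetical helper. It assumes the body is UTF-8 text holding one or more
    numbers separated by commas or whitespace.
    """
    text = body_bytes.decode("utf-8")
    return [float(token) for token in text.replace(",", " ").split()]


sample_body = b"412345.5\n"  # made-up value for illustration, not a real result
print(parse_predictions(sample_body))
```

With a real response, the call would be parse_predictions(predicted_value["Body"].read()). The body is a stream, so it can only be read once per response.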

We successfully got a prediction from each model.

Now, if we check the behaviour of the endpoint, we will be able to see additional information.

We have successfully deployed a multi-model endpoint for real-time inference.

If you want to see the multiple models that have been deployed in a single endpoint, you can type CloudWatch in the Search bar and go to the metrics.

CloudWatch > Metrics > All Metrics

Here, if you click on endpoints, you will be able to see the different models that have been deployed.
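
The same metrics can be queried from the notebook. The sketch below only builds the arguments for the CloudWatch get_metric_statistics call, so it runs without AWS access. The helper name mme_metric_query is hypothetical, and the default metric name ModelCacheHit in the AWS/SageMaker namespace is an assumption that should be checked against the metric names shown in the CloudWatch console for your endpoint.

```python
from datetime import datetime, timedelta, timezone


def mme_metric_query(endpoint_name, metric_name="ModelCacheHit",
                     variant_name="AllTraffic", minutes=60, now=None):
    """Build keyword arguments for CloudWatch `get_metric_statistics`.

    Hypothetical helper. The default metric name and the "AWS/SageMaker"
    namespace are assumptions; verify them in the CloudWatch console.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # seconds per datapoint
        "Statistics": ["Average"],
    }


# A fixed timestamp keeps the example reproducible.
query = mme_metric_query(
    "housing-prices-prediction-mme-xgb-endpoint",
    now=datetime(2023, 11, 28, 12, 0, tzinfo=timezone.utc),
)
print(query["StartTime"].isoformat())
```

To run the query, pass the dictionary to a CloudWatch client, for example boto3.client("cloudwatch").get_metric_statistics(**query), and read the "Datapoints" list in the response.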

Resources Cleanup

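Clean up by deleting the model, the endpoint configuration, and the endpoint.

# Delete model
sm_client.delete_model(ModelName=model_name)

# Delete endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

# Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)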

Go to S3 Buckets by typing it out in the search bar and delete the bucket that has the tar files of the models.
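
If the bucket also holds other data, an alternative is to delete only the model artifacts that were copied for this demo. The sketch below builds the request for the S3 client's delete_objects call without sending it. The helper name build_delete_request and the bucket name are hypothetical, and it assumes the artifacts live at the model prefix under their location names.

```python
def build_delete_request(bucket_name, model_prefix, locations):
    """Build keyword arguments for the S3 client's `delete_objects` call.

    Hypothetical helper. It assumes the artifacts were copied to
    "<model_prefix>/<location>.tar.gz", as in the copy step of this post.
    """
    keys = [f"{model_prefix}/{name}.tar.gz" for name in locations]
    return {
        "Bucket": bucket_name,
        "Delete": {"Objects": [{"Key": key} for key in keys]},
    }


request = build_delete_request(
    "example-bucket",  # placeholder bucket name
    "models/xgb-hpp",
    ["Chicago_IL", "Houston_TX", "NewYork_NY", "LosAngeles_CA"],
)
print(request["Delete"]["Objects"][0]["Key"])
```

In the notebook, the request would be built with default_bucket, model_prefix, and location, then sent with s3_client.delete_objects(**request).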

Then, you can proceed to delete the CloudFormation stack.

That’s all folks, we are done!

Find me on LinkedIn!


