Build, Train, and Deploy a Machine Learning Model with Amazon SageMaker

Step-by-step guide to using SageMaker notebook instances to train an XGBoost binary classification model and deploy it as a real-time inference endpoint.

October 10, 2021 · 3 min read · By Kshitiz Regmi

Amazon SageMaker is a fully managed ML platform that enables data scientists and ML engineers to prepare, build, train, and deploy models quickly — without managing infrastructure. It supports Jupyter notebooks, TensorFlow, PyTorch, XGBoost, and more.

Why Amazon SageMaker?

Traditional ML workflows require:

  • Setting up and managing Jupyter environments
  • Provisioning GPU/CPU compute for training jobs
  • Building, containerizing, and maintaining inference servers

SageMaker handles all of this. A SageMaker Notebook Instance is a managed Jupyter environment backed by EC2: you write the code, and SageMaker provisions and manages the compute.

What You'll Build

A binary classification model (customer churn prediction) using SageMaker's built-in XGBoost algorithm, trained on S3 data and deployed as a real-time HTTPS endpoint.

Step 1: Create a Notebook Instance

  1. Open the SageMaker console → Notebook instances → Create notebook instance
  2. Name: ml-tutorial
  3. Instance type: ml.t3.medium
  4. IAM role: Create new (allow S3 access)
  5. Click Create notebook instance and wait for status InService

Step 2: Set Up the Session

import boto3
import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
bucket = session.default_bucket()
prefix = "xgboost-churn"

print(f"Role: {role}")
print(f"Bucket: {bucket}")
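For orientation: unless a different default was configured on the session, `session.default_bucket()` resolves to a bucket named `sagemaker-<region>-<account-id>`. The sketch below reconstructs the S3 locations this tutorial ends up using under that assumption. The helper name `tutorial_s3_layout`, the region, and the account ID are illustrative, not values from the post.

```python
def tutorial_s3_layout(region: str, account_id: str, prefix: str = "xgboost-churn") -> dict:
    """Return the S3 URIs the tutorial reads from and writes to.

    Assumes the session uses SageMaker's default bucket naming scheme.
    """
    bucket = f"sagemaker-{region}-{account_id}"
    base = f"s3://{bucket}/{prefix}"
    return {
        "bucket": bucket,
        "train": f"{base}/train/train.csv",
        "test": f"{base}/test/test.csv",
        "output": f"{base}/output",
    }


# Hypothetical region and account ID, for illustration only
layout = tutorial_s3_layout("us-east-1", "123456789012")
print(layout["train"])
```

Comparing these URIs with the values printed in the notebook is a quick way to confirm that data and model artifacts land where you expect.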

Step 3: Prepare and Upload Data to S3

SageMaker training jobs read their input from S3. For CSV input, the built-in XGBoost algorithm expects the target in the first column and no header row:

import pandas as pd
from sklearn.model_selection import train_test_split

# Assume df is your churn dataset
# Target: 'churn' (0/1), features: all other columns
df = pd.read_csv("churn.csv")

# Move target column to front (SageMaker XGBoost requirement)
cols = ['churn'] + [c for c in df.columns if c != 'churn']
df = df[cols]

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("train.csv", index=False, header=False)
test_df.to_csv("test.csv", index=False, header=False)

# Upload to S3
train_path = session.upload_data("train.csv", bucket=bucket, key_prefix=f"{prefix}/train")
test_path  = session.upload_data("test.csv",  bucket=bucket, key_prefix=f"{prefix}/test")
print("Train data:", train_path)
print("Test data:", test_path)
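Format mistakes (a stray header row, a non-numeric column, a label outside 0/1) usually surface only after the training job has started, so a local check before uploading can save a round trip. This is a minimal sketch using only the standard library. The function name `check_xgboost_csv` and the sample columns are made up for illustration, and the check is stricter than the algorithm itself because it rejects empty cells.

```python
import csv
import os
import tempfile


def check_xgboost_csv(path: str) -> int:
    """Validate a CSV for binary classification with the label in column 0.

    Every cell must parse as a number (this also catches a header row),
    every row must have the same width, and the label must be 0 or 1.
    Returns the number of data rows.
    """
    n_rows = 0
    width = None
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if not row:
                continue  # tolerate blank lines
            if width is None:
                width = len(row)
            elif len(row) != width:
                raise ValueError(f"line {line_no}: expected {width} columns, got {len(row)}")
            try:
                values = [float(cell) for cell in row]
            except ValueError:
                raise ValueError(f"line {line_no}: non-numeric value (header row?)") from None
            if values[0] not in (0.0, 1.0):
                raise ValueError(f"line {line_no}: label must be 0 or 1, got {row[0]}")
            n_rows += 1
    return n_rows


# Demo on throwaway files with made-up columns
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    with open(good, "w", newline="") as f:
        f.write("1,45,200.5\n0,30,150.0\n")
    rows_ok = check_xgboost_csv(good)

    with_header = os.path.join(tmp, "with_header.csv")
    with open(with_header, "w", newline="") as f:
        f.write("churn,age,spend\n1,45,200.5\n")
    try:
        check_xgboost_csv(with_header)
        header_rejected = False
    except ValueError:
        header_rejected = True

print(rows_ok, header_rejected)
```

In the notebook you would call `check_xgboost_csv("train.csv")` and `check_xgboost_csv("test.csv")` just before the upload.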

Step 4: Configure the XGBoost Estimator

SageMaker ships its built-in algorithms as Docker images. `image_uris.retrieve` looks up the image URI for your region and the requested XGBoost version:

from sagemaker.estimator import Estimator

xgboost_image = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, "1.5-1"
)

estimator = Estimator(
    image_uri=xgboost_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    max_depth=5,
    eta=0.2,
    eval_metric="auc",
    subsample=0.8,
    colsample_bytree=0.8,
)
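With `objective="binary:logistic"` the model outputs a probability, and `eval_metric="auc"` scores how well those probabilities rank churners above non-churners: AUC is the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes that definition directly in pure Python. It is not the XGBoost implementation, and the function name `auc` and the sample data are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auc(labels, scores))
```

Here five of the six positive/negative pairs are ranked correctly, so the result is 5/6. A value of 0.5 means the ranking is no better than chance.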

Step 5: Train

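from sagemaker.inputs import TrainingInput

train_input = TrainingInput(train_path, content_type="text/csv")
# The held-out test split doubles as the validation channel in this tutorial
val_input   = TrainingInput(test_path,  content_type="text/csv")

# Launches the training job and streams its logs until it finishes
estimator.fit({"train": train_input, "validation": val_input})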

Calling fit() launches a training job, during which SageMaker:

  1. Spins up an ml.m5.xlarge training instance
  2. Pulls the XGBoost container
  3. Downloads data from S3
  4. Runs training, logging metrics every round
  5. Saves the model artifact to S3
  6. Terminates the training instance

You're billed only for the time the training instance is running, not for idle capacity before or after the job.
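To put a number on that, the cost of a job is roughly its billable seconds times the instance's hourly rate times the instance count. The `DescribeTrainingJob` API reports the billable time as `BillableTimeInSeconds`. The sketch below is plain arithmetic. The helper name `training_cost` is made up, and the hourly rate is an assumed placeholder, so check the current SageMaker pricing page for your region.

```python
def training_cost(billable_seconds: int, hourly_rate_usd: float, instance_count: int = 1) -> float:
    """Estimate training cost from billable time and an hourly instance rate."""
    if billable_seconds < 0 or hourly_rate_usd < 0 or instance_count < 1:
        raise ValueError("inputs must be non-negative, with at least one instance")
    return billable_seconds / 3600 * hourly_rate_usd * instance_count


# Hypothetical numbers: a 4-minute job at an assumed $0.23 per instance-hour
cost = training_cost(240, 0.23)
print(f"${cost:.4f}")
```

For a small tabular dataset like this one, training typically costs far less than leaving the notebook instance or an endpoint running.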

Step 6: Deploy as a Real-Time Endpoint

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

This creates a persistent HTTPS endpoint. To run inference against it:

from sagemaker.serializers import CSVSerializer

# Serialize requests as text/csv; the response comes back as raw bytes
predictor.serializer = CSVSerializer()

# Send one feature row (same column order as training, without the label)
result = predictor.predict("45,1,200,3,1,2500,0,1")
print(result)  # e.g., b'0.73' (probability of churn)
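Applications outside the notebook usually call the endpoint through the SageMaker Runtime `invoke_endpoint` API rather than the SDK predictor. The sketch below shows that call plus parsing the response into probabilities and a 0.5 decision threshold. The client is passed in, so a stand-in class (`FakeRuntime`, made up here, returning the post's example value 0.73) lets the code run without AWS credentials. The helper names and the threshold are illustrative assumptions, and the parser accepts both comma- and newline-separated scores because the exact response layout may vary by container version.

```python
import io


def parse_scores(body: bytes) -> list:
    """Parse a CSV response into floats (accepts commas or newlines)."""
    text = body.decode("utf-8")
    return [float(tok) for tok in text.replace("\n", ",").split(",") if tok.strip()]


def predict_churn(runtime_client, endpoint_name: str, rows: list, threshold: float = 0.5) -> list:
    """Send feature rows (no label column); return (probability, is_churn) pairs."""
    payload = "\n".join(rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    scores = parse_scores(response["Body"].read())
    return [(p, p >= threshold) for p in scores]


class FakeRuntime:
    """Offline stand-in for boto3.client("sagemaker-runtime")."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        n_rows = len(Body.splitlines())
        return {"Body": io.BytesIO(",".join(["0.73"] * n_rows).encode())}


results = predict_churn(FakeRuntime(), "my-endpoint", ["45,1,200,3,1,2500,0,1"])
print(results)
```

Against the real endpoint, pass `boto3.client("sagemaker-runtime")` as the client and `predictor.endpoint_name` as the endpoint name.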

Step 7: Clean Up

predictor.delete_endpoint()

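Endpoints are billed for every hour they run, even when idle, so always delete them after testing. Stop the notebook instance as well when you are done, since it also accrues charges while it is running.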
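To confirm nothing was left running, you can list the endpoints in the region after cleanup. This is a minimal sketch around the SageMaker `list_endpoints` API, following `NextToken` pagination. The client is passed in, and `FakeSageMaker` is a made-up stand-in with invented endpoint names so the example runs without AWS credentials.

```python
def list_endpoint_names(sm_client) -> list:
    """Return the names of all endpoints visible to the client, across pages."""
    names = []
    kwargs = {}
    while True:
        page = sm_client.list_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in page.get("Endpoints", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs = {"NextToken": token}


class FakeSageMaker:
    """Offline stand-in for boto3.client("sagemaker") that serves two pages."""

    def list_endpoints(self, NextToken=None):
        if NextToken is None:
            return {"Endpoints": [{"EndpointName": "demo-a"}], "NextToken": "page-2"}
        return {"Endpoints": [{"EndpointName": "demo-b"}]}


leftover = list_endpoint_names(FakeSageMaker())
print(leftover)
```

With a real account, pass `boto3.client("sagemaker")`. An empty list means no endpoints are still accruing charges in that region.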

SageMaker Architecture

Notebook Instance
      │
      ├── Training Job (ml.m5.xlarge — spun up and terminated automatically)
      │         │
      │         └── Model artifact → S3
      │
      └── Inference Endpoint (ml.m5.large — persistent until deleted)

Key Advantages Over DIY

Capability            DIY                   SageMaker
Infra setup           Manual EC2 + Docker   Fully managed
Distributed training  Complex               instance_count > 1
Model registry        Build yourself        Built-in
Monitoring            Custom                CloudWatch + Model Monitor
Autoscaling           Manual                Built-in with target tracking
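As an illustration of the autoscaling row, target tracking for an endpoint is configured through Application Auto Scaling: you register the endpoint variant as a scalable target, then attach a policy that tracks invocations per instance. The sketch below only builds the two request payloads. The helper name, endpoint name, capacity limits, and target value are assumptions for illustration, and `AllTraffic` is assumed to be the variant name the SDK assigns by default.

```python
def target_tracking_requests(endpoint_name, variant_name="AllTraffic",
                             min_instances=1, max_instances=4,
                             invocations_per_instance=100.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Average invocations per instance the policy tries to hold
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register, policy


# Hypothetical endpoint name
register_kwargs, policy_kwargs = target_tracking_requests("churn-endpoint")
print(register_kwargs["ResourceId"])
```

To apply the settings, create `boto3.client("application-autoscaling")` and call `register_scalable_target(**register_kwargs)` followed by `put_scaling_policy(**policy_kwargs)`.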

What's Next

  • SageMaker Pipelines — orchestrate multi-step ML workflows
  • SageMaker Model Monitor — detect data drift and model degradation
  • SageMaker Experiments — track hyperparameter experiments
  • SageMaker Studio — fully integrated web-based IDE for ML