Getting started with Amazon SageMaker: Building your first machine learning model

This article provides a guide on using Amazon SageMaker to build, train, and deploy a machine learning model for predicting house prices using the Ames Housing dataset. It covers the key features of SageMaker, data preprocessing steps, model training, and deployment, and demonstrates how to test the deployed model. The guide also includes important steps to clean up resources to avoid unnecessary costs.

Overview of Amazon SageMaker

Amazon SageMaker is a fully managed service provided by AWS (Amazon Web Services) that enables developers and data scientists to build, train, and deploy machine learning models at scale. It simplifies the machine learning workflow by offering a suite of tools and services designed to handle various stages of the machine learning lifecycle, from data preparation to model deployment and monitoring.

Key Features

Integrated Development Environment:

  • SageMaker Studio: An integrated development environment (IDE) for machine learning that provides a web-based interface to build, train, and deploy models. It offers a collaborative environment with support for notebooks, debugging, and monitoring.

Data Preparation:

  • Data Wrangler: Simplifies data preparation and feature engineering with a visual interface that integrates with various data sources.

  • Feature Store: A repository to store, share, and manage features for machine learning models, ensuring consistency and reusability across projects.

Model Building:

  • Built-in Algorithms: Provides a collection of pre-built machine learning algorithms optimized for performance and scalability.

  • Custom Algorithms: Supports bringing your own algorithms and frameworks, including TensorFlow, PyTorch, and Scikit-learn.

Model Training:

  • Managed Training: Automatically provisions and manages the underlying infrastructure for training machine learning models.

  • Distributed Training: Supports distributed training for large datasets and complex models, reducing training time.

  • Automatic Model Tuning: Also known as hyperparameter optimization, it helps find the best version of a model by automatically adjusting hyperparameters.

Model Deployment:

  • Real-time Inference: Deploy models as scalable, secure, and high-performance endpoints for real-time predictions.

  • Batch Transform: Allows for batch processing of large datasets for inference.

  • Multi-Model Endpoints: Supports deploying multiple models on a single endpoint, optimizing resource utilization.

Model Monitoring and Management:

  • Model Monitor: Automatically monitors deployed models for data drift and performance degradation, triggering alerts and actions when necessary.

  • Pipelines: Enables the creation and management of end-to-end machine learning workflows, from data preparation to deployment and monitoring.

Step-by-step guide

AWS Free Tier

First, let's talk about the AWS Free Tier for SageMaker. What interests us here are the "Studio notebooks, and notebook instances" and "Training" sections. Based on what the Free Tier offers, we will use ml.t2.medium as the notebook instance type and ml.m5.large as the training instance type (check the availability of these types in the region where you will provision your resources; I am currently using the eu-west-3 region).

AWS Free Tier for SageMaker

Storing datasets

First, create a basic S3 bucket that will be used to store our raw dataset, as well as the formatted training and test datasets later on.

For this example, I will use the "Ames Housing" dataset, a well-known dataset for predictive modeling in machine learning. It contains information about various houses in Ames, Iowa, including features such as the size of the house, the year it was built, the type of roof, and the sale price. The goal is to predict the sale price of a house based on these features.
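If you prefer to script this step, here is a minimal sketch using boto3. The bucket name and region are the ones I use; adapt them to your setup (bucket names must be globally unique):

import boto3

s3 = boto3.client('s3')

# Create the bucket (LocationConstraint is required outside us-east-1)
s3.create_bucket(
    Bucket='dhoang-sagemaker-datasets',
    CreateBucketConfiguration={'LocationConstraint': 'eu-west-3'}
)

# Upload the raw dataset
s3.upload_file('AmesHousing.csv', 'dhoang-sagemaker-datasets', 'AmesHousing.csv')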

IAM Role

Next, create an IAM role with the AmazonSageMakerFullAccess managed policy, plus permissions to read and write objects in the dataset S3 bucket.

IAM role policies
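If you want to create the role programmatically instead of through the console, a minimal boto3 sketch could look like this (the role and policy names are hypothetical; adapt the bucket ARN to your own bucket):

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets SageMaker assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='sagemaker-ames-housing-role',  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Managed policy for SageMaker itself
iam.attach_role_policy(
    RoleName='sagemaker-ames-housing-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

# Inline policy granting read/write on the dataset bucket
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::dhoang-sagemaker-datasets",
            "arn:aws:s3:::dhoang-sagemaker-datasets/*"
        ]
    }]
}

iam.put_role_policy(
    RoleName='sagemaker-ames-housing-role',
    PolicyName='dataset-bucket-access',  # hypothetical name
    PolicyDocument=json.dumps(bucket_policy)
)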

Notebook instance

Create a SageMaker Notebook instance with the following parameters (a scripted equivalent follows the screenshot below):

  • Type: ml.t2.medium
  • Attach the previously created IAM role

SageMaker Notebook instance configuration
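For reference, here is a hypothetical boto3 equivalent of these console steps (the instance name is made up, and role_arn should be the ARN of the role created above):

import boto3

sm = boto3.client('sagemaker')

sm.create_notebook_instance(
    NotebookInstanceName='ames-housing-notebook',  # hypothetical name
    InstanceType='ml.t2.medium',
    RoleArn=role_arn  # ARN of the IAM role created above
)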

Load and explore dataset

Once your instance appears as "InService", you can click on "Open Jupyter". (It can take a few minutes for the instance to become ready to use.)

Notebook instance InService state

This will open a new page with the Jupyter Notebook interface. Now create a new notebook of type conda_python3.

Dependencies and dataset loading

Add this code to the first cell and run it.

import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sklearn.model_selection import train_test_split

# Load Data from S3
s3 = boto3.client('s3')
bucket_name = 'dhoang-sagemaker-datasets'
file_key = 'AmesHousing.csv'
obj = s3.get_object(Bucket=bucket_name, Key=file_key)
df = pd.read_csv(obj['Body'])
df

This imports all the libraries we need to access the S3 bucket, pre-process the dataset, and train and deploy our model, then loads the raw dataset from S3. You should get a result like this:

Imported dataset

Pre-processing the dataset

Because linear-learner expects numeric values with no missing data, we need to pre-process the dataset. We also need to split the formatted dataset into a train part and a test part to validate the model. Note that, for CSV input, linear-learner expects the target column (here SalePrice) to come first, with no header row. A minimal version of this pre-processing looks like the following.

Add this code and run it:

# Pre-process Data: keep numeric columns only and drop columns with missing values
numeric_df = df.select_dtypes(include=[np.number]).dropna(axis=1)

# linear-learner expects the target as the first column for CSV input
cols = ['SalePrice'] + [c for c in numeric_df.columns if c != 'SalePrice']
numeric_df = numeric_df[cols]

# Split into train and test sets
train, test = train_test_split(numeric_df, test_size=0.2, random_state=42)

# Save the formatted datasets and upload them to S3 (no header, no index)
train.to_csv('train.csv', header=False, index=False)
test.to_csv('test.csv', header=False, index=False)
s3.upload_file('train.csv', bucket_name, 'train.csv')
s3.upload_file('test.csv', bucket_name, 'test.csv')

Train and deploy the model

Now this is the part where we explore SageMaker's training features. In our case, we use the linear-learner built-in algorithm from SageMaker, which allows us to run a linear regression on our dataset.

Next, we define the instance type that will be used for our SageMaker Training Job, here ml.m5.large to benefit from the Free Tier.

Then, once trained, the model is deployed as a SageMaker Endpoint so we can use it afterwards.

Add this code and run it :

# Train Model
role = get_execution_role()
sess = sagemaker.Session()
output_location = 's3://{}/output'.format(bucket_name)

# Retrieve the container image of the linear-learner built-in algorithm
container = sagemaker.image_uris.retrieve("linear-learner", sess.boto_region_name, "1.0-1")

# Configure the training job: one ml.m5.large instance, model artifacts written to S3
linear = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=output_location,
    sagemaker_session=sess
)

# linear-learner in 'regressor' mode performs a linear regression
linear.set_hyperparameters(
    predictor_type='regressor',
    mini_batch_size=100
)

train_data = 's3://{}/train.csv'.format(bucket_name)
test_data = 's3://{}/test.csv'.format(bucket_name)

# Point the 'train' and 'validation' channels at the CSV files uploaded earlier
data_channels = {
    'train': sagemaker.inputs.TrainingInput(train_data, content_type='text/csv'),
    'validation': sagemaker.inputs.TrainingInput(test_data, content_type='text/csv')
}

# Launch the training job (this provisions the instance, trains, then tears it down)
linear.fit(inputs=data_channels)

# Deploy Model: create a real-time endpoint backed by one ml.t2.medium instance
predictor = linear.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

It should take a few minutes to execute. During this time, you can go to the console to see the Training Job status:

Training job status in console

with the logs in your notebook:

Log of training

and, once the job has finished, the model endpoint deployment:

Endpoint creation

Wait until the Endpoint status appears as InService.
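If you prefer checking from code rather than the console, here is a small sketch (assuming the previous cells ran in the same notebook):

# Check the endpoint status without leaving the notebook
sm = boto3.client('sagemaker')
status = sm.describe_endpoint(EndpointName=predictor.endpoint_name)['EndpointStatus']
print(status)  # 'Creating' at first, then 'InService'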

Test the model

The following code uses the freshly created endpoint to predict SalePrice on the test dataset. I will run it in the notebook, but you can actually run it from anywhere, as long as you have access to your endpoint:

# Test Model
test_data_no_target = test.drop(columns=['SalePrice'])

# Ensure all data is numeric
assert test_data_no_target.applymap(np.isreal).all().all(), "Test data contains non-numeric values"

# Convert test data to CSV string
csv_input = test_data_no_target.to_csv(header=False, index=False).strip()

# Initialize the predictor with correct serializers
predictor = sagemaker.predictor.Predictor(
    endpoint_name=predictor.endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)

# Make predictions
predictions = predictor.predict(csv_input)
print(predictions)

You should get back a set of predictions. Here, prediction accuracy is not the point; the goal is simply to give you a working guide to SageMaker.

Model result on test dataset
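If you do want a rough quality check anyway, here is a minimal sketch comparing the predicted scores with the actual SalePrice values (it assumes linear-learner's usual JSON output shape, a list of objects with a score field):

# Quick evaluation: root mean squared error against the actual sale prices
scores = np.array([p['score'] for p in predictions['predictions']])
actuals = test['SalePrice'].values
rmse = np.sqrt(np.mean((scores - actuals) ** 2))
print('RMSE: {:.2f}'.format(rmse))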

Cleaning

To avoid unwanted costs and keep your account clean, do not forget to delete the following resources (a scripted sketch follows the list):

  • SageMaker Endpoint
  • SageMaker Notebook instance
  • S3 bucket
  • IAM role
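The endpoint and the bucket can be deleted from the notebook itself; this is a minimal sketch assuming the variables from the previous cells are still defined (the notebook instance and IAM role are easiest to delete from the console):

# Delete the endpoint (stops the ml.t2.medium instance behind it)
predictor.delete_endpoint()

# Empty, then delete, the dataset bucket
bucket = boto3.resource('s3').Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()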

Thanks for reading! I hope this helped you understand how to train a model with Amazon SageMaker, from a raw dataset to a ready-to-use endpoint. Don't hesitate to give me your feedback or suggestions.

