This content originally appeared on DEV Community and was authored by almamon rasool abdali
Welcome back.
In the previous article, we got a general overview of MLOps.
Today we start our MLOps implementation.
The first thing to tackle is visibility. Some of you may think that visibility (monitoring) belongs at the end of the deployment.
To me, visibility covers monitoring, tracking, collaboration within the team, and insight into the data journey from the beginning to the end of the pipeline.
We need continuous visibility over the following things:
- visibility over code
- visibility over data
- visibility over the model training process and all ongoing experiments
- visibility over inference and feedback
- visibility over security-related activities
Now, let's check them one by one
1. visibility over code changes
For ordinary software developers this is not an issue, but for someone managing a team of data scientists and ML researchers it can be a headache.
Such teams mostly work in notebooks, and they often develop bad coding habits that hurt version control, code change tracking, CI/CD, and many other things.
Many tools try to solve these problems, but the notebook itself is not the cause; the team's bad coding habits are.
All of these problems can be solved if you get your team to write good code. To me, good code must fulfill at least three main points: modularity, high cohesion, and loose coupling.
So, basically, we use notebooks only for importing and calling our classes and methods, and we separate each script by the nature of its work: the pre-processing script has to be fully functional without the training code, and vice versa. To make the work more scalable, we containerize each script so we can run it on a cluster.
Now, what if the environment you use helped you and the team do all of the above?
Following the recommended way of running scripts on SageMaker, you separate each phase into its own code (preprocessing code, training code, inference code), and each phase is containerized and run separately. The notebook in SageMaker is used for calling functions, while the heavy coding lives inside the scripts that are shipped in the container of each stage.
Let's take an example to get into the SageMaker mentality.
We start by shipping a pre-processing script inside a pre-built AWS container for scikit-learn to do the preprocessing:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
# get the region and execution role
role = get_execution_role()
region = boto3.session.Session().region_name
# set the machine type and number of machines
sk_proc = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=2,
)
# SageMaker will copy the data from the S3 location to /opt/ml/processing/input
# your script will read the data from /opt/ml/processing/input
# SageMaker expects your script to write the preprocessed data
# into /opt/ml/processing/train and /opt/ml/processing/test
# we also add a command-line argument called --train-test-split-ratio to control the splitting ratio
# input_data must hold the S3 URI of the raw dataset (it is not defined in this snippet)
# run the processing job
sk_proc.run(
    code="preproc.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
# get information about the processing job we just ran
preproc_job_info = sk_proc.jobs[-1].describe()
# read the output configuration to get the final S3 URI of the train and test data
out_cfg = preproc_job_info["ProcessingOutputConfig"]
for output in out_cfg["Outputs"]:
    if output["OutputName"] == "train_data":
        train_preprco_s3 = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        test_preprco_s3 = output["S3Output"]["S3Uri"]
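To make the job above more concrete, here is a minimal sketch of what a script like preproc.py could look like. The original script is not shown in this article, so most of this is an assumption: the CSV input, the helper names (split_rows, write_csv, preprocess), the output file names train.csv and test.csv, the extra --input-dir, --train-dir and --test-dir arguments, and the use of only the Python standard library (a real script would more likely use pandas and scikit-learn). Only the /opt/ml/processing paths and the --train-test-split-ratio argument come from the job definition above.

```python
import argparse
import csv
import os
import random


def split_rows(rows, test_ratio, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, header, rows):
    """Write a header and rows to path, creating the folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def preprocess(input_dir, train_dir, test_dir, test_ratio):
    """Read every CSV in input_dir, split the rows, write train.csv and test.csv."""
    header, rows = None, []
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(input_dir, name), newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            header = header or file_header
            rows.extend(reader)
    if header is None:
        raise FileNotFoundError(f"no CSV data found in {input_dir}")
    train, test = split_rows(rows, test_ratio)
    write_csv(os.path.join(train_dir, "train.csv"), header, train)
    write_csv(os.path.join(test_dir, "test.csv"), header, test)
    return len(train), len(test)


def main(argv=None):
    """Parse the arguments passed by the processing job and run the split."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    # the defaults below are the container paths used in the job definition
    parser.add_argument("--input-dir", default="/opt/ml/processing/input")
    parser.add_argument("--train-dir", default="/opt/ml/processing/train")
    parser.add_argument("--test-dir", default="/opt/ml/processing/test")
    args = parser.parse_args(argv)
    return preprocess(
        args.input_dir, args.train_dir, args.test_dir, args.train_test_split_ratio
    )
```

In the real script you would end the file with a call to main() under the usual `if __name__ == "__main__":` guard. Because the paths are plain arguments, the same script can be tried locally on a temporary folder before it is shipped to the container.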
As you can see, we just provide our script (a script is easier to track than a notebook) and SageMaker ships it in a container (containerizing our code makes it more portable, scalable, and reusable). If we want to train a model on the output, that has to happen in a different container. Let's see an example:
from sagemaker.sklearn.estimator import SKLearn
# send our script to the scikit-learn container provided by AWS
sklearn_model = SKLearn(
    entry_point="train.py",
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    role=role,
)
# SageMaker will copy the data for you from S3 into /opt/ml/input/data/train
# your script must write the final model to /opt/ml/model so SageMaker can copy it to S3
sklearn_model.fit({"train": train_preprco_s3})
# get the job info
training_job_info = sklearn_model.jobs[-1].describe()
# build the S3 URI of the final model artifact
model_data_s3_uri = "{}{}/{}".format(
    training_job_info["OutputDataConfig"]["S3OutputPath"],
    training_job_info["TrainingJobName"],
    "output/model.tar.gz",
)
Once the work is organized as above, the code can be part of any normal CI/CD pipeline, and the team can work together and collaborate following any normal software lifecycle.
Let's move to the next section, on data.
2. visibility over data
Here I want to cover three things:
- collaborating on features created by team members
- versioning the data or features
- monitoring data quality and detecting drifts
We solve the first two with a feature store (Amazon SageMaker Feature Store),
and the third by monitoring statistical information about the data; here we will use Amazon SageMaker Model Monitor (Monitor Data Quality).
So let's start exploring them one by one.
feature store
If you work in a team and have finished preprocessing the data and have features ready for modeling, you may ask how to share features across the team, how to reuse them across different projects, and how to make them fast to reach and fast to query without redoing the work.
A feature store helps you create, share, and manage features, and it works as a single source of truth to store, retrieve, remove, track, share, discover, and control access to features.
Before we start working with Amazon SageMaker Feature Store, we need to understand a few concepts:
Feature group – main Feature Store resource that contains the metadata for all the data stored in Amazon SageMaker Feature Store.
Feature definition – the schema definition for that data; for example, a feature named prices is a float and a feature named age is an integer.
Record identifier name – Each feature group is defined with a record identifier name. The record identifier name must refer to one of the names of a feature defined in the feature group's feature definitions.
Record – A Record is a collection of values for features for a single record identifier value. The combination of a record identifier value and an event time uniquely identifies a record within a feature group.
Event time – a point in time when a new event occurs that corresponds to the creation or update of a record in a feature group.
Online Store – the low latency, high availability cache for a feature group that enables real-time lookup of records.
Offline store – stores historical data in your S3 bucket. It is used when low (sub-second) latency reads are not needed.
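To make these concepts concrete, here is a toy in-memory model of a feature group. It is not the SageMaker Feature Store API: the class ToyFeatureGroup, its methods, and the example feature names are all made up for illustration. It only sketches how the feature definitions, the record identifier, the event time, the online store (latest record per identifier), and the offline store (full history) relate to each other.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureGroup:
    feature_definitions: dict      # feature name -> Python type (the schema)
    record_identifier_name: str    # must be one of the defined features
    event_time_feature_name: str   # must be one of the defined features
    online_store: dict = field(default_factory=dict)   # latest record per identifier
    offline_store: list = field(default_factory=list)  # every record ever written

    def __post_init__(self):
        for name in (self.record_identifier_name, self.event_time_feature_name):
            if name not in self.feature_definitions:
                raise ValueError(f"{name!r} is not a defined feature")

    def put_record(self, record):
        """Validate a record against the schema and store it."""
        if set(record) != set(self.feature_definitions):
            raise ValueError("record does not match the feature definitions")
        for name, expected_type in self.feature_definitions.items():
            if not isinstance(record[name], expected_type):
                raise TypeError(f"feature {name!r} must be {expected_type.__name__}")
        # the offline store keeps the whole history
        self.offline_store.append(dict(record))
        # the online store keeps only the most recent record per identifier
        key = record[self.record_identifier_name]
        current = self.online_store.get(key)
        event_time = self.event_time_feature_name
        if current is None or record[event_time] >= current[event_time]:
            self.online_store[key] = dict(record)

    def get_record(self, identifier):
        """Low-latency style lookup: return the latest record or None."""
        return self.online_store.get(identifier)


group = ToyFeatureGroup(
    feature_definitions={"customer_id": int, "age": int, "prices": float, "event_time": float},
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
)
group.put_record({"customer_id": 1, "age": 30, "prices": 9.5, "event_time": 1.0})
group.put_record({"customer_id": 1, "age": 31, "prices": 12.0, "event_time": 2.0})
print(group.get_record(1))       # the record with event_time 2.0
print(len(group.offline_store))  # both records are kept as history
```

The design point to take away is the split: the online side answers "what is the current value for this identifier?", while the offline side keeps every event so that training sets can be rebuilt later.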
Now let's see how to work with Feature Store in AWS. This video shows the main idea of using Feature Store after preprocessing with AWS Data Wrangler: it follows the data from raw form, through analysis and preprocessing in Data Wrangler, to creating a feature group from the data flow pipeline.
Now let's see how we can deal with data drift.
But first, let's understand what drift is.
Let's first ask ourselves a logical question: if the deployed model is static, with all its code and artifacts unchanged, what makes things break, and why does model accuracy degrade over time?
- In any system, the input always needs to be checked and validated. In ML, the input is the data, and it must be checked for drift and for security issues. So what can happen to the data that stops things from working as they should? Data drift happens when the distribution of the data changes: a change in clothing trends and fashion may affect your clothes recommender system; changes in a country's economy and salaries will affect house price ranges; or you may have a CCTV system where some cameras have a problem and send a damaged stream, or a new type of camera with different video formats or different output ranges.
To be more precise, we have:
Concept drift is a type of model drift where the relationship (the mapping) between x and y changes. An example is an ML-based WAF: new attacks emerge that the previous patterns can no longer detect, so what the model knows as an attack has changed.
Data drift is a type of drift where the data distribution changes while the relation of x to y is still valid. Something shifts the distribution, such as a natural change in temperature, new clothing trends, or changes in customer preferences.
Upstream data changes refer to changes in the data pipeline, such as a CCTV system where some cameras have a problem and send damaged streams.
So how do we detect these drifts?
Not all drifts can be detected automatically, and many need a human in the loop.
But generally, it is all about capturing the decay in model performance, if we can.
So, if possible, we compare the model's predictions against ground truth and track the accuracy.
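As a minimal sketch of this idea, the functions below compare the accuracy on a recent window of labeled predictions with a baseline accuracy measured at deployment time. The function names, the list of (prediction, ground truth) pairs, and the 0.05 tolerance are assumptions chosen for illustration, not values taken from a specific tool.

```python
def accuracy(pairs):
    """Fraction of (prediction, ground_truth) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one labeled prediction")
    return sum(1 for pred, truth in pairs if pred == truth) / len(pairs)


def performance_decayed(baseline_accuracy, recent_pairs, tolerance=0.05):
    """Return True when recent accuracy is more than `tolerance` below the baseline."""
    return accuracy(recent_pairs) < baseline_accuracy - tolerance


# hypothetical window of recent predictions with their ground-truth labels
recent = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 0),
          (0, 1), (1, 1), (0, 0), (1, 0), (0, 0)]

print(accuracy(recent))                   # 6 of the 10 pairs match
print(performance_decayed(0.90, recent))  # compared with a 0.90 baseline
```

In practice the ground truth usually arrives late (after a human review or a business outcome), so a check like this runs on a delay and works best together with the label-free methods.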
For tasks where ground truth is not available, there are other common methods:
Kolmogorov-Smirnov method: we compare the cumulative distributions of two datasets (for example, the training data and recent production data); if the two distributions differ by more than sampling noise would explain, we have data drift.
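Here is a small sketch of the two-sample Kolmogorov-Smirnov check for a single numeric feature, written with only the standard library. The function names and the example data are assumptions for illustration. The decision rule uses the usual large-sample approximation of the critical value, so treat it as a rough guide; a statistics library such as SciPy offers a more careful implementation.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # advance both pointers past every value <= x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        # i/len(a) and j/len(b) are the two empirical CDFs evaluated at x
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_drift_detected(sample_a, sample_b, alpha=0.05):
    """Flag drift when the KS statistic exceeds the approximate critical value."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


baseline = list(range(100))              # hypothetical training-time feature values
shifted = [x + 50 for x in range(100)]   # the same feature after a shift

print(ks_drift_detected(baseline, baseline))  # identical samples
print(ks_drift_detected(baseline, shifted))   # shifted sample
```

The check is run per feature, so a dataset with many features needs one comparison per column.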
for more refer to https://www.sciencedirect.com/topics/engineering/kolmogorov-smirnov
population stability index (PSI): it measures how much a variable has shifted over time.
The usual rule-of-thumb thresholds are:
PSI < 0.10 means a “little change”.
0.10 ≤ PSI < 0.25 means a “moderate change”.
PSI ≥ 0.25 means a “significant change, action required”.
for more refer to https://www.risk.net/journal-of-risk-model-validation/7725371/statistical-properties-of-the-population-stability-index
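The PSI calculation can be sketched in a few lines. Both samples are binned, and the index is the sum over bins of (actual share − expected share) × ln(actual share / expected share). The equal-width bins taken from the baseline range, the number of bins, the small eps used to avoid dividing by zero, and the function names are assumptions of this sketch; how a value sitting exactly on a threshold is labeled is also a choice made here.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` relative to the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("the baseline must contain more than one distinct value")
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # replace empty bins with eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def psi_label(value):
    """Map a PSI value to the rule-of-thumb labels used above."""
    if value < 0.10:
        return "little change"
    if value < 0.25:
        return "moderate change"
    return "significant change, action required"


baseline = list(range(100))              # hypothetical training-time feature values
shifted = [x + 50 for x in range(100)]   # the same feature after a shift

print(psi_label(psi(baseline, baseline)))
print(psi_label(psi(baseline, shifted)))
```

The result depends on the binning, so the same bin edges should be kept for every comparison against a given baseline.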
Now let's get back to Amazon SageMaker Model Monitor and how it can help us here.
It can help us monitor drift in data quality, drift in model quality metrics, bias in the model's predictions, and drift in feature attribution.
Let's check data quality as an example.
The idea is that we create a baseline from our data, and SageMaker compares new data against it, checking rules that help detect drift.
The steps needed are as follows.
First, you must enable data capture for your model when it is deployed for inference:
from sagemaker.model_monitor import DataCaptureConfig
# set the configuration
# s3_capture_path must hold the S3 URI where captured requests and responses are stored
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_path,
)
# add the config to your model deployment
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name="endpoint-name",  # placeholder; endpoint names cannot contain spaces
    data_capture_config=capture_config,
)
Next, we must create a baseline from the main data so that we have baseline statistics and can tell when new data departs from them.
An example of creating the baseline:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat
data_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
data_monitor.suggest_baseline(
    baseline_dataset=baseline_maindata_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_result,
    wait=True,
)
For more, please check out https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html
We have now reached the end of this part. The next part will cover the remaining items in the visibility section. See you next time.
almamon rasool abdali | Sciencx (2022-01-03T19:55:47+00:00) MLOps journey with AWS – part 2 (Visibility is job zero). Retrieved from https://www.scien.cx/2022/01/03/mlops-journey-with-aws-part-2-visibility-is-job-zero/