January 1, 2022

Using AWS SageMaker to Set Up a Production ML Pipeline: Part 2

By Christina Bernard

As we continue this series, we'll look at how AWS can be used to set up production pipelines, using only AWS for the setup. No fancy DevOps tools will be helping us through this struggle.

There are a few things that you need to know about Amazon: it has multiple ways to implement everything you want to do, and that is no exception when it comes to creating ML pipelines. There are two ways of doing this: with SageMaker Pipelines, or directly with CodePipeline, CodeBuild, and CodeDeploy.

What are AWS SageMaker Projects?

For this article, we’ll be focusing on AWS SageMaker Projects. Yes, you read that right. See, SageMaker Projects will spin up the other components, such as CodePipeline, CodeBuild, and CodeCommit, and create that SageMaker pipeline for you. We’ll focus on the overall architecture of the pipeline and where to find each element.

If you don’t know what Amazon SageMaker is, here’s a short overview of it. SageMaker is a service on AWS that allows you to launch Data Science projects in the cloud or even launch a full CI/CD pipeline for a Machine Learning project.

How to Set Up a Production ML Pipeline in AWS SageMaker

Within SageMaker Studio, you can load a SageMaker template. Your organization can also customize these templates and load them here for use. 
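
If you'd rather script this step than click through Studio, the same template can be provisioned through the SageMaker API. Here's a minimal sketch using boto3's create_project call; the project name, description, and the Service Catalog ProductId and ProvisioningArtifactId (which identify the template your organization exposes) are all placeholders you'd look up in your own account.

```python
import boto3

# SageMaker control-plane client (assumes AWS credentials and region are already configured)
sm = boto3.client("sagemaker")

# Placeholder values -- look up the ProductId and ProvisioningArtifactId of the
# template you want in AWS Service Catalog before running this.
response = sm.create_project(
    ProjectName="demo-mlops-project",
    ProjectDescription="Example project spun up from a SageMaker template",
    ServiceCatalogProvisioningDetails={
        "ProductId": "prod-xxxxxxxxxxxxx",
        "ProvisioningArtifactId": "pa-xxxxxxxxxxxxx",
    },
)

print(response["ProjectArn"])  # ARN of the newly created project
```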

After you’ve created the project, these artifacts will spin up (a quick verification sketch follows the list):

  • Code repository: AWS CodeCommit
  • CI/CD workflow automation: AWS CodePipeline
  • Artifact storage: Amazon S3 bucket
  • Logs: Amazon CloudWatch
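
If you want to confirm from code that the project actually finished provisioning, a describe_project call reports the status; this is just a sketch, and the project name is the same hypothetical one used above.

```python
import boto3

sm = boto3.client("sagemaker")

# "demo-mlops-project" is a placeholder -- use whatever name you gave your project.
project = sm.describe_project(ProjectName="demo-mlops-project")

# A status of CreateCompleted means the template has finished provisioning the
# CodeCommit repos, CodePipeline, S3 bucket, and CloudWatch log groups listed above.
print(project["ProjectStatus"])
print(project["ProjectId"])
```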

Let’s go take a look at where the magic happens. Once the project setup runs, you’ll see that two repositories have been created: one for building and one for deploying. AWS creates these example repositories in AWS CodeCommit.

How to Verify That All Components Have Been Set Up

Clicking to the next tab, you’ll see that only one pipeline was created. There are no experiments, model groups, or endpoints set up for this example.

The pipeline is all set up now. We can click on it and see much more information about it.

AWS also provides us with an overview of every step in the pipeline. We’ll go over how to create steps as part of a pipeline in the next article.

Lastly, we can see the processor type being used on this pipeline by default. AWS SageMaker runs on an ml.m5.xlarge instance; if you need more power, you can switch to a larger instance type in one of the steps of the pipeline. Also, we can see where our input data is coming from! Yes, that input URI shows where the data is dumped to begin your pipeline. This lets you run data cleaning steps outside of the pipeline and then drop the result into an S3 bucket. That can actually trigger the pipeline to run, but that’s a different article, so stay tuned for the good part.
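
As a rough illustration of those two knobs, the stock template exposes pipeline parameters for the processing instance type and the input data location, so you can override them when you kick off a run. The parameter names (ProcessingInstanceType, InputDataUrl), the pipeline name, and the S3 URI below are assumptions based on the default template; adjust them to match whatever your pipeline actually defines.

```python
import boto3

sm = boto3.client("sagemaker")

# Start a run with a bigger processing instance and a specific input location.
# The pipeline name, parameter names, and S3 URI are placeholders for this sketch.
execution = sm.start_pipeline_execution(
    PipelineName="demo-mlops-project-pipeline",
    PipelineParameters=[
        {"Name": "ProcessingInstanceType", "Value": "ml.m5.2xlarge"},
        {"Name": "InputDataUrl", "Value": "s3://my-bucket/raw/data.csv"},
    ],
)

print(execution["PipelineExecutionArn"])  # handle for tracking this run
```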

The pipeline ARN and role ARN are important for the part you are staying tuned for, so take note and let’s keep moving forward.
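
If you’d rather grab those ARNs programmatically than copy them out of the console, a describe_pipeline call returns both; the pipeline name is again a placeholder.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder name -- projects typically name the pipeline after the project itself.
pipeline = sm.describe_pipeline(PipelineName="demo-mlops-project-pipeline")

print(pipeline["PipelineArn"])  # the pipeline ARN to note down
print(pipeline["RoleArn"])      # the execution role ARN to note down
```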

How to Check Where your Repository is Being Stored

AWS CodeCommit is essentially an AWS version of GitHub. You can use your own GitHub if you want, but if you want to use all AWS services, then here you go. By starting a project, you have created repositories for model build and model deploy.
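
To see those two repositories from code and grab their clone URLs, the CodeCommit API works fine; this is a sketch, and the project name used for filtering is a placeholder, since the generated repository names include your project’s name and ID.

```python
import boto3

cc = boto3.client("codecommit")

# List every repository in this account/region and keep the ones created by our project.
# "demo-mlops-project" is a placeholder for your actual project name.
repos = cc.list_repositories()["repositories"]
project_repos = [r for r in repos if "demo-mlops-project" in r["repositoryName"]]

for repo in project_repos:
    meta = cc.get_repository(repositoryName=repo["repositoryName"])["repositoryMetadata"]
    # cloneUrlHttp is what you'd pass to `git clone` (with the CodeCommit credential helper).
    print(meta["repositoryName"], meta["cloneUrlHttp"])
```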

All of the files graciously spun up for you by AWS magic bots can be seen below. We’ll talk more about how they work in another part, but AWS does provide a decent README file to get you started.

How to Check the Build in AWS CodeBuild

This takes us into the territory of CodeBuild. The build did fail, which we saw in AWS SageMaker Studio.

For each build, we can check the history and details of what happened to it. This is great for investigating why it failed or which trigger misfired. All of the logs are there.
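
Here’s a minimal sketch of pulling that build history through the CodeBuild API instead; the CodeBuild project name is hypothetical, since the real one is generated from your project name.

```python
import boto3

cb = boto3.client("codebuild")

# Placeholder CodeBuild project name -- copy the real one from the CodeBuild console.
build_ids = cb.list_builds_for_project(projectName="demo-mlops-project-modelbuild")["ids"]

# Fetch details for up to five of the returned builds and print their status
# plus any phase that failed.
if build_ids:
    for build in cb.batch_get_builds(ids=build_ids[:5])["builds"]:
        print(build["id"], build["buildStatus"])
        for phase in build.get("phases", []):
            if phase.get("phaseStatus") == "FAILED":
                print("  failed phase:", phase["phaseType"])
```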

How to Check the Log Files in AWS CloudWatch

Just in case you don’t like that view, don’t you worry: AWS has you covered with the logs in AWS CloudWatch too. Click on Log groups.

Choose which build you want to take a look at. 

Then you can read the logs all day long contemplating why things failed over your morning coffee. 
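
And if neither console view suits you, the same log events can be pulled with boto3. The log group name below is a placeholder that follows CodeBuild’s usual /aws/codebuild/<project-name> convention; use whichever group you picked above.

```python
import boto3

logs = boto3.client("logs")

# CodeBuild writes to /aws/codebuild/<codebuild-project-name> by default.
# This name is a placeholder -- use the log group you chose under "Log groups".
log_group = "/aws/codebuild/demo-mlops-project-modelbuild"

# Pull the first 50 events from the group and print them.
events = logs.filter_log_events(logGroupName=log_group, limit=50)
for event in events["events"]:
    print(event["message"], end="")
```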

How to Find your Model Artifacts in AWS S3 Buckets

And last but not least, where do the model artifacts hang their hats? Well, if you mosey on over to Amazon S3, you’ll find a bucket named after your pipeline. If you click on it, you’ll be able to see the default structure.

You’ll see several objects made by the pipeline, and if you click on one, you’ll see an input folder that will take you all the way down to a file that was used at a certain point in the pipeline.
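
For completeness, here’s a quick sketch of browsing that bucket with boto3. The bucket name is a placeholder; projects typically create one prefixed with sagemaker-project- followed by the project ID, so use the bucket you actually see in the S3 console.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name -- substitute the bucket created for your pipeline.
bucket = "sagemaker-project-p-xxxxxxxxxxxx"

# Walk the bucket and print every key to see the default folder structure,
# including the input data and model artifact prefixes the pipeline writes to.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        print(obj["Key"])
```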

Putting it All Together

And there you go (in a My Big Fat Greek Wedding voice), you now know where your pipeline is and how to access it.

AWS SageMaker Projects spins up a full CI/CD pipeline for your project. This implementation allows for faster experimentation, validation, and security. If you take away anything from this article, remember that AWS stores your model artifacts in an S3 bucket it generates unless otherwise specified. AWS CodeCommit is the default Git repository, but GitHub and Bitbucket can be used as well.

Next time, we’ll dive more into the basic code structure so you can understand how it works.

Have more AWS SageMaker or Machine Learning questions? Our team of Machine Learning experts is here to help!
