How SageMaker Autopilot is bringing Machine Learning closer to everyone

Data scientists are often expensive to hire, so it frequently falls to regular developers to manage and tune the machine learning models that enterprises require. That is why AWS introduced SageMaker Autopilot: to get more people using its technology to build their next-generation ML-based applications.

Training and deploying a complex machine learning model involves multiple stages, each demanding a unique skill set. An enterprise data science team usually consists of data engineers, data scientists, ML developers, researchers, business analysts and DevOps professionals, who together manage the workflow required to operationalize AI for the business.

A typical end-to-end machine learning pipeline has the steps shown in the diagram below:

Each step is handled by an individual or a team specializing in that task. Data engineers deal with the acquisition and ingestion of data from different sources. Data scientists and business analysts focus on aggregating the data sources, developing the dataset and extracting the right features from the final dataset. Machine Learning researchers determine the set of models best aligned with the expected business outcome. DevOps engineers provision the infrastructure for training and deploying the models. Machine Learning researchers then fine-tune the model using a technique called hyperparameter tuning. Lastly, developers write code that integrates the trained and deployed models with business applications to perform classification and prediction.

AutoML simplifies the workflow by automating the stages of the pipeline. Any developer can point an AutoML platform at a dataset and expect a fully trained model to become available, often within minutes for small datasets. This fundamentally changes the process of training ML models.

Amazon SageMaker is a fully managed service that covers the entire machine learning process, including data labelling and data preparation, algorithm selection, model training and optimization for deployment. This makes it possible to deploy models to a production environment faster and with much less effort.

SageMaker Autopilot is a new product in the SageMaker suite that makes training machine learning models simpler while still giving you the opportunity to examine the data and try different algorithms. It is an AutoML platform, but a little different from its competitors. The key differentiator of SageMaker Autopilot is the auto-generation of notebooks as part of the workflow. Developers can create an Autopilot task that only generates the notebooks instead of running the entire process. These notebooks are standard, open-source Jupyter notebooks. Developers can download them to investigate how the data was prepared and which algorithm was used within the pipeline built for each candidate.

The first Jupyter notebook produced by SageMaker Autopilot contains a comprehensive analysis of the input dataset, including the detection of missing values and an analysis of the categorical columns that need encoding. The second notebook records the combination of feature engineering techniques and algorithms selected for each candidate.
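To give a feel for the kind of checks the first notebook describes, here is a minimal sketch in plain Python. It is not the code Autopilot generates; the function name `profile_columns` and the sample data are made up for illustration, and the rule "a column with any non-numeric value is categorical" is a simplifying assumption.

```python
import csv
import io


def profile_columns(csv_text):
    """Count missing values per column and flag columns that look categorical.

    Hypothetical helper: a column is treated as categorical as soon as one
    non-empty value cannot be parsed as a number.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    profile = {name: {"missing": 0, "categorical": False}
               for name in reader.fieldnames}
    for row in reader:
        for name in profile:
            value = (row.get(name) or "").strip()
            if value == "":
                # Empty cell: count it as a missing value.
                profile[name]["missing"] += 1
                continue
            try:
                float(value)
            except ValueError:
                # Non-numeric text: this column would need encoding.
                profile[name]["categorical"] = True
    return profile


# Made-up sample data, only for illustration.
SAMPLE = """price,rooms,colour
250000,3,red
,2,blue
310000,,red
"""

report = profile_columns(SAMPLE)
for column, stats in report.items():
    print(column, stats)
```

A real analysis would also look at things like value distributions and cardinality, but the idea is the same: scan the columns before deciding how each one has to be prepared.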

SageMaker Autopilot enables you to provide a tabular dataset in a CSV format and select the target column to predict.

Currently, it supports three problem types:

• Prediction of a number (regression) such as a house price

• Prediction of a category (binary classification) such as true/false

• Prediction from multiple categories (multiclass classification) such as colour or shape
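As a rough sketch of how such a job could be described in code, the snippet below builds a request for the `create_auto_ml_job` call of the boto3 SageMaker client. The helper names (`build_autopilot_request`, `submit`), the short keys in `PROBLEM_TYPES`, the bucket paths, the role ARN and the choice of objective metric per problem type are all assumptions for illustration; only the request field names follow the SageMaker API.

```python
# Map the three options above to the problem type names used by the API.
# The dictionary keys are hypothetical shorthand, not part of any AWS API.
PROBLEM_TYPES = {
    "number": "Regression",
    "true_false": "BinaryClassification",
    "category": "MulticlassClassification",
}

# One objective metric per problem type (an illustrative choice).
OBJECTIVE_METRICS = {
    "Regression": "MSE",
    "BinaryClassification": "F1",
    "MulticlassClassification": "Accuracy",
}


def build_autopilot_request(job_name, input_s3_uri, output_s3_uri,
                            target_column, role_arn, kind,
                            max_candidates=50, notebooks_only=False):
    """Build the keyword arguments for create_auto_ml_job (no AWS call)."""
    problem_type = PROBLEM_TYPES[kind]
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                            "S3Uri": input_s3_uri}},
            # The column Autopilot should learn to predict.
            "TargetAttributeName": target_column,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": problem_type,
        "AutoMLJobObjective": {"MetricName": OBJECTIVE_METRICS[problem_type]},
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        # True asks Autopilot to stop after generating the notebooks.
        "GenerateCandidateDefinitionsOnly": notebooks_only,
        "RoleArn": role_arn,
    }


def submit(request):
    """Send the request to SageMaker; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without AWS access
    return boto3.client("sagemaker").create_auto_ml_job(**request)


# Placeholder values; replace them with your own bucket and role.
request = build_autopilot_request(
    job_name="house-prices-demo",
    input_s3_uri="s3://example-bucket/input/houses.csv",
    output_s3_uri="s3://example-bucket/output/",
    target_column="price",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    kind="number",
)
print(request["ProblemType"])
```

Calling `submit(request)` would start the job. Passing `notebooks_only=True` corresponds to the notebook-only mode described earlier.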

SageMaker Autopilot automatically investigates different solutions: it performs data preparation, data transformation, model selection, pipeline creation and hyperparameter tuning, and even selects the best model from the final list of candidates. As a result, you can deploy the model to a working environment with a single click.

In the background, SageMaker then trains up to 50 different models on that data, gives you access to each of them as a notebook, and presents them in the form of a leaderboard within SageMaker Studio.
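Outside of SageMaker Studio, a similar leaderboard can be put together from the candidate list that the API returns. The sketch below assumes candidate records shaped like the output of the boto3 call `list_candidates_for_auto_ml_job`; the helper names and the sample candidates with their scores are invented for illustration.

```python
def leaderboard(candidates, higher_is_better=True):
    """Rank candidates by their final objective metric value.

    Candidates without a final metric (for example, ones still training)
    are left out of the ranking.
    """
    scored = [c for c in candidates if "FinalAutoMLJobObjectiveMetric" in c]
    return sorted(
        scored,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=higher_is_better,
    )


def fetch_candidates(job_name):
    """Fetch one page of candidates from SageMaker; needs boto3 and credentials."""
    import boto3  # imported here so the ranking above runs without AWS access
    client = boto3.client("sagemaker")
    response = client.list_candidates_for_auto_ml_job(AutoMLJobName=job_name)
    return response["Candidates"]


# Invented sample records, only to show the ranking.
sample_candidates = [
    {"CandidateName": "candidate-a",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.91}},
    {"CandidateName": "candidate-b",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy", "Value": 0.94}},
    {"CandidateName": "candidate-c"},  # no final metric yet
]

ranked = leaderboard(sample_candidates)
for position, candidate in enumerate(ranked, start=1):
    metric = candidate["FinalAutoMLJobObjectiveMetric"]
    print(position, candidate["CandidateName"], metric["Value"])
```

For error-style metrics such as MSE, where lower is better, pass `higher_is_better=False`.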

Autopilot also determines the right EC2 instance family and the number of instances necessary for each of the stages of the pipeline. When the job is done and the final model is selected, Amazon SageMaker Autopilot will tear down the infrastructure.

With the introduction of SageMaker Autopilot, AWS has taken a first step towards making AutoML solutions transparent and understandable. The platform offers several benefits: it generates high-quality models quickly and easily, provides visibility and control throughout the process, and makes model deployment easy. It should also please expert data scientists, because its integration with other SageMaker Studio capabilities such as SageMaker Model Monitor makes it easier to operationalize ML solutions.

Mark's Blog about AI and ML
