Automated Machine Learning (AutoML) tools have become wildly popular in the field of data science because they can automate some of the most repetitive tasks across machine learning (ML) projects. These tools are applicable to most ML projects and applications, and can be used in virtually any industry to rapidly develop ML models. By automating tasks, they also open the door to a wider range of professionals who might want to get involved in ML projects.
In this post, we’ll explain what AutoML tools can do and how they can fit into your organization or project. We’ll then go into the technical details of how AutoML gets the job done. Finally, we’ll cover which sorts of ML applications are best suited for AutoML, which might not be, and what else you should consider to get your application from concept to production.
What is AutoML?
AutoML tools optimize supervised, predictive ML algorithms for a given dataset. Most AutoML tools are designed to work with structured tabular data, such as a database table. Given a dataset with a certain regression or classification target, an AutoML tool will train many ML models and select the best one for a given use case.
AutoML should not be confused with other key concepts that combine automation and artificial intelligence (AI) or ML. AutoML is distinct from MLOps, which brings DevOps-style automation to the engineering of ML applications. It is also different from AIOps, which applies AI and analytics to the field of IT operations.
Who can use AutoML?
AutoML allows people who aren’t data scientists to get involved in ML projects because it orchestrates many esoteric ML tasks and concepts, and it can potentially reduce the amount of time that it takes data scientists to perform routine tasks. It allows you to train and tune models without thinking about hyperparameter selection and cross-validation strategies, or about specific model architectures (e.g., logistic regression vs. decision tree). You still need some understanding of databases and other data sources, business knowledge to identify use cases, and some statistical literacy to select metrics for a given use case.
Will AutoML replace data scientists?
In short, no. While AutoML tools are powerful, they only do part of a data scientist’s job: optimize models. Since the model-optimization procedure is repeatable for a vast array of use cases, it can be automated by AutoML. Data-science skills are still necessary to discover use cases, gather appropriate data, design applications, select metrics, etc. Data scientists can also use AutoML tools to automate some of the boring parts of their job, freeing up time to focus on more complex problems.
How does AutoML work?
AutoML creates supervised ML models based on data – let’s start by unpacking that a bit. ML models are computer programs which learn their routines automatically rather than being programmed manually. The most commonly used ML models are trained by supervised learning using previously recorded examples, i.e. data.
During supervised learning, a model is trained to predict one or more target variables when given a collection of feature variables. A trained model is then able to predict the target based on the features alone, which makes it useful for future cases where the target is unknown but the features are known. Putting a model into service to make predictions on new examples is commonly known as model deployment.
Supervised ML models fall into two categories: classification and regression; classification models learn to predict a class label, while regression models predict a continuous numerical value.
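To make the supervised pattern concrete, here is a deliberately tiny sketch in plain Python, with no ML library: memorize labeled examples, then predict the label of a new feature vector from its nearest known neighbor. The customer data and the 1-nearest-neighbor rule here are illustrative inventions, far simpler than anything an AutoML tool would actually settle on.

```python
# Toy supervised "model": 1-nearest-neighbor classification.
# Features: [age, income in thousands]; target: whether the customer churned.
# All data here is made up for illustration.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fit(features, targets):
    # "Training" for 1-NN is simply memorizing the labeled examples.
    return list(zip(features, targets))

def predict(model, new_features):
    # Predict the target of the closest known example.
    _, label = min(model, key=lambda pair: euclidean(pair[0], new_features))
    return label

X = [[25, 40], [52, 110], [31, 55], [60, 95]]
y = ["churned", "stayed", "churned", "stayed"]

model = fit(X, y)
print(predict(model, [28, 48]))  # closest example is [31, 55] -> "churned"
```

Real models generalize far better than this, but the workflow is the same: fit on examples where the target is known, then predict it for examples where it isn't.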
AutoML tools start by connecting to a particular dataset as an input. The basic assumption is that your dataset has the necessary data to build a supervised ML model for a classification or regression task.
More specifically, this dataset should contain the target variable as well as any other data that will be used as features for your model to take as input for its predictions. These datasets are usually a table in a database, or something similar like a CSV/parquet file. An AutoML tool will prompt you to select the dataset and identify the target column (or columns, in the case of multi-target regression or multilabel classification).
When preparing a dataset for input, take care to include only variables (columns) that are appropriate for the given ML application. The target, of course, should be the variable you intend for the ML model to predict. The remaining columns will be used as features. Most importantly, the resulting model will require values for those features in order to generate predictions.
PRO TIP: You should take care not to include variables that would not be available as input for prediction once deployed.
Once the input dataset has been configured, most AutoML tools provide a profile of the data. The data profile will include descriptive statistics for each variable in the dataset, such as mean, median, quartiles, etc. It may also include some visualizations such as histograms or measures of correlation. As part of this profiling step, the tool will determine which variables are numeric vs. categorical and count missing values for each variable. It may also determine which categorical variables have high cardinality (many unique values) to influence how the variable will be encoded.
Correlation analysis, histograms, and descriptive statistics from Dataiku’s AutoML solution.
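As a rough sketch of what this profiling step computes, the snippet below derives the same kind of per-column summary (type detection, descriptive statistics, missing-value counts, cardinality) from a made-up table using only the Python standard library. Real tools compute much more, but the idea is the same.

```python
import statistics
from collections import Counter

# Made-up dataset: rows as dicts, with a missing value (None) in "income".
rows = [
    {"age": 34, "income": 52000, "plan": "basic"},
    {"age": 41, "income": None,  "plan": "premium"},
    {"age": 29, "income": 48000, "plan": "basic"},
    {"age": 55, "income": 91000, "plan": "enterprise"},
]

def profile(rows, column):
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    summary = {"missing": len(values) - len(present)}
    if all(isinstance(v, (int, float)) for v in present):
        summary["type"] = "numeric"
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
    else:
        summary["type"] = "categorical"
        # High cardinality here would influence the encoding strategy.
        summary["cardinality"] = len(set(present))
        summary["top"] = Counter(present).most_common(1)[0][0]
    return summary

print(profile(rows, "income"))  # numeric, 1 missing value
print(profile(rows, "plan"))    # categorical, cardinality 3
```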
The AutoML process
The key innovation of AutoML tools is that they automate the process of hyperparameter optimization, also known as hyperparameter tuning or model selection. Hyperparameters are all of the different knobs and switches that data scientists can tweak when building transformers and models for an ML application. This even includes the type of transformer or model itself. AutoML will fit many candidate models with different combinations of hyperparameters and determine which one is best. This is basically equivalent to throwing a bunch of spaghetti at the wall and seeing what sticks, but with some more sophistication.
AutoML tools can experiment with hundreds or thousands of candidate models during optimization. While hyperparameter tuning usually starts with some amount of random sampling, most tools will use a technique for intelligently refining samples later in the process. Common strategies for this include Bayesian optimization and bandit approaches.
Optimization naturally assumes there is some target metric to be optimized – this allows candidate models to be ranked on a scoreboard. AutoML tools usually make this scoreboard visible to you and then automatically select the top model, though sometimes you can explore alternative candidates as well. As such, you’ll need to specify the metric that should be optimized for any given problem. For a classification problem, this may be precision, recall, F1 score, or ROC-AUC; for regression, something like RMSE or MAE. For every candidate model, the AutoML tool will measure the metric using a cross-validation technique, then select the candidate with the best score.
Comparison of trained regression models in Dataiku. In this case, the tool trained and optimized hyperparameters for three different model types: Random Forest, Ridge (L2) Regression, and XGBoost. The best model was a Random Forest (denoted with the trophy icon), with XGBoost a close second.
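The core loop described above – sample hyperparameter combinations, score each candidate with cross-validation, keep the best – can be sketched with scikit-learn rather than any particular AutoML product. The dataset and search space below are invented for illustration.

```python
# A minimal sketch of the AutoML optimization loop using scikit-learn:
# randomly sample hyperparameter combinations, score each candidate with
# cross-validation on a chosen metric, and keep the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,          # number of candidate models to try
    scoring="roc_auc",  # the metric to optimize
    cv=5,               # 5-fold cross-validation per candidate
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

An AutoML tool runs the same kind of loop, but over many model families at once and with smarter sampling than pure random search.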
Hyperparameters tuned by AutoML tools fall broadly into two categories: those related to engineering/transforming features for the model, and those related to the supervised ML model itself. For engineering features, the process might experiment with different strategies for imputing missing values, such as simple mean/median imputation or something more sophisticated like MICE. They may also experiment with (or heuristically select) normalization strategies for numerical/continuous variables and encoding strategies for categorical variables.
Many AutoML tools also experiment with feature engineering by dimensionality reduction, such as PCA. When it comes to the predictive model itself, AutoML tools try multiple model architectures – such as logistic regression, SVM, GBDT – and randomly sample their underlying hyperparameters to search for an optimal configuration.
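Under the hood, those feature-engineering choices look roughly like the scikit-learn pipeline below. The column names and the specific strategies (median imputation, standardization, one-hot encoding) are illustrative assumptions; an AutoML tool would search over several such variants rather than fixing one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with a missing numeric value and a categorical column.
df = pd.DataFrame({
    "age": [34, None, 29, 55],
    "income": [52000, 61000, 48000, 91000],
    "plan": ["basic", "premium", "basic", "enterprise"],
})

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values with the median, then standardize.
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    # Categorical columns: one-hot encode, tolerating unseen categories later.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

features = preprocess.fit_transform(df)
print(features.shape)  # 4 rows; 2 scaled numeric + 3 one-hot columns
```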
Types of AutoML tools
AutoML tools fall into three broad categories: graphical, programmatic, and hybrid. Some are designed for users with little to no coding experience, while others are driven entirely by code.
Graphical User Interfaces
Platforms such as Dataiku, DataRobot, and H2O.ai include AutoML tools with convenient graphical interfaces that allow you to create models without writing any code. These platforms are full-featured offerings that include more than just AutoML, and they can enable your organization to take greater advantage of the wide range of skill sets already available within its workforce.
In addition to a graphical interface for AutoML, these platforms provide graphical tools to help wrangle data as input to AutoML, and they support the deployment of models with reduced engineering overhead.
Programmatic Interfaces

For users who prefer to write code, there are APIs available such as Caret for R, and PyCaret or TPOT for Python. These libraries allow you to point to a dataset with target variables and features using code, and kick off the process by calling functions. This lets you wrangle data in your language of choice and deploy your models with a high degree of flexibility.
Hybrid Tools

Sagemaker Autopilot is a good example of a hybrid tool that involves both a graphical interface and code. It prompts you for input using a graphical interface, then generates code (as Python/Jupyter notebooks) for data profiling and hyperparameter tuning. The generated code can also be used to register the best model in the Sagemaker Model Registry.
In some sense, Dataiku could also be considered a hybrid tool since it optionally allows you to incorporate your own code, but code is not as central to the design as Sagemaker Autopilot.
When to use AutoML
AutoML is best suited for ML projects which use structured data, which actually covers a solid majority of projects. When features are organized into rows and columns, they are already formatted for AutoML tools to ingest. Since AutoML tools handle imputation, it doesn’t matter if some data is missing from the columns. And AutoML tools will encode categorical variables and normalize numerical ones to engineer features for ML algorithms.
Small and medium datasets
AutoML is also most appropriate for small- to medium-sized datasets. Training ML models on larger datasets simply takes longer. The technology is built on training many candidate models as experiments, so running larger datasets through so many experiments can be costly in terms of time and/or compute resources. While the increase in training time is linear in many cases, models trained on larger datasets can also learn more structure and thus require more iterations during training or optimization.
There are no hard definitions for a medium-sized dataset, and compute budgets depend on the organization and project. As a rule of thumb, datasets with up to 50 features (columns) and up to 100,000 rows are compatible with AutoML tools; anything smaller than that should pose no problem at all. Larger datasets will likely require more careful thought about the time/compute budget for the project and whether more directed experimentation by a data scientist would be more efficient.
Rapid prototyping and proof of concept
AutoML is also a strong fit for rapid prototyping and proofs of concept: it lets you stand up a baseline model and demonstrate feasibility quickly, before committing significant engineering effort.

PyCaret includes built-in visualizations to help evaluate models. Generating these visualizations requires a single function call rather than complex custom code.
When not to use AutoML
While some AutoML tools include deep neural networks in the suite of candidate models, the vast majority of them can’t claim to do deep learning in the most practical sense of the term – namely, the engineering of features from raw unstructured data.
Examples of unstructured data include raw text (natural language) or images and videos. Dealing with these types of data requires a bit more expertise to transform the data and get it ready for modeling. The full-featured platforms such as Dataiku and DataRobot can help with some of these prerequisite steps and can even handle certain unstructured data.
That said, AutoML is still not as well suited for deep learning as for traditional machine learning. The hyperparameters involved in deep learning – network architecture, advanced regularization techniques, transfer learning, etc. – are too vast and open-ended for the brute-force optimization style of AutoML. The compute budgets necessary to tune these models lend themselves better to the skilled and structured experimentation of a data scientist.
In a similar sense, large datasets will also pose problems for AutoML tools. Training models on large datasets takes a long time. As a result, it may require too much time or compute (and ultimately cost) to execute the sheer number of experiments required to select hyperparameters; recall that it is generally necessary for AutoML to run hundreds or thousands of experiments to find the optimal model.
If the AutoML tool (on the scale of computing infrastructure you have available) is taking several hours or days to complete its experiments, consider taking a more structured approach to your experimentation. In the meanwhile, you can create a prototype model on a subset of your data to help get answers sooner.
Complex use cases
There are also complex use cases which are not well suited for AutoML. Within the highly structured framework of AutoML tools, it’s easy to optimize models when the metrics are clear and easy to calculate. Some problems, however, require custom logic or metrics to evaluate the quality of a model. Challenges may arise in plugging that logic or metric into the AutoML tool at your disposal. For prototyping purposes, it may be possible to reduce your problem to a simpler metric and go forward with AutoML, then iterate to refine the model later.
What else is there to worry about when using AutoML?
AutoML tools will train models, but they still won’t adapt them to your business case. That’s your job, but it’s arguably the most fun part. It means you need a clear understanding of what metric to optimize in order to satisfy the business demands of your project. You will also want to take care in how you work with the intricacies of your data and application, such as classification on highly imbalanced data.
Improving a model that was optimized using AutoML can also require significant effort. The speed and ease at which a prototype model is developed with AutoML can lead to great gains and send the message (to you and your leadership) that developing ML models is easy. But improving the performance of those models by even a small amount may require significant code and research. If the performance of an AutoML model is borderline, you should be grounded in your expectations about the time and effort required to improve that model.
And whenever you develop an ML model, it will likely need to be deployed to a production environment to deliver value. Some platforms (e.g. Dataiku, DataRobot, Sagemaker) can greatly simplify this process by automatically wrapping your model as a REST API. But even with that help, there are high-level considerations that can require careful evaluation by (software or DevOps) engineers, such as security, availability, reliability, and scaling. For those that need a robust solution to these problems, a full MLOps implementation may be desirable.
AutoML tools represent a powerful advancement in data-science technology. They enable data scientists to do their work more quickly, and can empower teams of less experienced professionals to build ML models and drive their organizations forward.
Most importantly, AutoML tools can probably be applied to a significant portion of your projects. While they don’t solve all problems (such as ML engineering and operations), they can certainly accelerate your data-driven transformation. In most cases, project success will still depend on support from good ML engineers and data scientists.
At phData, we’ve repeatedly seen the effectiveness of AutoML in practice. If you need guidance in adopting AutoML tools or deploying models into production, we’re here to help.
Frequently Asked Questions About AutoML
Are AutoML tools free?

Some AutoML tools are free and/or open source. The Dataiku platform can be run standalone with reduced features, and AutoML is among the features included. The Python library PyCaret is completely free and open source. Sagemaker Autopilot is technically free on its own, although users will pay for the AWS compute resources used to train models.
What tasks can AutoML automate?

AutoML tools can automate the following tasks for ML:
- Data profiling (visualization and calculation of descriptive statistics)
- Data preprocessing/cleansing
- Feature engineering (e.g. categorical encoding, normalization, dimensionality reduction)
- Hyperparameter tuning
- Model evaluation
What are common use cases for AutoML?

AutoML can be used for many common use cases where machine learning generally works well:
- Customer churn prediction
- Process automation
- Fraud detection
- Personalized marketing
- Anomaly detection
Can AutoML tools connect to external data sources?

Yes, most AutoML tools will integrate in some way with external data sources, such as Snowflake, Amazon S3, relational databases (Oracle, MS SQL Server), document databases (MongoDB, etc.), or even enterprise systems like SAP and Salesforce. Full-featured platforms like Dataiku and DataRobot have connectors built in or available as plugins. Programmatic tools can ingest data from any source for which the language has a connector; for instance, PyCaret can effectively connect to any data source with a Python connector library.