3 Critical Steps to Take When Predicting Adverse Health Events with ML

In this post, we'll discuss three steps to take when leveraging machine learning (ML) to successfully predict adverse health events. We will skip over security, privacy, and transparency, which you can find covered in this article.

The three steps we’re going to cover in this post are: 

  1. Population
  2. Data cleaning, manipulation, and target identification
  3. Choosing an algorithm and explainability 


Population

When attempting to collect data that reflects an entire real-world population, issues with data collection are common. In healthcare models, these gaps can lead to adverse patient outcomes caused by a flawed algorithm rather than by the patient's actual health conditions. Checking the data against the demographic statistics of the population it is meant to represent is therefore a key step to success.

Take maternal mortality rates, for example. Statistically, minority women face a higher risk of dying after pregnancy. If a model is trained only on the maternal data of white women, it will miss genetic patterns and comorbidities unique to the underrepresented groups. Misidentifying a high-risk patient is not like recommending the wrong discount code for an online purchase. The need for an accurate and reliable model is critical in the healthcare space.

It's good practice to look at the population statistics of your healthcare system's users and construct a representative dataset. If you purchase a pre-built model, check that its training population breakdown is representative of your population, or that the model can be retrained on your data. Also confirm that the modeling approach applies well to both the population and the use case you are exploring.
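As a minimal sketch of that demographic check, the snippet below compares a training set's group proportions to known population percentages. The column name, groups, and percentages are all hypothetical placeholders; substitute your own census or EHR statistics.

```python
import pandas as pd

def demographic_gap(train_df, population_pct, column):
    """Compare a training set's demographic mix to known population
    percentages; return the per-group gap in percentage points."""
    train_pct = train_df[column].value_counts(normalize=True) * 100
    pop = pd.Series(population_pct)
    # Groups entirely missing from training data show up as -population%.
    gap = (train_pct - pop).fillna(-pop).round(1)
    return gap.sort_values()

# Toy training data that over-represents one group
train = pd.DataFrame({"race_ethnicity": ["White"] * 80 + ["Black"] * 12 + ["Hispanic"] * 8})
# Hypothetical service-area breakdown (replace with your real statistics)
population = {"White": 60.0, "Black": 20.0, "Hispanic": 20.0}

print(demographic_gap(train, population, "race_ethnicity"))
```

Large negative gaps flag subpopulations that the model will under-learn and that need more (or re-weighted) data before training.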

Data Cleaning, Manipulation, and Target Identification

Healthcare data can be some of the hardest data to clean and prepare because of the unique shorthand healthcare professionals use to annotate it. Depending on the situation, it can be very beneficial for the data scientist to interview a few stakeholders.

This can prevent key data from being mangled or misinterpreted. Eliminating short tokens and non-English words is standard practice in natural language processing, but in a clinical dataset those tokens may actually be medication shorthand.
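One way to guard against that, sketched below, is to keep an allowlist of clinical shorthand that a generic short-token filter would otherwise discard. The abbreviation set here is a small illustrative sample, not a complete clinical vocabulary.

```python
import re

# Hypothetical allowlist built with stakeholder input: common clinical
# shorthand (e.g., "qd" = once daily, "sob" = shortness of breath).
CLINICAL_SHORTHAND = {"qd", "bid", "tid", "prn", "po", "iv", "hx", "sob"}

def clean_note(text, min_len=3):
    """Lowercase and tokenize a note, dropping short tokens -- unless
    they are known clinical shorthand worth preserving."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len or t in CLINICAL_SHORTHAND]

print(clean_note("Pt c/o SOB at rest, meds PO qd"))
```

A naive filter would have thrown away "po" and "qd", losing the medication-frequency signal entirely.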

Is there a specific field that already indicates an adverse health event has occurred? If the identification of the event is buried within free text, a different level of complexity is needed. That complexity can sometimes be reduced if clinicians or other stakeholders can create the outcome column manually. A target column allows a supervised learning method to be used, instead of an unsupervised or more complex deep learning approach.

Choosing an Algorithm and Explainability 

Healthcare is a regulated field. With this in mind, many health ethics advocates want the reasoning behind a prediction to be transparent to both doctors and patients. Care must be taken to ensure that the algorithms being used allow for explainability. This does not mean deep learning algorithms are automatically ruled out, but that experimental design must be considered early on.

With traditional machine learning models, it's easy to extract which features (columns) in the data contribute most to the outcome. We can sometimes even plot the weight of each feature relative to the others. However, with methods that use neural networks, such as RNNs, we cannot easily understand why a decision was made. That's why data scientists have been developing methods to make these black-box models more explainable.
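To illustrate the traditional-model case, here is a minimal sketch using synthetic data and a plain linear least-squares fit: the magnitude of each coefficient serves as a simple, directly interpretable feature weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: three features, where only the first two drive the outcome.
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit a linear model; |coefficient| is an interpretable importance measure.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importance = {f"feature_{i}": abs(c) for i, c in enumerate(coef)}
for name, weight in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(name, round(weight, 2))
```

The same idea extends to tree-based models (e.g., `feature_importances_` in scikit-learn) or standardized regression coefficients.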

One such method is LIME (Local Interpretable Model-agnostic Explanations).

LIME uses local surrogate models, which are "interpretable models that are used to explain individual predictions of black box machine learning models." Essentially, it perturbs the input around a single prediction, observes how the black-box output changes, and fits a simple interpretable model to those results.
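The core idea can be sketched in a few lines of NumPy. This is a simplified stand-in for what the actual `lime` library does, using a toy black-box function: perturb the instance, weight samples by proximity, and fit a weighted linear surrogate whose coefficients explain the local behavior.

```python
import numpy as np

rng = np.random.default_rng(1)

def black_box(X):
    """Stand-in for an opaque model: a nonlinear function of two features."""
    return np.tanh(3 * X[:, 0]) + 0.2 * X[:, 1] ** 2

# Instance whose prediction we want to explain.
x0 = np.array([0.1, 0.5])

# 1. Perturb the instance locally and query the black box.
samples = x0 + rng.normal(scale=0.1, size=(500, 2))
preds = black_box(samples)

# 2. Weight samples by proximity to x0 (closer = more influence).
weights = np.exp(-np.sum((samples - x0) ** 2, axis=1) / 0.02)

# 3. Fit a weighted linear surrogate via weighted least squares.
A = np.column_stack([np.ones(len(samples)), samples - x0])
sw = np.sqrt(weights)
beta, *_ = np.linalg.lstsq(A * sw[:, None], preds * sw, rcond=None)

print("local effect of feature 0:", round(beta[1], 2))
print("local effect of feature 1:", round(beta[2], 2))
```

The surrogate's coefficients approximate the black box's local slopes at `x0`, which is exactly the kind of per-prediction explanation a clinician can inspect.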

How to Implement Adverse Event Prediction on the Snowflake Data Cloud

Snowflake not only helps data scientists build models for predicting adverse events, but also provides a platform for deploying those models. Data used to train such models will almost certainly come from electronic health record (EHR) systems, but other data may be necessary to optimize predictive performance.

For instance, claims data may provide a more complete longitudinal view of patient health. Centralizing data from these disparate sources within Snowflake provides a powerful platform for data science and model development. 

Leveraging patient data requires tight management of user access levels to meet standards such as HIPAA and HL7. Snowflake makes it easy to grant the right access to users based on their roles.
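As one possible sketch of that role-based setup, the helper below composes the Snowflake `GRANT` statements for a read-only analyst role. The role, database, and schema names are illustrative; with Snowpark, each statement could be executed via `session.sql(...)`.

```python
def rbac_grants(role, database, schema):
    """Build the Snowflake RBAC statements for a read-only role.
    All identifiers here are hypothetical examples."""
    return [
        f"CREATE ROLE IF NOT EXISTS {role}",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO ROLE {role}",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {database}.{schema} TO ROLE {role}",
    ]

for stmt in rbac_grants("CLINICAL_ANALYST", "EHR_DB", "CURATED"):
    print(stmt + ";")
```

Scoping analysts to `SELECT` on a curated schema, rather than to raw EHR tables, is one common way to keep PHI exposure to the minimum each role needs.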

Once models have been trained, the next step is to create a process that can generate predictions (inference) for new data. The models can be deployed using Snowpark Python user-defined functions (UDFs) to package the model and run inference workloads on Snowflake compute. Predictions generated in this way can be written into Snowflake tables to make them available downstream. 
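A minimal sketch of that packaging step is shown below. The scoring function is a toy stand-in for a real trained model, and the commented registration uses the Snowpark Python API under the assumption of an active session and an existing stage.

```python
def predict_risk(age: int, prior_admissions: int) -> float:
    """Toy risk score in [0, 1]; a real UDF would call model.predict_proba."""
    score = 0.01 * age + 0.1 * prior_admissions
    return round(min(score, 1.0), 2)

# Hypothetical Snowpark registration (requires an active session;
# names and stage location are illustrative):
# from snowflake.snowpark.functions import udf
# from snowflake.snowpark.types import FloatType, IntegerType
# predict_risk_udf = udf(
#     predict_risk,
#     name="PREDICT_ADVERSE_EVENT_RISK",
#     return_type=FloatType(),
#     input_types=[IntegerType(), IntegerType()],
#     is_permanent=True,
#     stage_location="@models",
# )

print(predict_risk(60, 2))  # 0.8
```

Once registered, the UDF can be called in SQL against new patient rows, and the resulting scores written to a Snowflake table for downstream use.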

The final step in implementing an adverse event prediction solution is to serve those predictions back to healthcare providers. To do that, predictions should be integrated into the systems providers already use on a regular basis. Most likely, this means pushing data and alerts into an EHR system so that predictions are visible at nursing stations or at the patient's bedside.

Ultimately, the intent is to allow providers to take proactive actions to prevent adverse events. 


There will always be complicating factors that affect your ability to implement these steps effectively. But that should not stop you from ensuring that they are at least addressed in your experiments. Each is a key factor in keeping your data experiments aligned with your end goals.

If you have any questions or would like to explore implementing a predictive model in your organization, please reach out to us!
