Huda Mahmood

Machine Learning Models: 4 Ways to Test them in Production

Machine learning models are algorithms designed to identify patterns and make predictions or decisions based on data. These models are trained using historical data to recognize underlying patterns and relationships. Once trained, they can be used to make predictions on new, unseen data.

Modern businesses are embracing machine learning (ML) models to gain a competitive edge. These models enable them to personalize customer experience, detect fraud, predict equipment failures, and automate tasks, which improves overall efficiency and supports data-driven decisions.

Deploying ML models in day-to-day processes allows businesses to adopt and integrate AI-powered solutions. As the impact and use of AI grow rapidly, ML models are becoming a crucial element of modern businesses.

Here’s a step-by-step guide to deploying ML in your business

A PwC study on global artificial intelligence estimates that local economies could see a GDP boost of up to 26% by 2030 due to the adoption of AI in businesses. This underlines the increasing role of AI in modern businesses and, consequently, the need for ML models.

However, deploying ML models in businesses is a complex process and it requires proper testing methods to ensure successful deployment. In this blog, we will explore the 4 main methods to test ML models in the production phase.

What is Machine Learning Model Testing?

In the context of machine learning, model testing refers to a detailed process that ensures a model is robust, reliable, and free from biases. Each component of an ML model is verified, the integrity of data is checked, and the interaction among components is tested.

The main objective of model testing is to identify and fix flaws or vulnerabilities in the ML system. It aims to ensure that the model can handle unexpected inputs, mitigate biases, and remain consistent and robust in various scenarios, including real-world applications.

ML model testing in the ML lifecycle — Workflow for model deployment with testing – Source: markovML

It is also important to note that ML model testing is different from model evaluation. Before we explore the different testing methods, let’s understand the difference between the two.

What is the Difference between Model Evaluation and Testing?

A quick overview of the basic difference between model evaluation and model testing is as follows:

Aspect	Model Evaluation	Model Testing
Focus	Overall performance	Detailed component analysis
Metrics	Accuracy, Precision, Recall, RMSE, AUC-ROC	Code, Data, and Model behavior
Objective	Monitor performance, compare models	Identify and fix flaws, ensure robustness
Process	Split dataset, train, and evaluate	Unit tests, regression tests, integration tests
Use Cases	Algorithm comparison, hyperparameter tuning, performance summary	Bias detection, robustness checks, consistency verification

From the details above, it can be concluded that while model evaluation gives a snapshot of how well a model performs, model testing ensures the model’s reliability, robustness, and fairness in real-world applications. Thus, it is important to test a machine learning model in production to ensure its effectiveness and efficiency.
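To make the distinction concrete, here is a minimal sketch of what model tests can look like, as opposed to an evaluation metric. The `predict_risk` function is a hypothetical stand-in for a trained model, and the three checks (robustness to missing input, invariance to an irrelevant field, and a directional expectation) are illustrative assumptions rather than a fixed standard.

```python
import math


def predict_risk(features):
    """Toy stand-in for a trained model: returns a score in [0, 1]."""
    income = features.get("income")
    if income is None or not isinstance(income, (int, float)) or math.isnan(income):
        income = 0.0  # fall back to a safe default on missing or malformed input
    debt = features.get("debt", 0.0) or 0.0
    return 1.0 / (1.0 + math.exp(-(debt - income) / 10_000.0))


def test_handles_missing_input():
    # Robustness: a missing field must not raise and must stay in range.
    score = predict_risk({"debt": 5_000.0})
    assert 0.0 <= score <= 1.0


def test_invariant_to_irrelevant_field():
    # Consistency: a field the model should ignore must not change the output.
    base = {"income": 40_000.0, "debt": 10_000.0}
    assert predict_risk(base) == predict_risk({**base, "user_name": "alice"})


def test_directional_expectation():
    # Behavior: more debt at equal income should not lower the risk score.
    low = predict_risk({"income": 40_000.0, "debt": 5_000.0})
    high = predict_risk({"income": 40_000.0, "debt": 20_000.0})
    assert high >= low
```

Tests like these can run under a test runner such as pytest on every model change, while evaluation metrics are tracked separately.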

Explore this list of 9 free ML courses to get you started

Frameworks Used in ML Model Testing

Since testing ML models is a very important task, it requires a thorough and efficient approach. Multiple frameworks offer pre-built tools, enforce structured testing, provide diverse testing functionalities, and promote reproducibility, which results in faster and more reliable testing for robust models.

machine learning model testing frameworks — A list of frameworks to use for ML model testing

Here’s a list of key frameworks used for ML model testing.

TensorFlow

The TensorFlow ecosystem offers three main tools for testing:

TensorFlow Extended (TFX): This is designed for production pipeline testing, offering tools for data validation, model analysis, and deployment. It provides a comprehensive suite for defining, launching, and monitoring ML models in production.
TensorFlow Data Validation (TFDV): A TFX library for testing data quality in ML pipelines.
TensorFlow Model Analysis (TFMA): A TFX library for in-depth model evaluation.

PyTorch

Known for its dynamic computation graph and ease of use, PyTorch provides model evaluation, debugging, and visualization tools. The torchvision package includes datasets and transformations for testing and validating computer vision models.

Scikit-learn

Scikit-learn is a versatile Python library that offers various algorithms and model evaluation metrics, including cross-validation and grid search for hyperparameter tuning. It is widely used for data mining, analysis, and machine learning tasks.

Read more about the top 6 python libraries for data science

Fairlearn

Fairlearn is a toolkit designed to assess and mitigate fairness and bias issues in ML models. It includes algorithms to reweight data and adjust predictions to achieve fairness, ensuring that models treat all individuals fairly and equitably.
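As an illustration of the kind of quantity a fairness toolkit reports, the sketch below computes the demographic parity difference, the gap between the highest and lowest positive-prediction rate across groups. It is written in plain Python and does not use Fairlearn's API; the function names and sample data are assumptions made for this example.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Share of positive predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest selection rate; 0 means parity."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical predictions for two groups of four people each.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_difference(preds, groups)
```

Here group "a" is selected 75% of the time and group "b" 25% of the time, so the gap is 0.5. A fairness test could fail the build when this gap exceeds a threshold the team has agreed on.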

Evidently AI

Evidently AI is an open-source Python tool that is used to analyze, monitor, and debug machine learning models in a production environment. It helps implement testing and monitoring for different model types and data types.

Amazon SageMaker Model Monitor

Amazon SageMaker Model Monitor can alert developers to deviations in model quality so that corrective action can be taken. It supports no-code monitoring as well as custom analysis through code.

These frameworks provide a comprehensive approach to testing machine learning models, ensuring they are reliable, fair, and well-performing in production environments.

4 Ways to Test ML Models in Production

Now that we have explored the basics of ML model testing, let’s look at the 4 main testing methods for ML models in their production phase.

1. A/B Testing

A_B Testing - machine learning model testing — A visual representation of A/B testing – Source: Medium

A/B testing is used to compare two versions of an ML model to determine which one performs better in a real-world setting. This approach is essential for validating the effectiveness of a new model before fully deploying it into production. It helps in understanding the impact of the new model and ensuring it does not introduce unexpected issues.

It works by distributing the incoming requests non-uniformly between the two models. A smaller portion of the traffic is directed to the new model that is being tested to minimize potential risks. The performance of both models is measured and compared based on predefined metrics.
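The routing and comparison steps described above can be sketched as follows. This is a minimal illustration, not a production router: the 10% candidate share, the hash-based bucketing, the two-proportion z-test, and the sample counts are all assumptions chosen for the example.

```python
import hashlib
import math


def assign_variant(user_id, candidate_share=0.1):
    """Deterministically route a user; the same user always gets the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "candidate" if bucket < candidate_share else "legacy"


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference in success rate (b minus a)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def p_value_two_sided(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical outcome: legacy converts 900 of 9,000, candidate 130 of 1,000.
z = two_proportion_z(900, 9_000, 130, 1_000)
p = p_value_two_sided(z)
```

Hashing the user ID keeps each user on one model for the whole experiment, which avoids mixing both models' effects in a single user's behavior.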

Benefits of A/B Testing

Risk Mitigation: By limiting the exposure of the candidate model, A/B testing helps in identifying any issues in the new model without affecting a large portion of users.
Performance Validation: It allows teams to validate that the new model performs at least as well as, if not better than, the legacy model in a production environment.
Data-Driven Decisions: The results from A/B testing provide concrete data to support decisions on whether to fully deploy the candidate model or make further improvements.

Thus, A/B testing is a critical step in ML model testing. It ensures that a new model is thoroughly vetted in a real-world environment, maintaining model reliability and performance while minimizing the risks of deploying untested models.

2. Canary Testing

canary testing - machine learning model testing — An outlook of canary testing – Source: Ambassador Labs

The canary testing method is used to gradually deploy a new ML model to a small subset of users in production to minimize risks and ensure that the new model performs as expected before rolling it out to a broader audience. This smaller subset of users is often referred to as the ‘canary’ group.

The main goal of this method is to limit the exposure of the new ML model initially. This incremental approach helps in identifying and mitigating any potential issues without affecting the entire user base. The performance of the ML model is monitored in the canary group.

If the model performs well in the canary group, it is gradually rolled out to a larger user base. This process continues incrementally until the new model is fully deployed to all users.
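The incremental rollout described above can be sketched as a loop over traffic stages with a rollback condition. The stage fractions, the 2% error threshold, and the `error_rate_at` callback are hypothetical; in practice the observed error rate would come from your monitoring system.

```python
def run_canary_rollout(stages, error_rate_at, max_error_rate=0.02):
    """Walk through traffic stages; stop and roll back on the first bad stage.

    stages: increasing fractions of traffic sent to the new model.
    error_rate_at: callable returning the observed error rate at a fraction.
    """
    history = []
    for share in stages:
        observed = error_rate_at(share)
        history.append((share, observed))
        if observed > max_error_rate:
            # The canary group surfaced a problem, so send all traffic back.
            return {"status": "rolled_back", "final_share": 0.0, "history": history}
    return {"status": "fully_deployed", "final_share": stages[-1], "history": history}


def healthy(share):
    return 0.01


def degrades_under_load(share):
    return 0.01 if share < 0.25 else 0.05


stages = [0.01, 0.05, 0.25, 1.0]
good_run = run_canary_rollout(stages, healthy)
bad_run = run_canary_rollout(stages, degrades_under_load)
```

In the second run the model looks fine at 1% and 5% of traffic, fails at 25%, and is rolled back before reaching the full user base.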

Benefits of Canary Testing

Risk Reduction: By initially limiting the exposure of the new model, canary testing reduces the risk of widespread issues affecting all users. Any problems detected can be addressed before a full-scale deployment.
Controlled Environment: This method provides a controlled environment to observe the new model’s behavior and make necessary adjustments based on real-world data.
User Impact Minimization: Users in the canary group serve as an early indicator of potential issues, allowing teams to respond quickly and minimize the impact on the broader user base.

Canary testing is an effective strategy for deploying new ML models in production. It ensures that potential issues are identified and resolved early, thereby maintaining the stability and reliability of the service while introducing new features or improvements.

3. Interleaved Testing

interleaved testing - machine learning model testing — A display of how interleaving works – Source: Medium

Interleaved testing is used to evaluate multiple ML models by mixing their outputs in real-time within the same user interface or service. This type of testing is particularly useful when you want to compare the performance of different models without exposing users to only one model at a time.

Users interact with the integrated output without knowing which model generated which part of the response. This helps in gathering unbiased user feedback and performance metrics for all models involved, allowing for a direct comparison under the same conditions and identifying which model performs better in real-world scenarios.

The performance of each model is tracked based on user interactions. Metrics such as click-through rates, engagement, and conversion rates are analyzed to determine which model is more effective.
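One common way to mix two ranked outputs is team-draft interleaving, sketched below under the assumption that each model returns a ranked list of item IDs. The function names and sample rankings are hypothetical, and clicks are the only interaction signal used here.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings and remember which model contributed each item."""
    merged, owner = [], {}
    picks = {"a": 0, "b": 0}
    rankings = {"a": ranking_a, "b": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["a"] != picks["b"]:
            team = "a" if picks["a"] < picks["b"] else "b"
        else:
            team = rng.choice(["a", "b"])
        candidates = [item for item in rankings[team] if item not in owner]
        if not candidates:
            # This team has nothing new to add, so the other team picks.
            team = "b" if team == "a" else "a"
            candidates = [item for item in rankings[team] if item not in owner]
        item = candidates[0]
        merged.append(item)
        owner[item] = team
        picks[team] += 1
    return merged, owner


def credit_clicks(clicked_items, owner):
    """Count clicks per model based on who contributed the clicked item."""
    credit = {"a": 0, "b": 0}
    for item in clicked_items:
        if item in owner:
            credit[owner[item]] += 1
    return credit


merged, owner = team_draft_interleave(["x", "y", "z"], ["y", "w", "x"], random.Random(0))
credit = credit_clicks(["y", "w"], owner)
```

Aggregating the per-model click credit over many sessions indicates which model users prefer, while each user still sees a single blended list.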

Benefits of Interleaved Testing

Direct Comparison: Interleaved testing allows for a direct, side-by-side comparison of multiple models under the same conditions, providing more accurate insights into their performance.
User Experience Consistency: Since users are exposed to outputs from both models simultaneously, the overall user experience remains consistent, reducing the risk of user dissatisfaction.
Detailed Feedback: This method provides detailed feedback on how users interact with different model outputs, helping in fine-tuning and improving model performance.

Interleaved testing is a useful testing strategy that ensures a direct comparison, providing valuable insights into model performance. It helps data scientists and engineers to make informed decisions about which model to deploy.

4. Shadow Testing

shadow testing - machine learning model testing — A glimpse of how shadow testing is implemented – Source: Medium

Shadow testing, also known as dark launching, is a technique used for real-world testing of a new ML model alongside the existing one, providing a risk-free way to gather performance data and insights.

It works by deploying both the new and old ML models in parallel. For each incoming request, the data is sent to both models simultaneously. Both models generate predictions, but only the output from the older model is served to the user. Predictions from the new ML model are logged for later analysis.

These predictions are then compared against the results of the older ML model and any available ground truth data to evaluate the performance of the new model.
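A minimal sketch of this flow is shown below. The two models are hypothetical threshold functions, and the candidate is called in-line for simplicity; a real deployment would more likely call it asynchronously so that it cannot add latency to the user-facing response.

```python
def handle_request(features, legacy_model, candidate_model, shadow_log):
    """Serve the legacy prediction and record the candidate's for later analysis."""
    served = legacy_model(features)
    record = {"features": features, "served": served, "shadow": None, "error": None}
    try:
        record["shadow"] = candidate_model(features)
    except Exception as exc:
        # A failing shadow model must never break the user-facing response.
        record["error"] = repr(exc)
    shadow_log.append(record)
    return served


def agreement_rate(shadow_log):
    """Share of logged requests where both models gave the same prediction."""
    scored = [r for r in shadow_log if r["error"] is None]
    if not scored:
        return 0.0
    return sum(r["served"] == r["shadow"] for r in scored) / len(scored)


def legacy_model(x):
    return x > 5


def candidate_model(x):
    return x > 6


shadow_log = []
responses = [handle_request(x, legacy_model, candidate_model, shadow_log) for x in range(10)]
rate = agreement_rate(shadow_log)
```

The two models disagree only for the input 6, so the agreement rate is 0.9, and every response the user received came from the legacy model.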

Benefits of Shadow Testing

Risk-Free Evaluation: Since the candidate model’s predictions are not served to the users, any errors or issues in the new model do not affect the user experience. This makes shadow testing a safe way to test new models.
Real-World Data: Shadow testing provides insights based on real-world data and conditions, offering a more accurate assessment of the model’s performance compared to offline testing.
Benchmarking: It allows for direct comparison between the legacy and candidate models, making it easier to benchmark the new model’s performance and identify areas for improvement.

Hence, shadow testing is a robust technique for evaluating new ML models in a live production environment without impacting the user experience. It provides valuable performance insights, ensures safe testing, and helps in making informed decisions about model deployment.

How to Choose a Testing Technique for Your ML Model Testing?

Choosing the appropriate testing technique for your machine learning models in production depends on several factors, including the nature of your model, the risks associated with its deployment, and the specific requirements of your application.

Here are some key considerations and steps to help you decide on the right testing technique:

Understand the Nature and Requirements of Your Model

Different models (classification, regression, recommendation, etc.) require different testing approaches. Complex models may benefit from more rigorous testing techniques like shadow testing or interleaved testing. Hence, you must understand the nature of your model and its complexity.

Moreover, it is crucial to assess the potential impact of model errors. High-stakes applications, such as financial services or healthcare, may necessitate more conservative and thorough testing techniques.

Evaluate Common Testing Techniques

Review and evaluate the pros and cons of the testing techniques, like the 4 methods discussed earlier in the blog. A thorough understanding of the techniques can make your decision easier and more informed.

Learn more about important ML techniques

Assess Your Infrastructure and Resources

While you have multiple options available, the state of your infrastructure and available resources are strong parameters for your final decision. Ensure that your production environment can support the chosen testing technique. For example, shadow testing requires infrastructure capable of parallel processing.

You must also evaluate the available resources, including computational power, storage, and monitoring tools. Techniques like shadow testing and interleaved testing can be resource-intensive. Hence, you must consider both factors when choosing a testing technique for your ML model.

Consider Ethical and Regulatory Constraints

Data privacy and digital ethics are important parameters for modern-day businesses and users. Hence, you must ensure compliance with data privacy regulations such as GDPR or CCPA, especially when handling sensitive data. You must choose techniques that allow for the mitigation of model bias, ensuring fairness in predictions.

Monitor and Iterate

Testing ML models in production is a continuous process. You must continuously track your model performance, data drift, and prediction accuracy over time. This must link to an iterative model improvement process. You can establish a feedback loop to retrain and update the model based on the gathered performance data.
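Data drift can be tracked with a simple distribution comparison such as the population stability index (PSI), sketched below for one numeric feature. The bin count, the small floor for empty buckets, and the sample data are assumptions; a commonly cited rule of thumb treats a PSI above roughly 0.2 as a significant shift, but that threshold should be validated for your own data.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a production sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # all reference values are identical

    def bucket_shares(values):
        counts = [0] * bins
        for value in values:
            idx = int((value - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty buckets.
        return [max(count / len(values), 1e-6) for count in counts]

    e_shares = bucket_shares(expected)
    a_shares = bucket_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


reference = [i / 100 for i in range(1000)]   # training-time feature values
stable = list(reference)                     # production looks the same
shifted = [value + 3.0 for value in reference]  # production has drifted upward

psi_stable = population_stability_index(reference, stable)
psi_shifted = population_stability_index(reference, shifted)
```

A check like this can run on a schedule, with a high PSI triggering the retraining loop described above.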

Hence, you must carefully select the testing technique for your ML model. You can consider techniques like A/B testing for direct performance comparison, canary testing for gradual rollout, interleaved testing for simultaneous output assessment, and shadow testing for risk-free evaluation.

To Sum it Up…

Testing ML models in production is a critical step. You must ensure your model’s reliability, performance, and safety in real-world scenarios. You can do that by evaluating the model’s performance in a live environment, identifying potential issues, and finding ways to resolve them.

We have explored 4 different methods to test ML models, each of which offers unique benefits and is suited to different scenarios and business needs. By carefully selecting the appropriate technique, you can ensure your ML models perform as expected, maintain user satisfaction, and uphold high standards of reliability and safety.

July 5, 2024

Machine Learning

Data Science Dojo Staff

Machine learning model deployment 101: A comprehensive guide

Machine Learning (ML) is a powerful tool that can be used to solve a wide variety of problems. However, building and deploying a machine-learning model is not a simple task. It requires a comprehensive understanding of the end-to-end machine learning lifecycle.

The development of a Machine Learning Model can be divided into three main stages:

Building your ML data pipeline: This stage involves gathering data, cleaning it, and preparing it for modeling.
Getting your ML model ready for action: This stage involves building and training a machine learning model using efficient machine learning algorithms.
Making sense of your ML model: This stage involves deploying the model into production and using it to make predictions.

Building your ML data pipeline

The first step of crafting a Machine Learning Model is to develop a pipeline for gathering, cleaning, and preparing data. This pipeline should be designed to ensure that the data is of high quality and that it is ready for modeling.

The following steps are involved in pipeline development:

Gathering data: The first step is to gather the data that will be used to train the model. Data can be scraped or collected from a variety of sources, such as online databases, sensor data, or social media.
Cleaning data: Once the data has been gathered, it needs to be cleaned. This involves removing any errors or inconsistencies in the data.
Exploratory data analysis (EDA): EDA is a process of exploring data to gain insights into its distribution, relationships, and patterns. This information can be used to inform the design of the model.
Model design: Once the data has been cleaned and explored, it is time to design the model. This involves choosing the right machine-learning algorithm and tuning the model’s hyperparameters.
Training and validation: The next step is to train the model on a subset of the data. Once the model has been trained, it can be evaluated on a holdout set of data to measure its performance.

Getting your machine learning model ready for action

Once the pipeline has been developed, the next step is to train the model. This involves using a machine learning algorithm to learn the relationship between the features and the target variable.

The following steps are involved in training:

Choosing a machine learning algorithm: There are many different machine learning algorithms available. The choice of algorithm will depend on the specific problem that is being solved.
Tuning hyperparameters: Hyperparameters are parameters that control the behavior of the machine learning algorithm. These parameters need to be tuned to achieve the best performance.
Training the model: Once the algorithm and hyperparameters have been chosen, the model can be trained on a dataset.
Evaluating the model: Once the model has been trained, it can be evaluated on a holdout set of data to measure its performance.
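The four steps above can be sketched end to end on synthetic data. This example assumes a one-feature ridge regression without an intercept and a small grid of regularization strengths; it also adds a separate validation split for tuning so that the holdout set is only used for the final measurement. All names and values are illustrative.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, w):
    """Mean squared error of the slope w on the given rows."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: the true slope is 3.0, with unit Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [3.0 * x + rng.gauss(0, 1.0) for x in xs]

# 60% train, 20% validation (for tuning), 20% holdout (for the final score).
x_train, y_train = xs[:300], ys[:300]
x_val, y_val = xs[300:400], ys[300:400]
x_test, y_test = xs[400:], ys[400:]

# Tuning: fit once per candidate hyperparameter, score on the validation set.
grid = (0.0, 1.0, 100.0, 10_000.0)
val_scores = {lam: mse(x_val, y_val, fit_ridge_1d(x_train, y_train, lam)) for lam in grid}
best_lam = min(val_scores, key=val_scores.get)

# Training and evaluation: refit with the chosen value, then measure on holdout.
best_w = fit_ridge_1d(x_train, y_train, best_lam)
test_mse = mse(x_test, y_test, best_w)
```

The heaviest regularization value shrinks the slope far below 3.0 and scores poorly on validation, which is how the grid search rules it out.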

Making sense of ML model’s predictions

Once the model has been trained, it can be deployed into production and used to make predictions.

The following steps are involved in inference:

Deploying the model: The model can be deployed in a variety of ways, such as a web service, a mobile app, or a desktop application.
Making predictions: Once the model has been deployed, it can be used to make predictions on new data.
Monitoring the model: It is important to monitor the model’s performance in production to ensure that it is still performing as expected.

Conclusion

Developing a Machine Learning Model is a complex process, but it is essential for building and deploying successful machine-learning applications. By following the steps outlined in this blog, you can increase your chances of success.

Here are some additional tips for building and deploying machine-learning models:

Establish a strong baseline model. Before you deploy a machine learning model, it is important to have a baseline model that you can use to measure the performance of your deployed model.
Use a production-ready machine learning framework. There are a number of machine learning frameworks available, but not all of them are suitable for production deployment. When choosing a machine learning framework for production deployment, it is important to consider factors such as scalability, performance, and ease of maintenance.
Use a continuous integration and continuous delivery (CI/CD) pipeline. A CI/CD pipeline automates the process of building, testing, and deploying your machine-learning model. This can help to ensure that your model is always up-to-date and that it is deployed in a consistent and reliable manner.
Monitor your deployed model. Once your model is deployed, it is important to monitor its performance. This will help you to identify any problems with your model and to make necessary adjustments.
Use visualizations to understand the insights better. Many insights can be drawn from the model, and they can be visualized using software like Power BI.
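The baseline and CI/CD tips can be combined into a simple quality gate, sketched below. The 5% minimum improvement, the mean-value baseline, and the sample numbers are assumptions for illustration; a pipeline step could run such a check and block deployment when it returns False.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def passes_quality_gate(y_true, candidate_pred, baseline_pred, min_improvement=0.05):
    """Return True only if the candidate beats the baseline by a relative margin."""
    baseline_mae = mean_absolute_error(y_true, baseline_pred)
    candidate_mae = mean_absolute_error(y_true, candidate_pred)
    return candidate_mae <= baseline_mae * (1.0 - min_improvement)


y_true = [10.0, 12.0, 14.0, 16.0]
# Naive baseline: always predict the mean. In practice the mean would come
# from the training data rather than from the evaluation labels.
baseline_pred = [mean(y_true)] * len(y_true)
candidate_pred = [10.5, 11.5, 14.5, 15.5]

deploy_candidate = passes_quality_gate(y_true, candidate_pred, baseline_pred)
deploy_baseline_copy = passes_quality_gate(y_true, baseline_pred, baseline_pred)
```

Here the baseline's error is 2.0 and the candidate's is 0.5, so the candidate passes, while a model that merely matches the baseline does not.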

Written by Murk Sindhya Memon

July 5, 2023

Machine Learning

Guest Blog

Learn to deploy machine learning models to a web app or REST API with Saturn Cloud

Data science model deployment can sound intimidating if you have never had a chance to try it in a safe space. Do you want to make a REST API or a full frontend app? What does it take to do either of these? It’s not as hard as you might think.

In this series, we’ll go through how you can take machine learning models and deploy them to a web app or a REST API (using Saturn Cloud) so that others can interact with them. In this app, we’ll let the user make some feature selections and then the model will predict an outcome for them. But using this same idea, you could easily do other things, such as letting the user retrain the model, upload things like images, or conduct other interactions with your model.

Just to be interesting, we’re going to do this same project with two frameworks, Voilà and Flask, so you can see how they both work and decide what’s right for your needs. In Flask, we’ll create a REST API and a web app version.

Our toolkit

Saturn Cloud (so you can deploy easily!)
Flask
Voilà
plotly (Python and JS)
scikit-learn (for our model)

The project – Deploying machine learning models

The first steps of our process are exactly the same, whether we are going for Voilà or Flask. We need to get some data and build a model! I will take the US Department of Education’s College Scorecard data, and build a quick linear regression model that accepts a few inputs and predicts a student’s likely earnings 2 years after graduation. (You can get this data yourself at https://collegescorecard.ed.gov/data/)

About measurements

According to the data codebook: “the cohort of evaluated graduates for earnings metrics consists of those individuals who received federal financial aid, but excludes those who were subsequently enrolled in school during the measurement year, died before the end of the measurement year, received a higher-level credential than the credential level of the field of the study measured, or did not work during the measurement year.”

Load data

I already did some data cleaning and uploaded the features I wanted to a public bucket on S3, for easy access. This way, I can load it quickly when the app is run.

Format for training

Once we have the dataset, this is going to give us a handful of features and our outcome. We just need to split it between features and target with scikit-learn to be ready to model. (Note that all of these functions will be run exactly as written in each of our apps.)

Our features are:

Region: geographic location of college
Locale: type of city or town the college is in
Control: type of college (public/private/for-profit)
Cipdesc_new: major field of study (CIP code)
Creddesc: credential (bachelor, master, etc.)
Adm_rate_all: admission rate
Sat_avg_all: average SAT score for admitted students (proxy for college prestige)
Tuition: cost to attend the institution for one year

Our target outcome is earn_mdn_hi_2yr: median earnings measured two years after completion of degree.

Train model

We are going to use scikit-learn’s pipeline to make our feature engineering as easy and quick as possible. We’re going to return a trained model as well as the r-squared value for the test sample, so we have a quick and straightforward measure of the model’s performance on the test set that we can return along with the model object.
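The original code for this step is not shown here, so the following is a hedged sketch of what such a training function could look like rather than the author's implementation. It assumes pandas and scikit-learn are installed, uses synthetic data in place of the cleaned College Scorecard extract, and keeps only two of the features (`region` and `tuition`) for brevity. The helper names `make_synthetic_data` and `train_model` are hypothetical.

```python
import random

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_synthetic_data(n_rows=400, seed=0):
    """Synthetic stand-in for the cleaned dataset (values are made up)."""
    rng = random.Random(seed)
    region_effect = {"northeast": 10_000.0, "midwest": 5_000.0, "south": 0.0}
    rows = []
    for _ in range(n_rows):
        region = rng.choice(list(region_effect))
        tuition = rng.uniform(5_000, 50_000)
        earnings = 30_000 + region_effect[region] + 0.5 * tuition + rng.gauss(0, 1_000)
        rows.append({"region": region, "tuition": tuition, "earn_mdn_hi_2yr": earnings})
    return pd.DataFrame(rows)


def train_model(df, target="earn_mdn_hi_2yr"):
    """Fit a preprocessing + regression pipeline; return it with test-set R^2."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    preprocess = ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ("numeric", StandardScaler(), ["tuition"]),
    ])
    model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
    model.fit(X_train, y_train)
    # For regressors, score() returns the R^2 on the data passed in.
    return model, model.score(X_test, y_test)


model, r2 = train_model(make_synthetic_data())
example = pd.DataFrame([{"region": "south", "tuition": 20_000.0}])
predicted_earnings = float(model.predict(example)[0])
```

Keeping the encoding and scaling inside the pipeline means the app can pass raw user selections straight to `model.predict` without repeating the feature engineering.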

Now we have a model, and we’re ready to put together the app! All these functions will be run when the app runs, because it’s so fast that it doesn’t make sense to save out a model object to be loaded. If your model doesn’t train this fast, save your model object and return it in your app when you need to predict.
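For slower-training models, saving and reloading can be sketched with Python's built-in `pickle` module. The dictionary of parameters below is a hypothetical stand-in for a trained model object, and the file name is an assumption; a fitted scikit-learn pipeline can be stored the same way.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Write the trained model object to disk once, after training."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load the stored model at app start-up instead of retraining."""
    # Only unpickle files you created yourself; pickle can execute code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict(model, tuition):
    return model["intercept"] + model["slope"] * tuition


# Hypothetical learned parameters standing in for a real model object.
trained = {"intercept": 30_000.0, "slope": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    save_model(trained, model_path)
    restored = load_model(model_path)

prediction = predict(restored, 20_000.0)
```

The app then only pays the loading cost at start-up and can answer each prediction request from the restored object.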

If you’re interested in learning some valuable tips for machine learning projects, read our blog on machine learning project tips.

Visualization

In addition to building a model and creating predictions, we want our app to show a visual of the prediction against a relevant distribution. The same plot function can be used for both apps, because we are using plotly for the job.

The function below accepts the type of degree and the major, to generate the distributions, as well as the prediction that the model has given. That way, the viewer can see how their prediction compares to others. Later, we’ll see how the different app frameworks use the plotly object.

This is the general visual we’ll be generating — but because it’s plotly, it’ll be interactive!

You might be wondering whether your favorite visualization library could work here. The answer is, maybe! Every Python viz library has idiosyncrasies and is not likely to be supported exactly the same for Voilà and Flask. I chose plotly because it has interactivity and is fully functional in both frameworks, but you are welcome to try your own visualization tool and see how it goes.

Wrapping up

In conclusion, deploying machine learning models to a web app or REST API can seem daunting, but it’s not as difficult as it may seem. By using frameworks like Voilà and Flask, along with libraries like scikit-learn, plotly, and pandas, you can easily create an app that allows users to interact with machine learning models.

In this project, we used the US Department of Education’s College Scorecard data to build a linear regression model that predicts a student’s likely earnings two years after graduation.

Written by Stephanie Kirmer

March 3, 2023

Machine Learning
