

In today’s data-driven era, organizations expect more than static dashboards or descriptive analytics. They demand forecasts, predictive insights, and intelligent decision-making support. Traditionally, delivering this has required piecing together multiple tools: data lakes for storage, notebooks for model training, separate platforms for deployment, and BI tools for visualization. 

Microsoft Fabric reimagines this workflow. It brings every stage of the machine learning lifecycle, from data ingestion and preparation to model training, deployment, and visualization, into a single, governed environment. In this blog, we’ll explore how Microsoft Fabric empowers data scientists to streamline the end-to-end ML process and unlock predictive intelligence at scale. 

To go deeper into forecasting vs. inference, explore how predictive analytics and AI intersect in this Predictive Analytics vs. AI article.

Data Science in Microsoft Fabric

Why Choose Microsoft Fabric for Modern Data Science Workflows?


End-to-End Unification

One platform for data ingestion, preparation, model training, deployment, and data visualization. Microsoft Fabric offers a wide range of capabilities across the entire data science process, empowering users to build end-to-end data science workflows within a single platform. 

Scalability

Spark-based distributed compute enables seamless handling of large datasets and complex machine learning models. With built-in support for Apache Spark, you can tap into Spark’s efficiency through Spark batch job definitions or interactive Fabric notebooks. 

MLflow integration 

Autologging captures runs, metrics, and parameters, making it easy to compare different models and experiments without manual tracking. 

AutoML (low-code)

With Fabric’s low-code AutoML interface, users can easily get started with machine learning tasks, while the platform automates most of the workflow with minimal manual effort. 

AI-powered Copilot

Copilot in Microsoft Fabric saves data scientists time and effort and makes data science accessible to everyone. It offers helpful suggestions, assists in writing and fixing code, and helps you analyze and visualize data. 

Governance & Compliance

Features like role-based access, lineage tracking, and model versioning in Microsoft Fabric enable teams to reproduce models, trace issues efficiently, and maintain full transparency across the data science lifecycle. 

Explore a concrete Azure-based predictive modeling example

Advanced Machine Learning Lifecycle in Microsoft Fabric 

Microsoft Fabric offers capabilities to support every step of the machine learning lifecycle in one governed environment. Let’s explore how each step is supported by powerful features in Fabric: 

Machine Learning Lifecycle in Microsoft Fabric
source: learn.microsoft.com

1. Data Ingestion & Exploration

  • OneLake acts as the single source of truth, storing all data in Delta format with support for versioning, schema evolution, and ACID transactions. Fabric is standardized on Delta Lake, which means all Fabric engines can interact with the same dataset stored in a Lakehouse. This eliminates the overhead of managing separate data lakes and warehouses. 
  • Fabric notebooks with Spark pools provide distributed compute for profiling, visualization, and correlations at scale. 
  • Lakehouse: Fabric notebooks can ingest data from various sources, such as a Lakehouse, a Data Warehouse, or a semantic model. You can simply store your data in a Lakehouse attached to the notebook and then read from or write to it using a local path (see the sketch below). 

Data Ingestion - Microsoft Fabric
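
To make this concrete, here is a minimal sketch of reading a file from the attached (default) Lakehouse in a Fabric notebook. The file path and column layout are hypothetical; the built-in spark session, the display() helper, and the relative Files/ path are what Fabric notebooks expose for the attached Lakehouse.

# Minimal sketch: read raw data from the attached Lakehouse in a Fabric notebook
# "Files/raw/sales.csv" is a hypothetical path under the Lakehouse Files section
df = spark.read.format("csv").option("header", "true").load("Files/raw/sales.csv")

df.printSchema()
display(df.limit(10))  # display() is available in Fabric notebooks for quick inspection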

  • Environments: You can create an environment and enable it for multiple notebooks. It ensures reproducibility by packaging runtimes, libraries, and dependencies.

Explore top AI tools for data analytics

2. Data Cleaning & Feature Engineering

  • Pandas on Spark lets data scientists apply familiar syntax while scaling workloads across Spark clusters to prepare data for training. You can perform data profiling and visualization efficiently on large amounts of data (a brief sketch follows below). 

Data Cleaning & Feature Engineering - Data Science in Microsoft Fabric
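
As a rough illustration, the pandas-on-Spark API lets you profile a large table with pandas-style calls; the table name below is hypothetical and assumes a Delta table registered in the attached Lakehouse.

import pyspark.pandas as ps

# Load a Lakehouse Delta table (hypothetical name) as a pandas-on-Spark DataFrame
psdf = ps.read_table("sales_raw")

# Familiar pandas-style profiling, executed on Spark under the hood
print(psdf.shape)
print(psdf.describe())
print(psdf.isnull().sum())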

  • Data Wrangler offers an interactive interface for tasks such as imputing missing values, and it generates reusable PySpark code for each operation, which supports auditability. With GenAI in Data Wrangler, you also get AI-powered suggestions for which transformations to apply.  

Data Wrangler - Microsoft Fabric

  • Feature Engineering can also be easily performed using Data Wrangler. It offers direct options to perform encoding and normalize features without requiring you to write any code. 

Feature Engineering - Microsoft Fabric

  • Copilot integration accelerates preprocessing with AI-powered suggestions and code generation.  
  • Processed features can be written back into OneLake as Delta tables, shareable across projects and teams (see the sketch below). 
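
For instance, a finished feature set could be persisted back to the Lakehouse as a managed Delta table along these lines. The DataFrame, column names, and table name below are hypothetical, continuing from the earlier read sketch.

# Hypothetical feature engineering step on the Spark DataFrame loaded earlier
features_df = df.withColumnRenamed("amount", "sale_amount").dropna()

# Persist engineered features to OneLake as a Delta table for reuse across teams
features_df.write.format("delta").mode("overwrite").saveAsTable("sales_features")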

Data Science in Microsoft Fabric

Understand core analysis methods behind predictive models

3. Model Training & Experimentation

  • MLflow autologging can be enabled so that the input parameters and output metrics of a machine learning model are captured automatically as it trains. This information is logged to your workspace, where it can be accessed and visualized through the MLflow APIs or the corresponding experiment, reducing manual effort and ensuring consistency (a minimal sketch follows below). 

MLflow Autologging - Microsoft Fabric
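
A minimal sketch of what this looks like in a notebook, using scikit-learn and synthetic data; the experiment and run names are hypothetical, and Fabric workspaces typically have autologging enabled by default.

import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

mlflow.set_experiment("house-price-experiment")  # hypothetical experiment name
mlflow.autolog()  # explicit call; Fabric usually turns this on for you

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)          # parameters and training metrics are captured automatically
    print(model.score(X_test, y_test))   # holdout score for a quick sanity check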

  • Frameworks: Choose Spark MLlib for distributed training, scikit-learn or XGBoost for tabular tasks, or PyTorch/TensorFlow for deep learning. 
  • Hyperparameter tuning: The FLAML library supports lightweight, cost-efficient tuning strategies. SynapseML, a distributed machine learning library, can also be used in Microsoft Fabric notebooks to identify the best combination of hyperparameters (see the sketch after this list). 
  • Experiments & Runs: Microsoft Fabric integrates MLflow for experiment tracking.  
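
As referenced in the tuning bullet above, here is a minimal FLAML sketch; the dataset, time budget, metric, and estimator list are arbitrary choices for illustration.

from flaml import AutoML
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

automl = AutoML()
automl.fit(
    X_train=X,
    y_train=y,
    task="regression",
    metric="r2",
    time_budget=60,                            # seconds to spend searching
    estimator_list=["lgbm", "rf", "xgboost"],  # restrict the search space
)
print(automl.best_estimator, automl.best_config)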

Experiment Tracking - Microsoft Fabric

  • Within an experiment, runs are grouped together for simplified tracking and comparison. Data scientists can compare runs to select the model with the best-performing parameters. Runs can be visualized, searched, and compared, with full metadata available for export or further analysis. 

Collection of Runs - Microsoft Fabric

  • Model versioning: model run iterations can be registered with tags and metadata, providing traceability and governance across versions. 

Model Versioning - Microsoft Fabric

  • AutoML: a low-code interface generates preconfigured notebooks for tasks like classification, regression, or forecasting. It performs the machine learning steps automatically, from data transformation and model definition through training. These notebooks also leverage MLflow logging to capture parameters and metrics automatically, completely automating the machine learning lifecycle. 

AutoML - Microsoft Fabric

4. Model Evaluation & Selection

  • Notebook visualizations such as ROC curves, confusion matrices, and regression error plots provide immediate insight. 
  • Experiment dashboards make it simple to compare models side by side, highlighting the best-performing candidate. 
  • PREDICT function can be used during evaluation to generate test predictions at scale. You can use this function to generate batch predictions directly from a Microsoft Fabric notebook or from the item page of a given ML model.  

Model Evaluation - Microsoft Fabric

  • You can select the specific model version you need to score, copy the generated code template into a notebook, and customize the parameters yourself.  
  • Alternatively, you can use the guided UI experience, which generates the PREDICT code for you through the “Apply this model” wizard (a minimal code sketch follows the screenshots below). 

Model Evaluation GUI Version - Microsoft Fabric
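
For reference, here is a minimal sketch of batch scoring with the Fabric PREDICT (MLFlowTransformer) API from a notebook. The registered model name, version, feature columns, and output table are all hypothetical.

from synapse.ml.predict import MLFlowTransformer

# Hypothetical registered model and feature columns
model = MLFlowTransformer(
    inputCols=["sqft", "bedrooms", "year_built"],
    outputCol="predicted_price",
    modelName="house-price-model",
    modelVersion=1,
)

# A small hypothetical test set; in practice this would be a large Spark DataFrame
test_df = spark.createDataFrame(
    [(1500, 3, 1998), (2200, 4, 2010)],
    ["sqft", "bedrooms", "year_built"],
)

scored_df = model.transform(test_df)
scored_df.write.format("delta").mode("overwrite").saveAsTable("house_price_predictions")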

For a forward-looking view of how intelligent systems can autonomously analyze and act, explore agentic analytics in our companion piece on Agentic Analytics.

5. Consumption & Visualization

  • Power BI integration makes predictions stored in OneLake available to analysts with no extra data movement.  

Power BI Integration - Microsoft Fabric

  • Direct Lake mode ensures low latency querying of large Delta tables, keeping dashboards fast and responsive even at enterprise scale. 
  • Semantic Link is a feature that establishes a connection between semantic models and Synapse Data Science in Microsoft Fabric. Through Semantic Link (preview), data scientists can use Power BI semantic models in notebooks via the SemPy Python library or Spark (in Python, R, SQL, and Scala) to perform tasks such as in-depth statistical analysis and predictive modeling with machine learning. The output data can then be stored in OneLake, where Power BI can consume it (a minimal sketch follows the figure below). 
Semantic Link - Microsoft Fabric
source: learn.microsoft.com
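
A minimal sketch of reading a Power BI semantic model with the SemPy library from a Fabric notebook follows. The model and table names are hypothetical, and since Semantic Link is in preview, the exact API surface may change.

import sempy.fabric as fabric

# List tables in a (hypothetical) semantic model, then read one into a DataFrame
print(fabric.list_tables(dataset="Sales Model"))

orders = fabric.read_table(dataset="Sales Model", table="Orders")
print(orders.head())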

 

6. Monitoring & Control

Models are assets that require governance and continuous maintenance. 

  • Automated retraining pipelines can be triggered on a schedule or in response to a drop in a specific metric. 
  • Versioning and lineage tracking make it clear which combination of data, code, and parameters produced any given model and the dependency of each ML item. 
  • Machine learning experiments and models are integrated with the lifecycle management capabilities in Microsoft Fabric. 
  • Microsoft Fabric deployment pipelines can track ML artifacts across development, test, and production workspaces while preserving experiment runs and model versions. Metadata and lineage between notebooks, experiments, and models are maintained. 
  • In Microsoft Fabric, ML experiments and models are also synced via Git integration, but experiment runs and model versions remain in workspace storage and aren’t versioned in Git. Git tracks only artifact metadata (display name, version, and dependencies), not the underlying data. Lineage between notebooks, experiments, and models is preserved across Git-connected workspaces, ensuring traceability. 
  • Access controls in Fabric provide fine-grained permissions for models, experiments, and workspaces, ensuring responsible collaboration. You can grant teams controlled access to only the items and data relevant to their department. 

Beyond ML: Other Data Science Capabilities in Microsoft Fabric 

Besides ML workflows, Fabric also empowers organizations to build AI-driven solutions: 

  • Data Agents: A newly introduced feature, Data Agents let you create conversational Q&A systems tailored to your organization’s data in OneLake. They are powered by Azure OpenAI Assistant APIs and can access multiple sources such as Lakehouse, Warehouse, Power BI datasets, and KQL databases. You can customize them with specific instructions and examples so they align with organizational needs. The process is iterative: as you refine performance, you can publish the agent, generating a read-only version to share across teams. 
Data Agents - Microsoft Fabric
source: learn.microsoft.com
  • LLM-powered Applications: Fabric integrates seamlessly with Azure OpenAI Service and SynapseML, making it possible to run large-scale natural language workflows directly on Spark. Instead of handling prompts one by one, Fabric enables distributed processing of millions of prompts in parallel. This makes it practical to deploy LLMs for enterprise-scale use cases such as summarization, classification, and question answering. 
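
As an illustration, a distributed prompt-completion job with SynapseML might look roughly like the sketch below. The resource name, deployment name, and key are placeholders, and the module path follows recent SynapseML releases (older versions use synapse.ml.cognitive.openai), so treat this as a sketch rather than a drop-in recipe.

from synapse.ml.services.openai import OpenAICompletion

# Placeholders for an Azure OpenAI resource, deployment, and key
completion = (
    OpenAICompletion()
    .setCustomServiceName("my-openai-resource")
    .setDeploymentName("my-gpt-deployment")
    .setSubscriptionKey("<azure-openai-key>")
    .setPromptCol("prompt")
    .setOutputCol("completion")
    .setMaxTokens(100)
)

# A Spark DataFrame of prompts; in practice this could hold millions of rows
prompts = spark.createDataFrame(
    [("Summarize: Microsoft Fabric unifies data engineering, data science, and BI.",)],
    ["prompt"],
)

# Partitions are scored in parallel; the raw API response lands in the "completion" column
results = completion.transform(prompts)
results.select("prompt", "completion").show(truncate=False)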

Conclusion: Unlocking Predictive Intelligence with Fabric 

Microsoft Fabric isn’t just another data platform; it’s a game-changer for data science teams. By eliminating silos between storage, experimentation, deployment, and visualization, Fabric empowers organizations to move faster from raw data to business impact. Whether you’re a data scientist building custom models or an analyst looking to leverage interactive reports, Fabric provides the tools to scale predictive insights across your enterprise. 

The future of data science is unified, governed, and intelligent, and Microsoft Fabric is paving the way. 

Ready to build the next generation of agentic AI?
Explore our Large Language Models Bootcamp and Agentic AI Bootcamp for hands-on learning and expert guidance.

October 17, 2025

Data Science Dojo is offering Meltano CLI for FREE on Azure Marketplace, preconfigured with Meltano, a platform that provides flexibility and scalability. It has four key characteristics: it is customizable, observable with a full view of pipeline logs and statistics, testable, and versionable, so changes can be tracked and easily rolled back if needed. 

Installing the technology yourself, and then looking after integration and dependency issues, is a tiring process. Already feeling tired? Resolving installation errors can be just as confusing. Not to worry: Data Science Dojo’s Meltano CLI instance fixes all of that. But before we delve further into it, let us get to know some basics.  

What is Meltano? 

Meltano is an open-source Command Line Interface (CLI) tool that offers a flexible and scalable solution for Extract, Load, and Transform (ELT) processes. It is designed to assist data engineers in transforming, converting, and validating data in a simplified manner while ensuring accuracy and reliability.

The Meltano CLI can efficiently handle complex data engineering tasks, providing a user-friendly interface that simplifies the ELT process. It can also integrate with different data sources, enabling users to extract data from various sources, load it into a target destination, and transform it according to their specific requirements.

In addition, it offers a range of plugins that extend its capabilities and allow users to customize their ELT workflows. These plugins include extractors, loaders, and transformers, among others.

Challenges for individuals

Before Meltano CLI, there were several challenges associated with data integration that made the process difficult and time-consuming. Here are a few of the main challenges: 

  • Lack of Standardization: Data integration tools were often proprietary, which made it difficult to integrate different tools and workflows. This meant that organizations often had to use multiple tools to complete a data integration project. 
  • Complexity: Many data integration tools were complex and required extensive knowledge of programming and data architecture to use effectively. This made it difficult for non-technical users to participate in data integration projects. 
  • Scalability: As data volumes grew, many data integration tools struggled to handle the scale of the data. This led to slow and inefficient data integration processes. 
  • Cost: Many data integration tools were expensive, which made them inaccessible for smaller organizations with limited budgets. 
  • Limited Customization: Many data integration tools offered limited customization options, which made it difficult to adapt the tool to fit the unique needs of an organization.

 

All in all, Meltano was designed to address many of these challenges by providing an open-source, flexible, and user-friendly tool that can be customized to fit the unique requirements of users.

Meltano CLI for ELT
Meltano CLI for ELT – Data Science Dojo

Why Meltano? 

Meltano CLI stands out as a data engineering tool. It provides flexibility and scalability, and it has four key characteristics: it is customizable, observable with a full view of pipeline logs and statistics, testable, and versionable, so changes can be tracked and easily rolled back if needed.

Meltano CLI solves many of these struggles, which makes it a compelling choice for many users, including: 

  1. Open-source: It is free and open-source, which means that users can download, use, and modify the source code as per their needs. 
  2. Easy-to-use: It is designed to be easy to use with a simple command-line interface and intuitive user interface. Users can easily configure, execute, and monitor data integration pipelines. 
  3. Customizable: Meltano CLI offers a high degree of customization, allowing users to define custom transformations, connectors, and integrations. 
  4. Modern stack: It is built using modern open-source technologies such as Python, Flask, and Vue.js, making it easy to extend and integrate with other tools. 
  5. GitLab Integration: Meltano CLI is developed by GitLab, which means it can be easily integrated with GitLab for version control, collaboration, and continuous integration and deployment (CI/CD). 


Overall, Meltano CLI is a powerful and flexible data integration tool that offers a unique set of features and benefits that may make it a good choice for certain data integration projects. However, the choice of tool ultimately depends on the specific needs and requirements of the project at hand.
 

Integrations

MeltanoHub is the primary location to find all plugins, including Singer taps and targets. It serves as a single source of truth for users, making it easy to discover and use plugins within Meltano. Additionally, users can contribute to the Hub by adding more plugins, which are immediately accessible.

The Hub is maintained by Meltano and the broader community, ensuring that it is continuously curated and up to date. This centralized platform simplifies the process of finding and using plugins, enabling users to enhance their data engineering workflows with ease. 

Key features

Meltano CLI includes several features, including: 

  • Easy to setup and easy to use 
  • Pipeline creation and management 
  • Extract, transform, and load (ETL) processes 
  • Plugin management 
  • Visualization 
  • Configuration management 
  • Version control 
  • Testability 
  • Integration with other tools: It seamlessly integrates with other tools such as dbt, Singer, and Airflow, among others, to enhance your workflow.

What Data Science Dojo has for you?

Azure Virtual Machine is preconfigured with CLI plug-and-play functionality, so you do not have to worry about setting up the environment. 

  • Features include a zero-setup CLI platform that offers a high degree of customization, allowing users to define custom transformations, connectors, and integrations. It is designed to be easy to use with a simple command-line interface and intuitive user interface.
  • Meltano CLI helps you efficiently transform, convert, and validate your data using a simplified process for data engineering, with the assurance of accuracy and reliability. 

 

And there are many others, which you can check by taking a quick peek here: Meltano CLI on Azure Marketplace. What sets it apart from others is that it is an open-source, flexible, and scalable CLI for ELT+. It is customizable. It is also observable, providing a full view with detailed pipeline logs and statistics, and allows inspection of code for debugging. Meltano is versionable, which allows easy tracking and rollback of changes. It is testable and only deploys to production once everything is green. 

Moreover, Meltano CLI is a powerful and flexible data integration tool that offers many benefits over other tools on the market. Its open-source nature, ease of use, integration with other tools, reconfigurability, and community support make it a compelling choice for data integration projects. 

Conclusion  

The Meltano CLI comes with pre-configured Ubuntu 20.04 and a ready-to-use project, allowing for a plug-and-play experience without any setup required. By using Azure, the fault tolerance of data pipelines is increased, resulting in higher performance and faster content delivery.

The Meltano CLI provides an open-source, flexible, and scalable CLI for ELT+, allowing for efficient data transformation, conversion, and validation with accuracy and reliability. When combined with Microsoft Azure services, Meltano outperforms traditional methods by performing data-intensive computations in the cloud. Collaboration and sharing of notebooks with stakeholders is also possible.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free project environment dedicated specifically to data integration and ELT on Azure Marketplace. Do not wait to install this offer by Data Science Dojo, your ideal companion in your journey to learn data science! 

Try Now

 

Written by Insiyah Talib

March 15, 2023

Data Science Dojo is offering Memphis broker for FREE on Azure Marketplace, preconfigured with Memphis, a platform that provides a P2P architecture, scalability, storage tiering, fault tolerance, and security, enabling real-time processing for modern applications that handle large volumes of data. 

Introduction

It is a cumbersome and tiring process to install Docker first and then install Memphis, and then look after the integration and dependency issues. Are you already feeling tired? Resolving installation errors can be just as confusing. Not to worry: Data Science Dojo’s Memphis instance fixes all of that. But before we delve further into it, let us get to know some basics.  

 


 

What is Memphis? 

Memphis is an open-source modern replacement for traditional messaging systems. It is a cloud-based messaging system with a comprehensive set of tools that makes it easy and affordable to develop queue-based applications. It is reliable, can handle large volumes of data, and supports modern protocols. It requires minimal operational maintenance and allows for rapid development, resulting in significant cost savings and reduced development time for data-focused developers and engineers. 

Challenges for Individuals

Traditional messaging brokers, such as Apache Kafka, RabbitMQ, and ActiveMQ, have been widely used to enable communication between applications and services. However, there are several challenges with these traditional messaging brokers: 

  1. Scalability: Traditional messaging brokers often have limitations on their scalability, particularly when it comes to handling large volumes of data. This can lead to performance issues and message loss. 
  2. Complexity: Setting up and managing a traditional messaging broker can be complex, particularly when it comes to configuring and tuning it for optimal performance.
  3. Single Point of Failure: Traditional messaging brokers can become a single point of failure in a distributed system. If the messaging broker fails, it can cause the entire system to go down. 
  4. Cost: Traditional messaging brokers can be expensive to deploy and maintain, particularly for large-scale systems. 
  5. Limited Protocol Support: Traditional messaging brokers often support only a limited set of protocols, which can make it challenging to integrate with other systems and technologies. 
  6. Limited Availability: Traditional messaging brokers can be limited in terms of the platforms and environments they support, which can make it challenging to use them in certain scenarios, such as cloud-based systems.

Overall, these challenges have led to the development of new messaging technologies, such as event streaming platforms, that aim to address these issues and provide a more flexible, scalable, and reliable solution for modern distributed systems.  

Memphis As a Solution

Why Memphis? 

“It took me three minutes to build in Memphis what took me a week and a half in Kafka.” Memphis and traditional messaging brokers are both software systems that facilitate communication between different components or systems in a distributed architecture. However, there are some key differences between the two: 

  1. Architecture: It uses a peer-to-peer (P2P) architecture, while traditional messaging brokers use a client-server architecture. In a P2P architecture, each node in the network can act as both a client and a server, while in a client-server architecture, clients send messages to a central server which distributes them to the appropriate recipients. 
  2. Scalability: It is designed to be highly scalable and can handle large volumes of messages without introducing significant latency, while traditional messaging brokers may struggle to scale to handle high loads. This is because Memphis uses a distributed hash table (DHT) to route messages directly to their intended recipients, rather than relying on a centralized message broker. 
  3. Fault tolerance: It is highly fault-tolerant, with messages automatically routed around failed nodes, while traditional messaging brokers may experience downtime if the central broker fails. This is because it uses a distributed consensus algorithm to ensure that all nodes in the network agree on the state of the system, even in the presence of failures. 
  4. Security: Memphis provides end-to-end encryption by default, while traditional messaging brokers may require additional configuration to ensure secure communication between nodes. This is because it is designed to be used in decentralized applications, where trust between parties cannot be assumed. 

Overall, while both Memphis and traditional messaging brokers facilitate communication between different components or systems, they have different strengths and weaknesses and are suited to different use cases. It is ideal for highly scalable and fault-tolerant applications that require end-to-end encryption, while traditional messaging brokers may be more appropriate for simpler applications that do not require the same level of scalability and fault tolerance. 

What Struggles does Memphis Solve?

Handling too many data sources can become overwhelming, especially with complex schemas. Analyzing and transforming streamed data from each source is difficult, and it requires using multiple applications like Apache Kafka, Flink, and NiFi, which can delay real-time processing.

Additionally, there is a risk of message loss due to crashes, lack of retransmits, and poor monitoring. Debugging and troubleshooting can also be challenging. Deploying, managing, securing, updating, onboarding, and tuning message queue systems like Kafka, RabbitMQ, and NATS is a complicated and time-consuming task. Transforming batch processes into real-time can also pose significant challenges.

Integrations:

Memphis Broker provides several integration options for connecting to diverse types of systems and applications. Here are some of the integrations available in Memphis Broker: 

  • JMS (Java Message Service) Integration 
  • .NET Integration 
  • REST API Integration 
  • MQTT Integration 
  • AMQP Integration 
  • Apache Camel, Apache ActiveMQ, and IBM WebSphere MQ. 

Key features: 

  • Fully optimized message broker in under 3 minutes 
  • Easy-to-use UI, CLI, and SDKs 
  • Dead-letter station (DLQ) 
  • Data-level observability 
  • Runs on your Docker or Kubernetes
  • Real-time event tracing 
  • SDKs: Python, Go, Node.js, TypeScript, NestJS, Kotlin, .NET, Java (see the Python sketch after this list) 
  • Embedded schema management using Protobuf, JSON Schema, GraphQL, Avro 
  • Slack integration
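
To give a feel for the SDKs mentioned above, here is a rough producer sketch using the memphis-py package. The host, credentials, and station name are placeholders, and the exact connection parameters may differ by SDK version and deployment, so treat this as a sketch rather than a verified recipe.

import asyncio
from memphis import Memphis

async def main():
    memphis = Memphis()
    # Placeholder connection details for a Memphis deployment
    await memphis.connect(host="<memphis-host>", username="<username>", connection_token="<token>")

    # Produce a single message to a hypothetical station
    producer = await memphis.producer(station_name="orders", producer_name="orders-producer")
    await producer.produce(bytearray("hello memphis", "utf-8"))

    await memphis.close()

asyncio.run(main())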

 


 

What Data Science Dojo has For You:

Azure Virtual Machine is preconfigured with plug-and-play functionality, so you do not have to worry about setting up the environment. Features include a zero-setup Memphis platform that enables you to: 

  • Build a dead-letter queue 
  • Create observability 
  • Build a scalable environment 
  • Create client wrappers 
  • Handle back pressure, on the client or queue side 
  • Create a retry mechanism 
  • Configure monitoring and real-time alerts 

It stands out from other solutions because it can be set up in just three minutes, while others can take weeks. It’s great for creating modern queue-based apps with large amounts of streamed data and modern protocols, and it reduces costs and dev time for data engineers. Memphis has a simple UI, CLI, and SDKs, and offers features like automatic message retransmitting, storage tiering, and data-level observability.

Moreover, Memphis is a next-generation alternative to traditional message brokers. A simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables cost-effective, fast, and reliable development of modern queue-based use cases.

Wrapping Up

Memphis comes pre-configured on Ubuntu 20.04, so users do not have to set up anything; it is a plug-and-play environment. Running it in the cloud guarantees high availability, as data can be distributed across multiple data centers and availability zones on the go. In this way, Azure increases the fault tolerance of data pipelines.

The power of Azure ensures maximum performance and high throughput for the server to deliver content at low latency and faster speeds. It is designed to provide a robust messaging system for modern applications, along with high scalability and fault tolerance.

The flexibility, performance, and scalability provided by the Azure virtual machine make it possible for Memphis to offer a production-ready message broker in under 3 minutes, with durability, stability, and efficient performance. 

When coupled with Microsoft Azure services and processing speed, it outperforms its traditional counterparts because data-intensive computations are not performed locally, but in the cloud. You can also collaborate and share notebooks with various stakeholders within and outside the company while monitoring the status of each. 

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Memphis instance, dedicated specifically to highly scalable and fault-tolerant applications that require end-to-end encryption, on Azure Marketplace. Do not wait to install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!

 

 

Written by Insiyah Talib

 

Explore a hands-on curriculum that helps you build custom LLM applications!

 

March 9, 2023

In this step-by-step guide, learn how to deploy a web app for Gradio on Azure with Docker. This blog covers everything from Azure Container Registry to Azure Web Apps, with a step-by-step tutorial for beginners.

I was searching for ways to deploy a Gradio application on Azure, but there wasn’t much information to be found online. After some digging, I realized that I could use Docker to deploy custom Python web applications, which was perfect since I had neither the time nor the expertise to go through the “code” option on Azure. 

The process of deploying a web app begins by creating a Docker image, which contains all of the application’s code and its dependencies. This allows the application to be packaged and pushed to the Azure Container Registry, where it can be stored until needed.

From there, it can be deployed to the Azure App Service, where it is run as a container and can be managed from the Azure Portal. In this portal, users can adjust the settings of their app, as well as grant access to roles and services when needed. 

Once everything is set and the necessary permissions have been granted, the web app should be able to properly run on Azure. Deploying a web app on Azure using Docker is an easy and efficient way to create and deploy applications, and can be a great solution for those who lack the necessary coding skills to create a web app from scratch!

Comprehensive overview of creating a web app for Gradio

Gradio application 

Gradio is a Python library that allows users to create interactive demos and share them with others. It provides a high-level abstraction through the Interface class, while the Blocks API is used for designing web applications.

Blocks provide features like multiple data flows and demos, control over where components appear on the page, handling complex data flows, and the ability to update properties and visibility of components based on user interaction. With Gradio, users can create a web application that allows their users to interact with their machine learning model, API, or data science workflow. 

The two primary files in a Gradio Application are:

  1. App.py: This file contains the source code for the application.
  2. Requirements.txt: This file lists the Python libraries required for the source code to function properly.
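
For illustration, a minimal app.py along these lines would work with the steps below. This is a sketch rather than the exact file used in the post; the model name matches the one referenced in Step 3, and port 7000 matches the Docker and App Service configuration later on.

import gradio as gr
from transformers import pipeline

# Emotion classification model referenced later in the post
classifier = pipeline(
    "text-classification",
    model="bhadresh-savani/distilbert-base-uncased-emotion",
)

def predict(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

demo = gr.Interface(fn=predict, inputs="text", outputs="text", title="Emotion Detector")

# Bind to all interfaces on port 7000 so the container port mapping works
demo.launch(server_name="0.0.0.0", server_port=7000)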

Docker 

Docker is an open-source platform for automating the deployment, scaling, and management of applications, as containers. It uses a container-based approach to package software, which enables applications to be isolated from each other, making it easier to deploy, run, and manage them in a variety of environments. 

A Docker container is a lightweight, standalone, and executable software package that includes everything needed to run a specific application, including the code, runtime, system tools, libraries, and settings. Containers are isolated from each other and the host operating system, making them ideal for deploying microservices and applications that have multiple components or dependencies. 

Docker also provides a centralized way to manage containers and share images, making it easier to collaborate on application development, testing, and deployment. With its growing ecosystem and user-friendly tools, Docker has become a popular choice for developers, system administrators, and organizations of all sizes. 

Azure Container Registry 

Azure Container Registry (ACR) is a fully managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. It allows you to store, manage, and deploy Docker containers in a secure and scalable way, making it an important tool for modern application development and deployment. 

With ACR, you can store your own custom images and use them in your applications, as well as manage and control access to them with role-based access control. Additionally, ACR integrates with other Azure services, such as Azure Kubernetes Service (AKS) and Azure DevOps, making it easy to deploy containers to production environments and manage the entire application lifecycle. 

ACR also provides features such as image signing and scanning, which helps ensure the security and compliance of your containers. You can also store multiple versions of images, allowing you to roll back to a previous version if necessary. 

Azure Web App 

Azure Web Apps is a fully managed platform for building, deploying, and scaling web applications and services. It is part of the Azure App Service, which is a collection of integrated services for building, deploying, and scaling modern web and mobile applications. 

With Azure Web Apps, you can host web applications written in a variety of programming languages, such as .NET, Java, PHP, Node.js, and Python. The platform automatically manages the infrastructure, including server resources, security, and availability, so that you can focus on writing code and delivering value to your customers. 

Azure Web Apps supports a variety of deployment options, including direct Git deployment, continuous integration and deployment with Visual Studio Team Services or GitHub, and deployment from Docker containers. It also provides built-in features such as custom domains, SSL certificates, and automatic scaling, making it easy to deliver high-performing, secure, and scalable web applications. 

A step-by-step guide to deploying a Gradio application on Azure using Docker

This guide assumes a foundational understanding of Azure and the presence of Docker on your desktop. Refer to Docker’s getting-started instructions for Mac, Windows, or Linux. 

Step 1: Create an Azure Container Registry resource 

Go to Azure Marketplace, search for ‘container registry’, and hit ‘Create’. 

STEP 1: Create an Azure Container Registry resource
Create an Azure Container Registry resource

Under the “Basics” tab, complete the required information and leave the other settings as the default. Then, click “Review + Create.” 

Web App for Gradio Step 1A
Web App for Gradio Step 1A

 

Step 2: Create a Web App resource in Azure 

In Azure Marketplace, search for “Web App”, select the appropriate resource as depicted in the image, and then click “Create”. 

STEP 2: Create a Web App resource in Azure
Create a Web App resource in Azure

 

Under the “Basics” tab, complete the required information, choose the appropriate pricing plan, and leave the other settings as the default. Then, click “Review + Create.”  

Web App for Gradio Step 2B
Web App for Gradio Step 2B

 

Web App for Gradio Step 2C
Web App for Gradio Step 2c

 

Upon completion of all deployments, the following three resources will be in your resource group. 

Web App for Gradio Step 2D
Web App for Gradio Step 2D

Step 3: Create a folder containing the “App.py” file and its corresponding “requirements.txt” file 

To begin, we will utilize an emotion detector application, the model for which can be found at https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion. 

APP.PY 

REQUIREMENTS.TXT 

Step 4: Launch Visual Studio Code and open the folder

Step 4: Launch Visual Studio Code and open the folder. 
Step 4: Launch Visual Studio Code and open the folder.

Step 5: Launch Docker Desktop to start Docker. 

STEP 5: Launch Docker Desktop to start Docker
STEP 5: Launch Docker Desktop to start Docker.

Step 6: Create a Dockerfile 

A Dockerfile is a script that contains instructions to build a Docker image. This file automates the process of setting up an environment, installing dependencies, copying files, and defining how to run the application. With a Dockerfile, developers can easily package their application and its dependencies into a Docker image, which can then be run as a container on any host with Docker installed. This makes it easy to distribute and run the application consistently in different environments. The following contents should be utilized in the Dockerfile: 

DOCKERFILE 

STEP 6: Create a Dockerfile
STEP 6: Create a Dockerfile

Step 7: Build and run a local Docker image 

Run the following commands in the VS Code terminal. 

1. docker build -t demo-gradio-app . 

  • The “docker build” command builds a Docker image from a Docker file. 
  • The “-t demo-gradio-app” option specifies the name, and optionally a tag, for the image in the “name:tag” format. 
  • The final “.” specifies the build context, which is the current directory where the Dockerfile is located.

 

2. docker run -it -d --name my-app -p 7000:7000 demo-gradio-app 

  • The “docker run” command starts a new container based on a specified image. 
  • The “-it” option opens an interactive terminal in the container and keeps the standard input attached to the terminal. 
  • The “-d” option runs the container in the background as a daemon process. 
  • The “--name my-app” option assigns a name to the container for easier management. 
  • The “-p 7000:7000” option maps a port on the host to a port inside the container, in this case, mapping the host’s port 7000 to the container’s port 7000. 
  • The “demo-gradio-app” is the name of the image to be used for the container. 

This command will start a new container with the name “my-app” from the “demo-gradio-app” image in the background, with an interactive terminal attached, and port 7000 on the host mapped to port 7000 in the container. 

Web App for Gradio Step 7A
Web App for Gradio Step 7A

 

Web App for Gradio Step 7B
Web App for Gradio Step 7B

 

To view your local app, navigate to the Containers tab in Docker Desktop and click on the link under Port. 

Web App for Gradio Step 7C
Web App for Gradio Step 7C

Step 8: Tag & Push the Image to Azure Container Registry 

First, enable ‘Admin user’ from the ‘Access Keys’ tab in Azure Container Registry. 

STEP 8: Tag & Push Image to Azure Container Registry
Tag & Push Images to Azure Container Registry

 

Log in to your container registry using the following command; the login server, username, and password can be found in the step above. 

docker login gradioappdemos.azurecr.io

Web App for Gradio Step 8B
Web App for Gradio Step 8B

 

Tag the image for uploading to your registry using the following command. 

 

docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app 

  • The command “docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app” is used to tag a Docker image. 
  • “docker tag” is the command used to create a new tag for a Docker image. 
  • “demo-gradio-app” is the source image name that you want to tag. 
  • “gradioappdemos.azurecr.io/demo-gradio-app” is the new image name with a repository name and optionally a tag in the “repository:tag” format. 
  • This command will create a new tag “gradioappdemos.azurecr.io/demo-gradio-app” for the “demo-gradio-app” image. This new tag can be used to reference the image in future Docker commands. 

Push the image to your registry. 

docker push gradioappdemos.azurecr.io/demo-gradio-app 

  • “docker push” is the command used to upload a Docker image to a registry. 
  • “gradioappdemos.azurecr.io/demo-gradio-app” is the name of the image with the repository name and tag to be pushed. 
  • This command will push the Docker image “gradioappdemos.azurecr.io/demo-gradio-app” to the registry specified by the repository name. The registry is typically a place where Docker images are stored and distributed to others. 
Web App for Gradio Step 8C
Web App for Gradio Step 8C

 

In the Repository tab, you can observe the image that has been pushed. 

Web App for Gradio Step 8D
Web App for Gradio Step 8D

Step 9: Configure the Web App 

Under the ‘Deployment Center’ tab, fill in the registry settings then hit ‘Save’. 

STEP 9: Configure the Web App
Configure the Web App

 

In the Configuration tab, create a new application setting for the website port 7000, as specified in the app.py file, and then hit ‘Save’. 

Web App for Gradio Step 9B
Web App for Gradio Step 9B
Web App for Gradio Step 9C
Web App for Gradio Step 9C

 

Web App for Gradio Step 9D
Web App for Gradio Step 9D

 


Web App for Gradio Step 9E
Web App for Gradio Step 9E

 

After the image extraction is complete, you can view the web app URL from the Overview page. 

 

Web App for Gradio Step 9F
Web App for Gradio Step 9F

 

Web App for Gradio Step 9G
Web App for Gradio Step 9G

Step 10: Pushing the Image to Docker Hub (Optional) 

Here are the steps to push a local Docker image to Docker Hub: 

  • Login to your Docker Hub account using the following command: 

docker login

  • Tag the local image using the following command, replacing [username] with your Docker Hub username and [image_name] with the desired image name: 

docker tag [image_name] [username]/[image_name]

  • Push the image to Docker Hub using the following command: 

docker push [username]/[image_name] 

  • Verify that the image is now available in your Docker Hub repository by visiting https://hub.docker.com/ and checking your repositories. 
Web App for Gradio Step 10A
Web App for Gradio Step 10A

 

Web App for Gradio Step 10B
Web App for Gradio Step 10B

Wrapping it up

In conclusion, deploying a web application using Docker on Azure is an easy and efficient way to create and deploy applications. This method is suitable for those who lack the necessary coding skills to create a web app from scratch. Docker is an open-source platform for automating the deployment, scaling, and management of applications, as containers.

Azure Container Registry is a fully managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. Azure Web Apps is a fully managed platform for building, deploying, and scaling web applications and services. By following the step-by-step guide provided in this article, users can deploy a Gradio application on Azure using Docker.

February 22, 2023

Data Science Dojo is offering Airbyte for FREE on Azure Marketplace packaged with a pre-configured web environment enabling you to quickly start the ELT process rather than spending time setting up the environment. 

 

What is an ELT pipeline?  

An ELT pipeline is a data pipeline that extracts (E) data from a source, loads (L) the data into a destination, and then transforms (T) data after it has been stored in the destination. The ELT process that is executed by an ELT pipeline is often used by the modern data stack to move data from across the enterprise into analytics systems.  

 

ELT process
ELT process

 

In other words, in the ELT approach, the transformation (T) of the data is done at the destination after the data has been loaded. The raw data that contains the data from a source record is stored in the destination as a JSON blob. 

 

Airbyte’s architecture: 

Airbyte is conceptually composed of two parts: platform and connectors. 

The platform provides all the horizontal services required to configure and run data movement operations, for example, the UI, configuration API, job scheduling, logging, alerting, etc., and is structured as a set of microservices. 

Connectors are independent modules that push/pull data to/from sources and destinations. Connectors are built under the Airbyte specification, which describes the interface with which data can be moved between a source and a destination using Airbyte. Connectors are packaged as Docker images, which allows total flexibility over the technologies used to implement them. 

 

Obstacles for data engineers & developers  

Collection and maintenance of data from different sources is itself a hectic task for data engineers and developers. Building a custom ELT pipeline for every data source is a nightmare on top of that; it not only consumes a lot of engineering time but also costs a lot. 

In this scenario, a unified environment to deal with the quick data ingestions from various sources to various destinations would be great to tackle the mentioned challenges.  

 

Methodology of Airbyte 

 Airbyte leverages DBT (data build tool) to manage and create SQL code that is used for transforming raw data in the destination. This step is sometimes referred to as normalization. An abstracted view of the data processing flow is given in the following figure: 

Airbyte methodology
Airbyte methodology

 

It is worth noting that the above illustration displays a core tenet of ELT philosophy, which is that data should be untouched as it moves through the extracting and loading stages so that the raw data is always available at the destination. Since an unmodified version of the data exists in the destination, it can be re-transformed in the future without the need for a resync of data from source systems. 

 

Major features

Airbyte supports hundreds of data sources and destinations including:  

  • Apache Kafka  
  • Azure Event Hub  
  • Paste Data  
  • Other custom sources  

By specifying credentials and adding extensions you can also ingest from and dump to:  

  • Azure Data Lake  
  • Google Cloud Storage  
  • Amazon S3 & Kinesis  

Other major features that Airbyte offers: 

  • High extensibility: Use existing connectors to your needs or build a new one with ease. 
  • Customization: Entirely customizable, starting with raw data or from some suggestion of normalized data. 
  • Full-grade scheduler: Automate your replications with the frequency you need. 
  • Real-time monitoring: Logs all the errors in full detail to help you understand better. 
  • Incremental updates: Automated replications are based on incremental updates to reduce your data transfer costs. 
  • Manual full refresh: Re-syncs all your data to start again whenever you want. 
  • Debugging: Debug and Modify pipelines as you see fit, without waiting. 

 

What does Data Science Dojo provide?   

Airbyte instance packaged by Data Science Dojo serves as a pre-configured ELT pipeline that makes data integration pipelines a commodity without the burden of installation. It offers efficient data migration and supports a variety of data sources and destinations to ingest and dump data.  

Features included in this offer:   

  • Airbyte service that is easily accessible from the web and has a rich user interface. 
  • Easy to operate and user-friendly. 
  • Strong community support due to the open-source platform. 
  • Free to use. 

 

Conclusion  

There are a ton of small services that aren’t supported on traditional data pipeline platforms. If you can’t import all your data, you may only have a partial picture of your business. Airbyte solves this problem through custom connectors that you can build for any platform and make them run quickly. 

Install the Airbyte offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science! 

Click on the button below to head over to the Azure Marketplace and deploy Airbyte for FREE by clicking below:

CTA - Try now  

January 27, 2023

Data Science Dojo is offering Apache Airflow for FREE on Azure Marketplace packaged with a pre-configured web environment of Airflow with various data analytics features.  

  

Introduction:  

In this era of tighter data restrictions, it is more important than ever to understand, analyze, and manage your data throughout its lifecycle. This is harder than ever as data volumes rise and data pipelines get more complicated. A solution is needed: organizations and individuals must have a complete, scalable, easy-to-analyze platform to manage and monitor complex workflows and support several integrations. 

 

What is Apache Airflow?  

Apache Airflow is a powerful open-source tool for authoring, scheduling, and monitoring data and computational workflows. It makes it easier to manage, schedule, and coordinate complicated data pipelines from several sources. 

 

What is DAG? 

A DAG, or Directed Acyclic Graph, in Airflow is a list of all the jobs you wish to execute, arranged to reflect their connections and dependencies. A DAG is defined by a Python script that expresses its structure as code. A DAG contains directed edges (arrows) linking nodes and the paths between them; in research settings, DAGs are also used to encode a priori ideas about the connections among variables in causal structures. Hence, a workflow is represented as a DAG, which consists of discrete units of work called tasks that are ordered according to their relationships and data flows (see the sketch below). 
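
A minimal DAG sketch, with hypothetical task names, using the Airflow 2.x Python API:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data")

def transform():
    print("transforming data")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # the arrow defines the dependency (directed edge)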

 

Apache Airflow Architecture: 

This powerful and scalable workflow scheduling software is made up of four key parts: 

  • Scheduler: The scheduler keeps track of all DAGs and the jobs they are connected to. To start, it frequently checks the list of open tasks. 
  • Web server: The user interface for Airflow is the web server (The default port Apache Airflow listens to is 8080). It displays the status of the jobs, gives the user access to the databases, and lets them read log files from other remote file stores like Microsoft Azure blobs. 
  • Database: To make sure the schedule retains metadata information, the state of the DAGs and the tasks they are connected to, are saved in the database. The scheduler scans each DAG and records essential data, including schedule intervals, run-by-run statistics, and task instances. 
  • Executors: There are various kinds of executors for different use cases. A few examples are SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor. 

  

(With SequentialExecutor, just one task may be carried out at once. No parallel processing is possible. It is useful when testing or debugging. LocalExecutor supports hyperthreading and parallelism. It is excellent for using Airflow on a single node or a local workstation. CeleryExecutor is usually used for managing a distributed Airflow cluster. While using the Kubernetes API, the KubernetesExecutor creates temporary pods for each of the task instances to run in.) 

 

Key features Apache Airflow provides: 

  • Dynamic pipelines: because Airflow pipelines are defined as code, they can be constructed dynamically. 
  • Apache Airflow has a rich user interface that helps users manage their workflows easily. 
  • It offers a separate code view that lets users inspect their DAGs’ code as well.  
  • It allows users to visualize their DAGs in different forms, such as Gantt chart, Tree, and Graph. 
  • With ready-to-use operators in Airflow, users can work with various cloud platforms like Microsoft Azure, AWS (Amazon Web Services), etc. 
  • It allows role-based user management to maintain security and accessibility.

 

Apache Airflow with Azure services: 

Apache Airflow leverages the power of Azure services to make monitoring and managing complex workflows intuitive. With Azure, Airflow also becomes a more scalable platform, enabling users to work in a scalable environment. 

 

Conclusion:  

Other open-source data engineering solutions put intense competition on Apache Airflow, but it remains one of the most robust platforms used by data engineers for orchestrating workflows and pipelines. Users can easily visualize their data pipelines’ dependencies, progress, logs, code, triggered tasks, and success status, all in a single package.  

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We therefore know the importance of data and the encapsulated insights. Through this offer, we are confident that you can analyze, visualize, and query your data in a collaborative environment with greater ease. 

Install the Apache Airflow offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science! 

Click on the button below to head over to the Azure Marketplace and deploy Apache Airflow for FREE by clicking on “Try now.”  

 

CTA - Try now

 

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

December 2, 2022

Data Science Dojo is offering SnowSQL for FREE on Azure Marketplace, packaged with a pre-configured CLI for data manipulation in the Snowflake warehouse. 

 

What is Snowflake? 

Snowflake is a cloud computing-based data cloud platform. It is a user-friendly data warehousing product that supports both ETL and ELT functionalities. It supports multiple data workloads from data warehouses and data lakes to enable data storage, data engineering, processing, and analytics. Further, using Snowflake can help you with better data warehouse jobs in the near future.

It is relatively new, flexible, and easy to use, and it provides a pure cloud, SQL-based warehouse. It is not built on any existing database technology and is highly affordable. 

Challenges for developers 

Execution issues while attempting to load and query information, shortcomings in dealing with a variety of data, the absence of a central source causing conflicting or corrupt data, and poor data sharing were a few of the big obstacles data engineers and developers had to contend with. 

Therefore, a cloud data warehouse capable of storing variable-length records at vast scale, with fault tolerance, availability, and fast processing, can reduce this overhead by a big margin. Ideally, the data should also be transformable using a standard open-source language. 

Snowflake: SnowSQL 

Using SQL, one of data engineers’ favorite languages, developers can now load, transform, and unload data in their Snowflake cloud. SnowSQL is an interactive command-line scripting and query tool that allows users to perform DML and DDL operations from the CLI. It is built with strong security standards and integrates tightly with the Snowflake core architecture.

With this tool, you can connect to your Snowflake account and configure your databases, schemas, and warehouses using simple SQL. Users can import existing databases of the data cloud into local machines and then perform various transform operations. It also allows you to unload or dump your transformed data back into the database containers. 

Snowflake architecture - SnowSQL
Snowflake architecture – Data Science Dojo

What we provide 

Snowflake: SnowSQL packaged by Data Science Dojo serves as a pre-installed CLI environment for the Snowflake data cloud, so developers can easily perform SQL operations on warehouse data without the burden of installation. The offer listed by Data Science Dojo provides the following elements: 

  • A Command Line Interface (CLI) installed with SnowSQL which can be connected to your Snowflake account 
  • Support for standard SQL 
  • Robust integrations 
  • Ability to load and unload data in CSV, XML, JSON, Avro, and other formats 
  • Includes syntax highlighting and auto-complete 

Significant characteristics of SnowSQL 

  • Centralization and Democratization: Snowflake combines warehouses and data lakes into a central data store and democratizes access so clients can process and analyze data more effectively 
  • Smart Data Handling: Snowflake manages exponential volumes, variety, and velocity of data; using SnowSQL, we can perform ETL and ELT operations with ANSI SQL 
  • Secure and Fault Tolerant: SnowSQL secures connections to Snowflake using TLS with OCSP checks, and your data remains available even if one or more nodes go down because the source is fault-tolerant 
  • Recovery and Time Travel: The UNDROP TABLE command restores a dropped table, and the time-travel feature lets you query an earlier version of an object to reverse unwanted updates (see the sketch after this list) 
  • High Elastic Performance: With SnowSQL you can create multiple virtual warehouses of variable sizes, which keeps your ETL processes running smoothly 
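As a quick illustration of the recovery and time-travel behavior listed above, here is a hedged sketch of the corresponding SQL, again issued through the Python connector with placeholder credentials and a hypothetical sales table.

```python
import snowflake.connector

# Placeholder credentials, as in the earlier sketch.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="COMPUTE_WH", database="DEMO_DB", schema="PUBLIC",
)
cur = conn.cursor()
try:
    # Restore a table that was dropped by mistake.
    cur.execute("DROP TABLE sales")
    cur.execute("UNDROP TABLE sales")

    # Time travel: query the table as it looked five minutes (300 seconds) ago.
    cur.execute("SELECT COUNT(*) FROM sales AT(OFFSET => -300)")
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```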

Conclusion 

SnowSQL leverages the power of Azure services and the Snowflake data cloud to load and process large volumes of data with continuous availability, high scalability, and data distribution, all from the comfort of the CLI. In this way, Azure increases the fault tolerance of Snowflake clusters.

The power of Azure ensures maximum performance and high throughput for Snowflake nodes by providing a low-latency network. Since SnowSQL mirrors Snowflake’s storage and compute, the elastic nature of the cloud allows it to load data faster and run a high volume of queries. 

Recovery of data has never been this easy. It also provides optimized query parsing and strong integration. Install the SnowSQL offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science! 

Click on the button below to head over to the Azure Marketplace and deploy SnowSQL for FREE by clicking on “Get it now”. 

 

SnowSQL Package

Note: You’ll have to sign up for Azure, for free, if you do not have an existing account. 

 

 

October 7, 2022

Learn how companies like Zillow predict the value of your home. Build a predictive model using Azure machine learning that estimates the real estate sales price of a house.

The Ames housing dataset includes 81 features and 1460 observations. Each observation represents the sale of a home, and each feature is an attribute describing the house or the circumstances of the sale.

Clone this experiment to build a predictive model

A full copy of this experiment has been posted to the Cortana Intelligence Gallery. Go to the link and click on “open in Studio.”

Preprocessing & data exploration

Drop low-value columns

Begin by identifying features (columns) that add little to no value for predictive modeling. These columns will be dropped using the “select columns from dataset” module.

The following columns were chosen to be “excluded” from the dataset:

Id, Street, Alley, PoolQC, Utilities, Condition2, RoofMatl, MiscVal, PoolArea, 3SsnPorch, LowQualFinSF, MiscFeature, LandSlope, Functional, BsmtHalfBath, ScreenPorch, BsmtFinSF2, EnclosedPorch.

These low-quality features were removed to improve the model’s performance. Low quality here means a lack of representative categories, too many missing values, or noisy features.
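The experiment itself does this with the “select columns from dataset” module; purely as an illustration (not part of the Studio experiment), a rough pandas equivalent might look like this, with the file path as a placeholder.

```python
import pandas as pd

# Hypothetical path to the Ames housing training data.
df = pd.read_csv("ames_train.csv")

# Columns identified as low-value for the model (the same list as above).
low_value_cols = [
    "Id", "Street", "Alley", "PoolQC", "Utilities", "Condition2", "RoofMatl",
    "MiscVal", "PoolArea", "3SsnPorch", "LowQualFinSF", "MiscFeature",
    "LandSlope", "Functional", "BsmtHalfBath", "ScreenPorch", "BsmtFinSF2",
    "EnclosedPorch",
]
df = df.drop(columns=low_value_cols)
```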

Azure Machine Learning – real estate sales price predictive model

 

Define categorical variables

We must now define which values are non-continuous by casting them as categorical. Mathematical approaches for continuous and non-continuous values differ greatly. Nominal categorical features were identified and cast to categorical data types using the metadata editor to ensure proper mathematical treatment by the machine learning algorithm.

The first edit metadata module will cast all strings. The column “MSSubClass” uses numeric integer codes to represent the type of building the house is, and therefore should not be treated as a continuous numeric value but rather a categorical feature. We will use another metadata editor to cast it into a category.
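Continuing the hypothetical pandas sketch from the previous step (it assumes the DataFrame df created above), the two metadata edits could be expressed as:

```python
# Cast all string columns, plus the numeric MSSubClass code, to pandas'
# category dtype, mirroring the two "edit metadata" steps described above.
string_cols = df.select_dtypes(include="object").columns
df[string_cols] = df[string_cols].astype("category")
df["MSSubClass"] = df["MSSubClass"].astype("category")
```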

 

Azure Machine Learning – clean missing data

Clean missing data

Most algorithms are unable to account for missing values, and those that can handle them do so inconsistently. To address this, we must make sure our dataset contains no missing, “null,” or “NA” values. 

Replacing missing values is the most versatile and preferred method because it lets us keep our data and avoids the collateral damage that dropping an entire row would inflict on other columns because of one bad cell. Numerical values can easily be replaced with statistical values such as the mean, median, or mode, while categorical values are commonly replaced with the mode or with a separate category for unknowns.

For simplicity, all categorical missing values were cleaned with the mode and all numeric features were cleaned using the median. To further improve a model’s performance, custom cleaning functions should be tried and implemented on each individual feature rather than a blanket transformation of all columns.
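Continuing the same hypothetical sketch, the blanket median/mode imputation described above might be expressed as:

```python
# Blanket imputation, continuing from the earlier sketch (assumes `df` exists):
# median for numeric columns, mode for categorical columns.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(include="category").columns

df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
```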

 

Azure Machine Learning – edit metadata

Machine learning – Model building

Statistical feature selection

Not every feature, in its current form, is expected to contain predictive value for the model; some may mislead the model or add noise. To filter these out, we compute the Pearson correlation of every feature against the response class (sale price) as a quick measure of predictive strength, keep only the top X strongest features, and leave the rest behind.

This number can be tuned for further model performance increases.
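A rough code equivalent of this filter-based selection, continuing the earlier sketch, could look like the following; it is restricted to numeric features, assumes the response column is named SalePrice as in the public Ames/Kaggle data, and uses k as the tunable “top X”.

```python
# Rank features by absolute Pearson correlation with the response and keep the
# strongest k (continuing from the earlier sketch; assumes `df` exists).
k = 20  # the "top X" in the text; tune for better performance
numeric_df = df.select_dtypes(include="number")
correlations = numeric_df.corr(method="pearson")["SalePrice"].abs()
top_features = correlations.drop("SalePrice").nlargest(k).index.tolist()
print(top_features)
```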

 

Filter-based feature selection

 

Select an algorithm

First, we must identify what kind of machine learning problem this is: classification, regression, clustering, etc. Since the response class (sales price) is a continuous numeric value, we can tell that it is a regression problem. We will use a linear regression model with regularization to reduce the over-fitting of the model.

  • To ensure stable convergence of weights and biases, all features except the response class must be normalized into the same range (as shown in the sketch below).
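A hedged scikit-learn sketch of this step, continuing from the earlier code: Ridge (L2-regularized linear regression) stands in for the Studio module, and a StandardScaler performs the normalization mentioned in the bullet above.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Continuing from the earlier sketch (assumes `df` and `top_features` exist):
# scale every feature (but not the response), then fit a regularized linear model.
X = df[top_features]
y = df["SalePrice"]

# alpha is the L2 regularization weight.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
```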

 

Regularization

 

Model training and evaluation

Cross-validation will be used to evaluate the predictive performance of the model as well as the stability of that performance on new data. Cross-validation builds ten different models with the same algorithm but on different, non-repeating subsets of the same dataset. The evaluation metrics of the ten models are averaged, and their standard deviation indicates how stable that average performance is.
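Continuing the sketch, ten-fold cross-validation with RMSE as the metric might look like this (it assumes a reasonably recent scikit-learn for the neg_root_mean_squared_error scorer, and the model, X, and y from the previous step).

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation; report mean RMSE and its standard deviation
# as a stability measure (continuing from the earlier sketch).
scores = cross_val_score(model, X, y, cv=10, scoring="neg_root_mean_squared_error")
rmse = -scores
print(f"mean RMSE: {rmse.mean():,.0f}  std dev: {rmse.std():,.0f}")
```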

 

Model training and evaluation – cross-validation in Azure Machine Learning

 

 

Cross-validation

Parameter tuning

This experiment will build a regression model that minimizes the mean RMSE of the cross-validation results with the lowest variance possible (while also considering bias-variance trade-offs).

  1. The first regression model was built using default parameters and produced a very inaccurate model ($124,942 mean RMSE) and was very unstable (11,699 standard deviation).
  2. The high bias and high variance of the previous model suggest the model is over-fitting to the outliers and is under-fitting the general population.
  3. The L2 regularization weight will be decreased to lower the penalty of higher coefficients. After lowering the L2 regularization weight, the model is more accurate with an average cross-validation RMSE of $42,366.
  4. The previous model is still quite unstable with a standard deviation of $8,121. Since this is a dataset with a small number of observations (1460), it may be better to increase the number of training epochs so that the algorithm has more passes to reach convergence.
  5. This will increase training times but also increase stability. The third linear model had the number of training epochs increased and saw a better mean cross-validation RMSE of $36,684 and a much more stable standard deviation of $3,849.
  6. The final model had a slight increase in the learning rate which improved both mean cross-validation RMSE and the standard deviation.
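A rough way to reproduce this kind of tuning in code, continuing the earlier sketch, is a grid search over the same knobs: regularization weight, number of training passes, and learning rate. Here SGDRegressor is a stand-in for Azure ML’s gradient-descent linear regression, and the grid values are illustrative only.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Continuing from the earlier sketch (assumes X and y exist).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reg", SGDRegressor(penalty="l2", random_state=0)),
])

# alpha ~ L2 regularization weight, max_iter ~ training epochs, eta0 ~ learning rate.
param_grid = {
    "reg__alpha": [1e-4, 1e-3, 1e-2],
    "reg__max_iter": [1000, 5000, 10000],
    "reg__eta0": [0.01, 0.05, 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=10,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```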

 

Parameter tuning

Deployment

The algorithm parameters that yielded the best results will be the ones that are shipped. The best algorithm (the last one) will be retrained using 100% of the data since cross-validation leaves 10% out each time for validation.
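Continuing the sketch, retraining the best configuration on all of the data and persisting it for deployment could look like this (the output file name is a placeholder).

```python
import joblib

# Retrain the best configuration found by the grid search on 100% of the data
# and save it so it can be shipped (continuing from the earlier sketch).
final_model = search.best_estimator_
final_model.fit(X, y)
joblib.dump(final_model, "house_price_model.joblib")
```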

 

Train Model

Further improvement of this Azure machine learning model

Feature engineering was entirely left out of this experiment. Try engineering more features from the existing dataset to see if the model will improve. Some columns that were originally dropped may become useful when combined with other features. For example, try bucketing the years in which the house was built by decade. Clustering the data may also yield some hidden insights.
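For example, a hypothetical decade-bucketing feature could be added like this, continuing the earlier pandas sketch.

```python
# Bucket the year the house was built by decade and treat it as a category
# (continuing from the earlier sketch; assumes `df` with a YearBuilt column).
df["DecadeBuilt"] = ((df["YearBuilt"] // 10) * 10).astype("category")
```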

 

 

Written by Phuc Duong

June 15, 2022

Data Science Dojo has launched a Grafana offering on the Azure Marketplace to help you harvest insights from your data. It leverages the power of Microsoft Azure services to visualize, query, and set alerts on your data while promoting teamwork and transparency.

Excel is Not Responding

Does the above visual seem familiar? How many times have you tried to meet a deadline only to be met by this message? After all, spreadsheets can deal with complex calculations only up to a certain threshold.

Drawbacks of spreadsheets

Spreadsheets offer a lot of useful features for data entry, calculation, and manipulation. But dealing with all the cells and formulas can get overwhelming, making the model more prone to errors that affect its integrity. 

There are security and privacy issues when users store data in their individual spreadsheets and drives, and close collaboration becomes a hassle when data is spread across different platforms. It is also impossible to keep track of where entries were altered or updated, resulting in multiple versions of the same file and undermining overall confidence in the model. 

Finally, you cannot present a stack of spreadsheets to your audience: they need a story, and a story cannot be conveyed through rows and columns of raw data. All of these problems can be overcome by using Grafana to generate insightful dashboards that summarize your data into easy-to-read visuals and alerts that make generating actionable items much easier!

What is Grafana?

Grafana Logo

Grafana is built on the principle that data should be accessible to everyone; it allows visualizations to be shared, promoting teamwork and transparency. It enables customers to take any of their existing data and visualize it however they want, offers advanced querying and transformation, and lets customers create customized dashboards and panels that cater to their specific needs. We at Data Science Dojo deliver data science education, consulting, and technical services to increase the power of data.

Thus, we are adding a Grafana instance to the Azure Marketplace to help you harvest insights from your data. It leverages the power of Microsoft Azure services to capture visits and events and to monitor user actions. Install our Grafana offering now and get started on your journey toward optimal analysis.

Why Grafana?

Unify your data from various platforms

Grafana lets you integrate data from various platforms, including both Azure and non-Azure services. That’s right! It doesn’t matter whether your data lives in Google Sheets or Azure Cosmos DB; you can connect to all of these sources at once. 
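If you prefer to script data source setup rather than click through the UI, one hedged option (not part of the Marketplace offer itself) is Grafana’s HTTP API. The URL, API token, and Azure credentials below are placeholders only.

```python
import requests

# Placeholders: your Grafana URL and an API token with admin rights.
GRAFANA_URL = "http://localhost:3000"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

# Register an Azure Monitor data source through Grafana's HTTP API.
payload = {
    "name": "Azure Monitor",
    "type": "grafana-azure-monitor-datasource",
    "access": "proxy",
    "jsonData": {
        "cloudName": "azuremonitor",
        "tenantId": "YOUR_TENANT_ID",     # placeholder
        "clientId": "YOUR_CLIENT_ID",     # placeholder
    },
    "secureJsonData": {"clientSecret": "YOUR_CLIENT_SECRET"},  # placeholder
}

resp = requests.post(f"{GRAFANA_URL}/api/datasources", json=payload, headers=HEADERS)
resp.raise_for_status()
print(resp.json())
```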

Search & query through your data

Imagine having to go through a thousand spreadsheets just to find the one entry that satisfies your condition. Sounds impossible? Not with Grafana. In its collaborative environment, you can write custom analytics queries to filter out the data that fits your requirements.

Customized visualization & dashboards

Grafana Dashboard
Grafana Dashboard

Grafana offers you the option to generate highly customized visualizations that help you gain tactical insight from data that is often ignored. Leverage the power of Azure to collaborate and share various Grafana Dashboards with different stakeholders within and outside your organization.

Alerts

It can be difficult to constantly monitor your crucial KPIs and metrics, and sometimes you may not realize a KPI has dipped below its threshold until it is too late. Grafana lets you set up custom alerts to monitor these metrics and push notifications to platforms such as Slack and Microsoft Teams when it is time to act.

Try Grafana

June 10, 2022
