
azure

Data Science Dojo Staff

Revolutionize data management with Meltano CLI – The ultimate open-source solution

Data Science Dojo is offering Meltano CLI for FREE on Azure Marketplace, preconfigured with Meltano, a platform that provides flexibility and scalability. It has four defining qualities: it is customizable, observable with a full view of your pipelines, testable, and versionable, so changes can be tracked and easily rolled back if needed.

Installing the technology is a tiring process, and after that you still have to look after integration and dependency issues. Already feeling tired? Resolving installation errors can be confusing too. Not to worry, as Data Science Dojo’s Meltano CLI instance fixes all of that. But before we delve further into it, let us get to know some basics.

What is Meltano?

Meltano is an open-source Command Line Interface (CLI) tool that offers a flexible and scalable solution for Extract, Load, and Transform (ELT) processes. It is designed to assist data engineers in transforming, converting, and validating data in a simplified manner while ensuring accuracy and reliability.

The Meltano CLI can efficiently handle complex data engineering tasks, providing a user-friendly interface that simplifies the ELT process. It can also integrate with different data sources, enabling users to extract data from various sources, load it into a target destination, and transform it according to their specific requirements.

In addition, it offers a range of plugins that extend its capabilities and allow users to customize their ELT workflows. These plugins include extractors, loaders, and transformers, among others.

Challenges for individuals

Before Meltano CLI, there were several challenges associated with data integration that made the process difficult and time-consuming. Here are a few of the main challenges:

Lack of Standardization: Data integration tools were often proprietary, which made it difficult to integrate different tools and workflows. This meant that organizations often had to use multiple tools to complete a data integration project.

Complexity: Many data integration tools were complex and required extensive knowledge of programming and data architecture to use effectively. This made it difficult for non-technical users to participate in data integration projects.

Scalability: As data volumes grew, many data integration tools struggled to handle the scale of the data. This led to slow and inefficient data integration processes.

Cost: Many data integration tools were expensive, which made them inaccessible for smaller organizations with limited budgets.

Limited Customization: Many data integration tools offered limited customization options, which made it difficult to adapt the tool to fit the unique needs of an organization.

All in all, Meltano CLI was designed to address many of these challenges by providing an open-source, flexible, and user-friendly tool that can be customized to fit the unique requirements of users.

*Meltano CLI for ELT – Data Science Dojo*

Why Meltano?

Meltano CLI stands out as a data engineering tool because it provides flexibility and scalability. It has four defining qualities: it is customizable, observable with a full view of your pipelines, testable, and versionable, so changes can be tracked and easily rolled back if needed.

Meltano CLI addresses many of the struggles described above, which makes it a compelling choice for many users. Its strengths include:

Open-source: It is free and open-source, which means that users can download, use, and modify the source code as per their needs.
Easy-to-use: It is designed to be easy to use with a simple command-line interface and intuitive user interface. Users can easily configure, execute, and monitor data integration pipelines.
Customizable: Meltano CLI offers a high degree of customization, allowing users to define custom transformations, connectors, and integrations.
Modern stack: It is built using modern open-source technologies such as Python, Flask, and Vue.js, making it easy to extend and integrate with other tools.
GitLab Integration: Meltano was originally developed at GitLab, and it can be easily integrated with GitLab for version control, collaboration, and continuous integration and deployment (CI/CD).

Overall, Meltano CLI is a powerful and flexible data integration tool that offers a unique set of features and benefits that may make it a good choice for certain data integration projects. However, the choice of tool ultimately depends on the specific needs and requirements of the project at hand.

Integrations

MeltanoHub is the primary location to find all plugins, including Singer taps and targets. It serves as a single source of truth for users, making it easy to discover and use plugins within Meltano. Additionally, users can contribute to the Hub by adding more plugins, which are immediately accessible.

The Hub is maintained by Meltano and the broader community, ensuring that it is continuously curated and up to date. This centralized platform simplifies the process of finding and using plugins, enabling users to enhance their data engineering workflows with ease.
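To make the idea of Singer taps and targets more concrete, here is a minimal sketch of the message flow between them. A tap writes SCHEMA, RECORD, and STATE messages as JSON lines, and a target reads them. The stream name "users", the sample rows, and the functions run_tap and run_target are hypothetical and only illustrate the shape of the messages; real plugins from the Hub do considerably more.

```python
import io
import json


def run_tap(rows, out):
    """Toy 'tap': write Singer-style SCHEMA, RECORD, and STATE messages as JSON lines."""
    schema = {
        "type": "SCHEMA",
        "stream": "users",  # hypothetical stream name
        "schema": {"properties": {"id": {"type": "integer"}, "name": {"type": "string"}}},
        "key_properties": ["id"],
    }
    out.write(json.dumps(schema) + "\n")
    for row in rows:
        out.write(json.dumps({"type": "RECORD", "stream": "users", "record": row}) + "\n")
    # STATE carries a bookmark so that a later run can resume where this one stopped.
    last_id = rows[-1]["id"] if rows else None
    out.write(json.dumps({"type": "STATE", "value": {"users": {"last_id": last_id}}}) + "\n")


def run_target(lines):
    """Toy 'target': collect records per stream and remember the latest state."""
    loaded, state = {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            loaded.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return loaded, state


buffer = io.StringIO()
run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}], buffer)
loaded, state = run_target(buffer.getvalue().splitlines())
print(loaded["users"], state)
```

Because both sides only agree on this line-based message format, any tap can in principle be paired with any target, which is what lets a hub of interchangeable plugins work.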

Key features

Meltano CLI includes several features, including:

Easy to set up and easy to use
Pipeline creation and management
Extract, load, and transform (ELT) processes
Plugin management
Visualization
Configuration management
Version control
Testability
Integration with other tools: It seamlessly integrates with other tools such as dbt, Singer, and Airflow, among others, to enhance your workflow.

What does Data Science Dojo have for you?

Azure Virtual Machine is preconfigured with CLI plug-and-play functionality, so you do not have to worry about setting up the environment.

Features include a zero-setup CLI platform that offers a high degree of customization, allowing users to define custom transformations, connectors, and integrations. It is designed to be easy to use with a simple command-line interface and intuitive user interface.
Meltano CLI helps you efficiently transform, convert, and validate your data using a simplified process for data engineering, with the assurance of accuracy and reliability.

And there are many others, which you can check by taking a quick peek here: Meltano CLI on Azure Marketplace. What sets it apart from others is that it is an open-source, flexible, and scalable CLI for ELT+. It is customizable. It is also observable: it provides a full view with detailed pipeline logs and statistics, and allows inspection of code for debugging. Meltano is versionable, which allows easy tracking and rollback of changes. It is testable and only deploys to production once everything is green.

Moreover, Meltano CLI is a powerful and flexible data integration tool that offers many benefits over other tools on the market. Its open-source nature, ease of use, integration with other tools, reconfigurability, and community support make it a compelling choice for data integration projects.

Conclusion

The Meltano CLI comes with pre-configured Ubuntu 20.04 and a ready-to-use project, allowing for a plug-and-play experience without any setup required. By using Azure, the fault tolerance of data pipelines is increased, resulting in higher performance and faster content delivery.

The Meltano CLI provides an open-source, flexible, and scalable CLI for ELT+, allowing for efficient data transformation, conversion, and validation with accuracy and reliability. When combined with Microsoft Azure services, Meltano outperforms traditional methods by performing data-intensive computations in the cloud. Collaboration and sharing of notebooks with stakeholders is also possible.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free project environment dedicated specifically to data integration and ELT on Azure Marketplace. Do not wait to install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!

Written by Insiyah Talib

March 15, 2023

Data Engineering

Data Science Dojo Staff

Memphis: A Game-Changer in Traditional Messaging Systems

Data Science Dojo is offering Memphis broker for FREE on Azure Marketplace, preconfigured with Memphis, a platform that combines a P2P architecture, scalability, storage tiering, fault tolerance, and security to deliver real-time processing for modern applications that handle large volumes of data.

Introduction

Installing Docker first and then Memphis is a cumbersome and tiring process, and after that you still have to look after integration and dependency issues. Are you already feeling tired? Resolving installation errors can be confusing too. Not to worry, as Data Science Dojo’s Memphis instance fixes all of that. But before we delve further into it, let us get to know some basics.

What is Memphis?

Memphis is an open-source modern replacement for traditional messaging systems. It is a cloud-based messaging system with a comprehensive set of tools that makes it easy and affordable to develop queue-based applications. It is reliable, can handle large volumes of data, and supports modern protocols. It requires minimal operational maintenance and allows for rapid development, resulting in significant cost savings and reduced development time for data-focused developers and engineers.

Challenges for Individuals

Traditional messaging brokers, such as Apache Kafka, RabbitMQ, and ActiveMQ, have been widely used to enable communication between applications and services. However, there are several challenges with these traditional messaging brokers:

Scalability: Traditional messaging brokers often have limitations on their scalability, particularly when it comes to handling large volumes of data. This can lead to performance issues and message loss.
Complexity: Setting up and managing a traditional messaging broker can be complex, particularly when it comes to configuring and tuning it for optimal performance.
Single Point of Failure: Traditional messaging brokers can become a single point of failure in a distributed system. If the messaging broker fails, it can cause the entire system to go down.
Cost: Traditional messaging brokers can be expensive to deploy and maintain, particularly for large-scale systems.
Limited Protocol Support: Traditional messaging brokers often support only a limited set of protocols, which can make it challenging to integrate with other systems and technologies.
Limited Availability: Traditional messaging brokers can be limited in terms of the platforms and environments they support, which can make it challenging to use them in certain scenarios, such as cloud-based systems.

Overall, these challenges have led to the development of new messaging technologies, such as event streaming platforms, that aim to address these issues and provide a more flexible, scalable, and reliable solution for modern distributed systems.

Memphis As a Solution

Why Memphis?

“It took me three minutes to build in Memphis what took me a week and a half in Kafka.”

Memphis and traditional messaging brokers are both software systems that facilitate communication between different components or systems in a distributed architecture. However, there are some key differences between the two:

Architecture: Memphis uses a peer-to-peer (P2P) architecture, while traditional messaging brokers use a client-server architecture. In a P2P architecture, each node in the network can act as both a client and a server, while in a client-server architecture, clients send messages to a central server which distributes them to the appropriate recipients.
Scalability: Memphis is designed to be highly scalable and can handle large volumes of messages without introducing significant latency, while traditional messaging brokers may struggle to scale to handle high loads. This is because Memphis uses a distributed hash table (DHT) to route messages directly to their intended recipients, rather than relying on a centralized message broker.
Fault tolerance: Memphis is highly fault-tolerant, with messages automatically routed around failed nodes, while traditional messaging brokers may experience downtime if the central broker fails. This is because Memphis uses a distributed consensus algorithm to ensure that all nodes in the network agree on the state of the system, even in the presence of failures.
Security: Memphis provides end-to-end encryption by default, while traditional messaging brokers may require additional configuration to ensure secure communication between nodes. This is because it is designed to be used in decentralized applications, where trust between parties cannot be assumed.

Overall, while both Memphis and traditional messaging brokers facilitate communication between different components or systems, they have different strengths and weaknesses and are suited to different use cases. Memphis is ideal for highly scalable and fault-tolerant applications that require end-to-end encryption, while traditional messaging brokers may be more appropriate for simpler applications that do not require the same level of scalability and fault tolerance.

What Struggles does Memphis Solve?

Handling too many data sources can become overwhelming, especially with complex schemas. Analyzing and transforming streamed data from each source is difficult, and it requires using multiple applications like Apache Kafka, Flink, and NiFi, which can delay real-time processing.

Additionally, there is a risk of message loss due to crashes, lack of retransmits, and poor monitoring. Debugging and troubleshooting can also be challenging. Deploying, managing, securing, updating, onboarding, and tuning message queue systems like Kafka, RabbitMQ, and NATS is a complicated and time-consuming task. Transforming batch processes into real-time can also pose significant challenges.

Integrations:

Memphis Broker provides several integration options for connecting to diverse types of systems and applications. Here are some of the integrations available in Memphis Broker:

JMS (Java Message Service) Integration
.NET Integration
REST API Integration
MQTT Integration
AMQP Integration
Apache Camel, Apache ActiveMQ, and IBM WebSphere MQ.

Key features:

Fully optimized message broker in under 3 minutes
Easy-to-use UI, CLI, and SDKs
Dead-letter station (DLQ)
Data-level observability
Runs on your Docker or Kubernetes
Real-time event tracing
SDKs: Python, Go, Node.js, TypeScript, NestJS, Kotlin, .NET, Java
Embedded schema management using Protobuf, JSON Schema, GraphQL, Avro
Slack integration

What Data Science Dojo has For You:

The Azure Virtual Machine is preconfigured with plug-and-play functionality, so you do not have to worry about setting up the environment. Features include a zero-setup Memphis platform that lets you:

Build a dead-letter queue
Create observability
Build a scalable environment
Create client wrappers
Handle back pressure on the client or queue side
Create a retry mechanism
Configure monitoring and real-time alerts
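To illustrate two of the items above, the retry mechanism and the dead-letter queue, here is a minimal, broker-agnostic sketch. It does not use the Memphis SDK. The function process_with_retries, the max_attempts parameter, and the sample handler are all hypothetical, and an in-memory deque stands in for a real queue.

```python
from collections import deque


def process_with_retries(messages, handler, max_attempts=3):
    """Deliver each message to handler; after max_attempts failures, dead-letter it."""
    queue = deque((message, 1) for message in messages)
    delivered, dead_letters = [], []
    while queue:
        message, attempt = queue.popleft()
        try:
            handler(message)
            delivered.append(message)
        except Exception as error:
            if attempt < max_attempts:
                # Put the message back at the end of the queue for another attempt.
                queue.append((message, attempt + 1))
            else:
                # Out of attempts: park the message with context for later inspection.
                dead_letters.append(
                    {"message": message, "error": str(error), "attempts": attempt}
                )
    return delivered, dead_letters


def handler(message):
    """Hypothetical consumer that cannot process the payload 'bad'."""
    if message == "bad":
        raise ValueError("cannot parse payload")


delivered, dead_letters = process_with_retries(["a", "bad", "b"], handler)
print(delivered, dead_letters)
```

Keeping the failed message together with its error and attempt count is what makes a dead-letter queue useful for debugging: the healthy messages keep flowing while the problematic ones wait for inspection.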

Memphis stands out from other solutions because it can be set up in just three minutes, while others can take weeks. It’s great for creating modern queue-based apps with large amounts of streamed data and modern protocols, and it reduces costs and dev time for data engineers. Memphis has a simple UI, CLI, and SDKs, and offers features like automatic message retransmitting, storage tiering, and data-level observability.

Moreover, Memphis is a next-generation alternative to traditional message brokers: a simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables cost-effective, fast, and reliable development of modern queue-based use cases.

Wrapping Up

Memphis comes pre-configured on Ubuntu 20.04 in a plug-and-play environment, so users do not have to set anything up. Running it in the cloud guarantees high availability, as data can be distributed across multiple data centers and availability zones on the go. In this way, Azure increases the fault tolerance of data pipelines.

The power of Azure ensures maximum performance and high throughput, so the server can deliver content at low latency. Memphis is designed to provide a robust messaging system for modern applications, along with high scalability and fault tolerance.

The flexibility, performance, and scalability that an Azure virtual machine provides to Memphis make it possible to offer a production-ready message broker in under 3 minutes, with durability, stability, and efficient performance.

When coupled with Microsoft Azure services and processing speed, it outperforms the traditional counterparts because data-intensive computations are not performed locally, but in the cloud. You can collaborate and share notebooks with various stakeholders within and outside the company while monitoring the status of each

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Memphis instance on Azure Marketplace, dedicated specifically to highly scalable and fault-tolerant applications that require end-to-end encryption. Do not wait to install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!

Written by Insiyah Talib

March 9, 2023

Data Science

Syed Umair Hasan

Creating a web app for Gradio application on Azure using Docker: A step-by-step guide

In this step-by-step guide, learn how to deploy a web app for Gradio on Azure with Docker. This blog covers everything from Azure Container Registry to Azure Web Apps, with a step-by-step tutorial for beginners.

‘I was searching for ways to deploy a Gradio application on Azure, but there wasn’t much information to be found online. After some digging, I realized that I could use Docker to deploy custom Python web applications, which was perfect since I had neither the time nor the expertise to go through the “code” option on Azure.

The process of deploying a web app begins by creating a Docker image, which contains all of the application’s code and its dependencies. This allows the application to be packaged and pushed to the Azure Container Registry, where it can be stored until needed.

From there, it can be deployed to the Azure App Service, where it is run as a container and can be managed from the Azure Portal. In this portal, users can adjust the settings of their app, as well as grant access to roles and services when needed.

Once everything is set and the necessary permissions have been granted, the web app should be able to properly run on Azure. Deploying a web app on Azure using Docker is an easy and efficient way to create and deploy applications, and can be a great solution for those who lack the necessary coding skills to create a web app from scratch!’

Comprehensive overview of creating a web app for Gradio

Gradio application

Gradio is a Python library that allows users to create interactive demos and share them with others. It provides a high-level abstraction through the Interface class, while the Blocks API is used for designing web applications.

Blocks provide features like multiple data flows and demos, control over where components appear on the page, handling complex data flows, and the ability to update properties and visibility of components based on user interaction. With Gradio, users can create a web application that allows their users to interact with their machine learning model, API, or data science workflow.

The two primary files in a Gradio application are:

app.py: This file contains the source code for the application.
requirements.txt: This file lists the Python libraries required for the source code to function properly.

Docker

Docker is an open-source platform for automating the deployment, scaling, and management of applications as containers. It uses a container-based approach to package software, which enables applications to be isolated from each other, making it easier to deploy, run, and manage them in a variety of environments.

A Docker container is a lightweight, standalone, and executable software package that includes everything needed to run a specific application, including the code, runtime, system tools, libraries, and settings. Containers are isolated from each other and the host operating system, making them ideal for deploying microservices and applications that have multiple components or dependencies.

Docker also provides a centralized way to manage containers and share images, making it easier to collaborate on application development, testing, and deployment. With its growing ecosystem and user-friendly tools, Docker has become a popular choice for developers, system administrators, and organizations of all sizes.

Azure Container Registry

Azure Container Registry (ACR) is a fully managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. It allows you to store, manage, and deploy Docker containers in a secure and scalable way, making it an important tool for modern application development and deployment.

With ACR, you can store your own custom images and use them in your applications, as well as manage and control access to them with role-based access control. Additionally, ACR integrates with other Azure services, such as Azure Kubernetes Service (AKS) and Azure DevOps, making it easy to deploy containers to production environments and manage the entire application lifecycle.

ACR also provides features such as image signing and scanning, which helps ensure the security and compliance of your containers. You can also store multiple versions of images, allowing you to roll back to a previous version if necessary.

Azure Web App

Azure Web Apps is a fully managed platform for building, deploying, and scaling web applications and services. It is part of the Azure App Service, which is a collection of integrated services for building, deploying, and scaling modern web and mobile applications.

With Azure Web Apps, you can host web applications written in a variety of programming languages, such as .NET, Java, PHP, Node.js, and Python. The platform automatically manages the infrastructure, including server resources, security, and availability, so that you can focus on writing code and delivering value to your customers.

Azure Web Apps supports a variety of deployment options, including direct Git deployment, continuous integration and deployment with Azure DevOps (formerly Visual Studio Team Services) or GitHub, and deployment from Docker containers. It also provides built-in features such as custom domains, SSL certificates, and automatic scaling, making it easy to deliver high-performing, secure, and scalable web applications.

A step-by-step guide to deploying a Gradio application on Azure using Docker

This guide assumes a foundational understanding of Azure and the presence of Docker on your desktop. Refer to the instructions for getting started on Mac, Windows, or Linux for Docker.

Step 1: Create an Azure Container Registry resource

Go to Azure Marketplace, search for ‘container registry’, and hit ‘Create’.

Under the “Basics” tab, complete the required information and leave the other settings as the default. Then, click “Review + Create.”

Step 2: Create a Web App resource in Azure

In Azure Marketplace, search for “Web App”, select the “Web App” resource, and then click “Create”.

Under the “Basics” tab, complete the required information, choose the appropriate pricing plan, and leave the other settings as the default. Then, click “Review + Create.”


Upon completion of all deployments, three resources will be in your resource group (typically the container registry, the App Service plan, and the Web App).

Step 3: Create a folder containing the “app.py” file and its corresponding “requirements.txt” file

To begin, we will utilize an emotion detector application, the model for which can be found at https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion.

APP.PY

REQUIREMENTS.TXT
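The original file contents are not reproduced here, so the following is only a sketch of what an app.py for this emotion detector could look like, not the author's actual code. It assumes that requirements.txt lists gradio, transformers, and torch. The names to_label_scores and build_demo are hypothetical, and the top_k=None argument (to return a score for every label) assumes a reasonably recent transformers release.

```python
MODEL_NAME = "bhadresh-savani/distilbert-base-uncased-emotion"


def to_label_scores(predictions):
    """Convert text-classification output into a {label: score} dict for a label output."""
    # Depending on the transformers version, results for a single input may be
    # wrapped in an extra list; unwrap it if so.
    if predictions and isinstance(predictions[0], list):
        predictions = predictions[0]
    return {item["label"]: float(item["score"]) for item in predictions}


def build_demo():
    """Build the Gradio interface (assumes gradio, transformers, and torch are installed)."""
    # Imported here so the helper above can be tried without the heavy dependencies.
    import gradio as gr
    from transformers import pipeline

    classifier = pipeline("text-classification", model=MODEL_NAME, top_k=None)

    def detect_emotion(text):
        return to_label_scores(classifier(text))

    return gr.Interface(
        fn=detect_emotion, inputs="text", outputs="label", title="Emotion detector"
    )


# Made-up scores in the shape the classifier returns, to show the conversion.
sample = [[{"label": "joy", "score": 0.9}, {"label": "anger", "score": 0.1}]]
print(to_label_scores(sample))
```

To serve the app, the file would end with build_demo().launch(server_name="0.0.0.0", server_port=7000). Binding to 0.0.0.0 makes the app reachable from outside the container, and port 7000 matches the port mapping used later in this guide.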

Step 4: Launch Visual Studio Code and open the folder

Step 5: Launch Docker Desktop to start Docker.

Step 6: Create a Dockerfile

A Dockerfile is a script that contains instructions to build a Docker image. This file automates the process of setting up an environment, installing dependencies, copying files, and defining how to run the application. With a Dockerfile, developers can easily package their application and its dependencies into a Docker image, which can then be run as a container on any host with Docker installed. This makes it easy to distribute and run the application consistently in different environments. The following contents should be utilized in the Dockerfile:

DOCKERFILE

Step 7: Build and run a local Docker image

Run the following commands in the VS Code terminal.

1. docker build -t demo-gradio-app .

The “docker build” command builds a Docker image from a Dockerfile.

The “-t demo-gradio-app” option specifies the name, and optionally a tag, of the image in the “name:tag” format.
The final “.” specifies the build context, which is the current directory where the Dockerfile is located.

2. docker run -it -d --name my-app -p 7000:7000 demo-gradio-app

The “docker run” command starts a new container based on a specified image.

The “-it” option opens an interactive terminal in the container and keeps the standard input attached to the terminal.

The “-d” option runs the container in the background as a daemon process.

The “--name my-app” option assigns a name to the container for easier management.

The “-p 7000:7000” option maps a port on the host to a port inside the container, in this case, mapping the host’s port 7000 to the container’s port 7000.

The “demo-gradio-app” is the name of the image to be used for the container.

This command will start a new container with the name “my-app” from the “demo-gradio-app” image in the background, with an interactive terminal attached, and port 7000 on the host mapped to port 7000 in the container.

To view your local app, navigate to the Containers tab in Docker Desktop and click the link under Port.

Step 8: Tag & Push the Image to Azure Container Registry

First, enable ‘Admin user’ from the ‘Access Keys’ tab in Azure Container Registry.


Log in to your container registry using the following command. The login server, username, and password can be found in the ‘Access Keys’ tab from the previous step.

docker login gradioappdemos.azurecr.io

Tag the image for uploading to your registry using the following command.

docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app

The command “docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app” is used to tag a Docker image.
“docker tag” is the command used to create a new tag for a Docker image.
“demo-gradio-app” is the source image name that you want to tag.
“gradioappdemos.azurecr.io/demo-gradio-app” is the new image name with a repository name and optionally a tag in the “repository:tag” format.
This command will create a new tag “gradioappdemos.azurecr.io/demo-gradio-app” for the “demo-gradio-app” image. This new tag can be used to reference the image in future Docker commands.

Push the image to your registry.

docker push gradioappdemos.azurecr.io/demo-gradio-app

“docker push” is the command used to upload a Docker image to a registry.
“gradioappdemos.azurecr.io/demo-gradio-app” is the name of the image with the repository name and tag to be pushed.
This command will push the Docker image “gradioappdemos.azurecr.io/demo-gradio-app” to the registry specified by the repository name. The registry is typically a place where Docker images are stored and distributed to others.
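If you prefer to script the tag and push steps, the naming rule can be captured in a few lines. The sketch below only assembles the commands and prints them. The helper names acr_image_reference and tag_and_push_commands are hypothetical, and the registry name is the one used in this guide. When no tag is given, Docker uses "latest", which the sketch makes explicit.

```python
import shlex


def acr_image_reference(login_server, image_name, tag="latest"):
    """Build a registry-qualified image reference: <login server>/<name>:<tag>."""
    return f"{login_server}/{image_name}:{tag}"


def tag_and_push_commands(local_image, login_server, tag="latest"):
    """Return the docker commands (as argument lists) that tag and push an image."""
    target = acr_image_reference(login_server, local_image, tag)
    return [
        ["docker", "tag", local_image, target],
        ["docker", "push", target],
    ]


commands = tag_and_push_commands("demo-gradio-app", "gradioappdemos.azurecr.io")
for command in commands:
    # Print each command as it would be typed in a terminal.
    print(shlex.join(command))
```

To actually execute the commands, each list could be passed to subprocess.run(command, check=True) after logging in to the registry.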

In the Repository tab, you can observe the image that has been pushed.


Step 9: Configure the Web App

Under the ‘Deployment Center’ tab, fill in the registry settings then hit ‘Save’.

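In the Configuration tab, create a new application setting for the website port (for custom containers on App Service, this setting is typically named WEBSITES_PORT) with the value 7000, as specified in the app.py file, and then hit ‘Save’.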


After the image extraction is complete, you can view the web app URL from the Overview page.

Step 10: Pushing Image to Docker Hub (Optional)

Here are the steps to push a local Docker image to Docker Hub:

Log in to Docker Hub using the following command:

docker login

Tag the local image using the following command, replacing [username] with your Docker Hub username and [image_name] with the desired image name:

docker tag [image_name] [username]/[image_name]

Push the image to Docker Hub using the following command:

docker push [username]/[image_name]

Verify that the image is now available in your Docker Hub repository by visiting https://hub.docker.com/ and checking your repositories.

Wrapping it up

In conclusion, deploying a web application using Docker on Azure is an easy and efficient way to create and deploy applications. This method is suitable for those who lack the necessary coding skills to create a web app from scratch. Docker is an open-source platform for automating the deployment, scaling, and management of applications as containers.

Azure Container Registry is a fully managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. Azure Web Apps is a fully managed platform for building, deploying, and scaling web applications and services. By following the step-by-step guide provided in this article, users can deploy a Gradio application on Azure using Docker.

February 22, 2023

Programming

Ateeq ur Rehman

Airbyte: The ultimate workhorse for all your ELT pipelines

Data Science Dojo is offering Airbyte for FREE on Azure Marketplace packaged with a pre-configured web environment enabling you to quickly start the ELT process rather than spending time setting up the environment.

What is an ELT pipeline? 

An ELT pipeline is a data pipeline that extracts (E) data from a source, loads (L) the data into a destination, and then transforms (T) data after it has been stored in the destination. The ELT process that is executed by an ELT pipeline is often used by the modern data stack to move data from across the enterprise into analytics systems. 

In other words, in the ELT approach, the transformation (T) of the data is done at the destination after the data has been loaded. The raw data that contains the data from a source record is stored in the destination as a JSON blob.
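The order of operations can be sketched in a few lines. This is only an illustration that uses an in-memory SQLite database as a stand-in destination. The table names raw_users and users, the sample records, and the lower-casing transformation are all made up for the example.

```python
import json
import sqlite3


def extract():
    """Stand-in for a source system; these records are made up."""
    return [
        {"id": 1, "email": "ADA@example.com", "plan": "pro"},
        {"id": 2, "email": "grace@example.com", "plan": "free"},
    ]


def load(connection, records):
    """Load step: store each source record untouched, as a JSON blob."""
    connection.execute("CREATE TABLE raw_users (data TEXT)")
    connection.executemany(
        "INSERT INTO raw_users (data) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )


def transform(connection):
    """Transform step: runs inside the destination, after loading."""
    connection.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
    )
    for (blob,) in connection.execute("SELECT data FROM raw_users").fetchall():
        record = json.loads(blob)
        connection.execute(
            "INSERT INTO users VALUES (?, ?, ?)",
            (record["id"], record["email"].lower(), record["plan"]),
        )


connection = sqlite3.connect(":memory:")
load(connection, extract())
transform(connection)
rows = connection.execute("SELECT id, email, plan FROM users ORDER BY id").fetchall()
raw_count = connection.execute("SELECT COUNT(*) FROM raw_users").fetchone()[0]
print(rows, raw_count)
```

Because raw_users is never modified, the transform step can be changed and re-run later without extracting the data from the source again.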

Airbyte’s architecture:

Airbyte is conceptually composed of two parts: platform and connectors.

The platform provides all the horizontal services required to configure and run data movement operations, for example, the UI, configuration API, job scheduling, logging, alerting, etc., and is structured as a set of microservices.

Connectors are independent modules that push/pull data to/from sources and destinations. Connectors are built under the Airbyte specification, which describes the interface with which data can be moved between a source and a destination using Airbyte. Connectors are packaged as Docker images, which allows total flexibility over the technologies used to implement them.

Obstacles for data engineers & developers 

Collecting and maintaining data from different sources is itself a hectic task for data engineers and developers. On top of that, building a custom ELT pipeline for every data source is a nightmare that not only consumes a lot of engineering time but also costs a lot.

In this scenario, a unified environment that handles quick data ingestion from various sources to various destinations would go a long way toward tackling these challenges.

Methodology of Airbyte

Airbyte leverages dbt (data build tool) to manage and create the SQL code used for transforming raw data in the destination. This step is sometimes referred to as normalization. An abstracted view of the data processing flow is given in the following figure:

It is worth noting that the above illustration displays a core tenet of ELT philosophy, which is that data should be untouched as it moves through the extracting and loading stages so that the raw data is always available at the destination. Since an unmodified version of the data exists in the destination, it can be re-transformed in the future without the need for a resync of data from source systems.

Major features

Airbyte supports hundreds of data sources and destinations including: 

Apache Kafka 
Azure Event Hub

Paste Data 
Other custom sources

By specifying credentials and adding extensions you can also ingest from and dump to: 

Azure Data Lake

Google Cloud Storage 
Amazon S3 & Kinesis

Other major features that Airbyte offers:

High extensibility: Adapt existing connectors to your needs or build a new one with ease.
Customization: Entirely customizable, starting with raw data or with a suggested normalized form of the data.
Full-grade scheduler: Automate your replications with the frequency you need.
Real-time monitoring: Logs all errors in full detail to help you understand failures.
Incremental updates: Automated replications are based on incremental updates to reduce your data transfer costs.
Manual full refresh: Re-sync all your data to start again whenever you want.
Debugging: Debug and modify pipelines as you see fit, without waiting.
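
The difference between incremental updates and a full refresh can be sketched as follows. This is a simplified illustration, assuming each record carries an `updated_at` field that serves as the sync cursor; the function names and sample rows are hypothetical and do not reflect Airbyte's internal code.

```python
def full_refresh(source_rows):
    """Return every row, ignoring any saved state."""
    return list(source_rows)


def incremental(source_rows, cursor):
    """Return rows whose 'updated_at' is newer than the saved cursor, plus the new cursor."""
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in new_rows), default=cursor)
    return new_rows, new_cursor


rows = [
    {"id": 1, "updated_at": "2023-01-01"},
    {"id": 2, "updated_at": "2023-01-05"},
    {"id": 3, "updated_at": "2023-01-09"},
]

# A previous run synced everything up to 2023-01-05; the next run only fetches id 3.
changed, cursor = incremental(rows, cursor="2023-01-05")
everything = full_refresh(rows)
```

ISO-formatted date strings compare correctly as plain strings, which keeps the sketch free of date parsing.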

What does Data Science Dojo provide?  

The Airbyte instance packaged by Data Science Dojo serves as a pre-configured ELT environment that makes data integration pipelines a commodity without the burden of installation. It offers efficient data migration and supports a variety of data sources and destinations.

Features included in this offer:  

Airbyte service that is easily accessible from the web and has a rich user interface.
Easy to operate and user-friendly.
Strong community support due to the open-source platform.
Free to use.

Conclusion 

There are a ton of small services that aren’t supported on traditional data pipeline platforms. If you can’t import all your data, you may only have a partial picture of your business. Airbyte solves this problem through custom connectors that you can build for any platform and get running quickly.

Install the Airbyte offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science! 

Click on the button below to head over to the Azure Marketplace and deploy Airbyte for FREE:

January 27, 2023

Data Science

Ali Mohsin

Apache Airflow: Monitor and manage the data pipelines and complex workflows

Data Science Dojo is offering Apache Airflow for FREE on Azure Marketplace packaged with a pre-configured web environment of Airflow with various data analytics features.

Introduction:

In this era of tighter data restrictions, it is more important than ever to understand, analyze, and manage your data throughout its lifecycle. It is also harder than ever, as data volumes rise and data pipelines get more complicated. Organizations and individuals need a complete, scalable, easy-to-analyze platform that manages and monitors complex workflows and supports several integrations.

What is Apache Airflow?

Apache Airflow is a powerful open-source tool for authoring, scheduling, and monitoring data and computational workflows. It makes it easier to manage, schedule, and coordinate complicated data pipelines from several sources.

What is DAG?

A DAG, or Directed Acyclic Graph, in Airflow is a collection of all the tasks you wish to execute, arranged to reflect their relationships and dependencies. A DAG is defined in a Python script that expresses its structure as code. The graph consists of nodes (the tasks) linked by directed edges (arrows) that never loop back on themselves. (DAGs are also used in other fields, for example to encode researchers’ a priori assumptions about causal relationships among variables.) Hence, a workflow is represented as a DAG made up of discrete units of work called Tasks, ordered according to their relationships and data flows.
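
The ordering idea behind a DAG can be shown with Python's standard library alone. The sketch below does not use the Airflow API; it relies on `graphlib.TopologicalSorter` (Python 3.9+), and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}

# static_order() yields tasks so that every task comes after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

`clean` and `validate` do not depend on each other, so either may come first; a scheduler is free to run them in parallel. If the dependencies contained a loop, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the graph must be acyclic.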

Apache Airflow Architecture:

This powerful and scalable workflow scheduling software is made up of four key parts:

Scheduler: The scheduler keeps track of all DAGs and the tasks they contain. It periodically checks which tasks are ready to run and triggers them.
Web server: The web server is the user interface for Airflow (by default it listens on port 8080). It displays the status of the jobs, gives the user access to the databases, and lets them read log files from remote file stores such as Microsoft Azure Blob Storage.
Database: The state of the DAGs and the tasks they contain is saved in the database so that scheduling metadata is retained. The scheduler scans each DAG and records essential data, including schedule intervals, run-by-run statistics, and task instances.
Executors: There are various kinds of executors for different use cases. A few examples are SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor.

(With SequentialExecutor, only one task can run at a time, so no parallel processing is possible. It is useful for testing or debugging. LocalExecutor runs tasks in parallel on a single machine, which makes it a good fit for running Airflow on a single node or a local workstation. CeleryExecutor is usually used for managing a distributed Airflow cluster. KubernetesExecutor uses the Kubernetes API to create a temporary pod for each task instance to run in.)
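
The contrast between running one task at a time and running several at once on a single machine can be illustrated with the standard library. This is an analogy only, not how Airflow implements its executors; `run_task` and the task names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(name: str) -> str:
    """Hypothetical task body; a real task would do I/O or computation here."""
    return f"{name}: done"


tasks = ["task_a", "task_b", "task_c"]

# One task at a time, in order (the SequentialExecutor idea).
sequential_results = [run_task(t) for t in tasks]

# Several tasks at once on a single machine (the LocalExecutor idea).
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_results = list(pool.map(run_task, tasks))
```

`pool.map` returns results in the order the tasks were submitted, so both approaches produce the same list here; only the wall-clock time differs once tasks do real work.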

Key features Apache Airflow provides:

Dynamic pipelines: Because pipelines are defined as code, they can be generated dynamically.
Apache Airflow has a rich user interface that helps users manage their workflows easily.
It provides a separate code view panel that lets users view their DAG code as well.
It allows users to visualize their DAGs in different forms, such as Gantt chart, tree, and graph views.
With ready-to-use operators in Airflow, users can work with various cloud platforms like Microsoft Azure, AWS (Amazon Web Services), etc.
It allows role-based user management to maintain security and accessibility.

Apache Airflow with Azure services:

Apache Airflow leverages the power of Azure services to make monitoring and managing complex workflows intuitive. Running on Azure also gives Airflow users a more scalable environment to work in.

Conclusion:

Apache Airflow faces intense competition from other open-source data engineering solutions, but it remains one of the most robust platforms used by data engineers for orchestrating workflows or pipelines. Users can easily visualize their data pipelines’ dependencies, progress, logs, code, triggered tasks, and success status, all in a single package.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We therefore know the importance of data and the encapsulated insights. Through this offer, we are confident that you can analyze, visualize, and query your data in a collaborative environment with greater ease.

Install the Apache Airflow offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Click on the button below to head over to the Azure Marketplace and deploy Apache Airflow for FREE by clicking on “Try now.”

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account.

December 2, 2022

Data Engineering

Saad Shaikh

SnowSQL – A CLI provision by Snowflake cloud warehouse

Data Science Dojo is offering SnowSQL for FREE on Azure Marketplace packaged with a pre-configured CLI for data manipulation in a Snowflake warehouse.

What is Snowflake?

Snowflake is a cloud-based data platform. It is a user-friendly data warehousing product that supports both ETL and ELT functionalities. It supports multiple data workloads, from data warehouses to data lakes, to enable data storage, data engineering, processing, and analytics. Further, experience with Snowflake can help you with data warehousing jobs in the near future.

It is relatively new, flexible, and easy to use, and it provides a pure cloud, SQL-based warehouse. It is not built on top of any existing database technology and is highly affordable.

Challenges for developers

Execution issues while loading and querying data, shortcomings in dealing with varied data, the absence of a central source leading to conflicting or corrupt data, and poor data sharing were a few of the big obstacles faced by data engineers and developers.

Therefore, a cloud data warehouse capable of storing variable-length records at vast scale, with fault tolerance, high availability, and fast processing, can reduce the task overhead by a big margin. Ideally, the data could also be transformed using a standard, widely known language.

Snowflake: SnowSQL

Using SQL, one of data engineers’ favorite languages, developers can now load, transform, and unload data in their Snowflake cloud. SnowSQL is an interactive command-line scripting and query tool that allows users to perform DML and DDL operations at the CLI level. It is built to high security standards and integrates tightly with the Snowflake core architecture.

With this tool, you can connect to your Snowflake account and configure your databases, schemas, and warehouses using simple SQL. Users can import existing databases of the data cloud into local machines and then perform various transform operations. It also allows you to unload or dump your transformed data back into the database containers.
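
As a rough sketch of what connecting and running a query looks like, the Python helper below assembles a SnowSQL command line without executing it. The helper name and all connection values are hypothetical placeholders. The flags used (`-a` account, `-u` user, `-d` database, `-s` schema, `-q` query) are the ones SnowSQL is generally documented to accept, but treat them as an assumption and check them against your installed version.

```python
import shlex


def build_snowsql_command(account: str, user: str, database: str, schema: str, query: str) -> list[str]:
    """Assemble an argument list for a one-off SnowSQL query (not executed here)."""
    return [
        "snowsql",
        "-a", account,
        "-u", user,
        "-d", database,
        "-s", schema,
        "-q", query,
    ]


# Placeholder values; replace with your own account details.
cmd = build_snowsql_command(
    account="myorg-myaccount",
    user="analyst",
    database="demo_db",
    schema="public",
    query="SELECT CURRENT_VERSION();",
)
print(shlex.join(cmd))
```

The password is deliberately left out of the argument list so that it never appears in shell history or process listings.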

Figure: Snowflake architecture with SnowSQL

What we provide

Snowflake: SnowSQL packaged by Data Science Dojo serves as a pre-installed CLI environment for the Snowflake data cloud, making it easy for developers to perform SQL operations on warehouse data without the burden of installation. The offer listed by Data Science Dojo provides the following elements:

A Command Line Interface (CLI) installed with SnowSQL, which can be connected to your Snowflake account
Support for standard SQL
Robust integrations
Ability to load and unload data in CSV, XML, JSON, Avro, and other formats
Syntax highlighting and auto-complete

Significant characteristics of SnowSQL

Centralization and Democratization: Snowflake combines warehouses and data lakes into a central data store and democratizes it so that clients can do better processing and analytics.
Smart Data Handling: Snowflake is able to manage exponentially growing volumes, variety, and velocity of data. Using SnowSQL, we can perform ETL and ELT operations using ANSI SQL.
Secure and Fault Tolerant: SnowSQL secures connections to Snowflake using TLS with OCSP checks. Your data remains available even if one or more nodes go down, as the source is fault-tolerant.
Recovery and Time Travel: The UNDROP TABLE command is notable because it restores a dropped table. The Time Travel feature allows the recovery of an earlier version of an object to reverse updates.
High Elastic Performance: With SnowSQL, you can create multiple virtual warehouses of variable sizes, which ensures your ETL process goes smoothly.

Conclusion

SnowSQL leverages the power of Azure services and Snowflake Cloud to load and process large volumes of data with continuous availability, high scalability, and data distribution from the comfort of CLI. In this way, Azure increases the fault tolerance of Snowflake clusters.

The power of Azure ensures maximum performance and high throughput for Snowflake nodes by providing a low latency network. Since SnowSQL mirrors Snowflake which stores and computes data, the elastic nature of the cloud will allow it to load data faster or run a high volume of queries.

Recovery of data has never been this easy. It also provides optimized query parsing and strong integration. Install the SnowSQL offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Click on the button below to head over to the Azure Marketplace and deploy SnowSQL for FREE by clicking on “Get it now”.

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account.

October 7, 2022

Data Engineering

Guest Blog

Build a predictive model of your house with Azure machine learning

Learn how companies like Zillow predict the value of your home. Build a predictive model using Azure machine learning that estimates the real estate sales price of a house.

The Ames housing dataset includes 81 features and 1460 observations. Each observation represents the sale of a home and each feature is an attribute describing the house or the circumstance of the sale.

Clone this experiment to build a predictive model

A full copy of this experiment has been posted to the Cortana Intelligence Gallery. Go to the link and click on “open in Studio.”

Preprocessing & data exploration

Drop low-value columns

Begin by identifying features (columns) that add little to no value for predictive modeling. These columns will be dropped using the “select columns from dataset” module.

The following columns were chosen to be “excluded” from the dataset:

Id, Street, Alley, PoolQC, Utilities, Condition2, RoofMatl, MiscVal, PoolArea, 3SsnPorch, LowQualFinSF, MiscFeature, LandSlope, Functional, BsmtHalfBath, ScreenPorch, BsmtFinSF2, EnclosedPorch.

These low-quality features were removed to improve the model’s performance. Low quality includes lack of representative categories, too many missing values, or noisy features.
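
The article performs this step with a visual module in Azure Machine Learning Studio. For readers who prefer code, the same idea can be sketched in plain Python; the `drop_columns` helper and the two-row sample are hypothetical, while the excluded column names are the ones listed above.

```python
# Columns the article excludes from the Ames housing dataset.
EXCLUDED = {
    "Id", "Street", "Alley", "PoolQC", "Utilities", "Condition2", "RoofMatl",
    "MiscVal", "PoolArea", "3SsnPorch", "LowQualFinSF", "MiscFeature",
    "LandSlope", "Functional", "BsmtHalfBath", "ScreenPorch", "BsmtFinSF2",
    "EnclosedPorch",
}


def drop_columns(rows, excluded=EXCLUDED):
    """Return copies of the rows without the excluded columns."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


# Hypothetical two-row sample using a few of the dataset's column names.
sample = [
    {"Id": 1, "Street": "Pave", "LotArea": 8450, "SalePrice": 208500},
    {"Id": 2, "Street": "Pave", "LotArea": 9600, "SalePrice": 181500},
]
cleaned = drop_columns(sample)
```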

Define categorical variables

We must now define which values are non-continuous by casting them as categorical. Mathematical approaches for continuous and non-continuous values differ greatly. Nominal categorical features were identified and cast to categorical data types using the metadata editor to ensure proper mathematical treatment by the machine learning algorithm.

The first edit metadata module will cast all strings. The column “MSSubClass” uses numeric integer codes to represent the type of building the house is, and therefore should not be treated as a continuous numeric value but rather a categorical feature. We will use another metadata editor to cast it into a category.

Clean missing data

Most algorithms are unable to account for missing values, and some treat them differently from others. To address this, we must make sure our dataset contains no missing, “null,” or “NA” values.

Replacement of missing values is the most versatile and preferred method because it allows us to keep our data. It also minimizes collateral damage to other columns due to one cell’s bad behavior. Numerical values can easily be replaced with statistical values such as mean, median, or mode, while categories are commonly dealt with by replacing missing values with the mode or a separate categorical value for unknowns.

For simplicity, all categorical missing values were cleaned with the mode and all numeric features were cleaned using the median. To further improve a model’s performance, custom cleaning functions should be tried and implemented on each individual feature rather than a blanket transformation of all columns.
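
A minimal Python sketch of this blanket cleaning rule, assuming missing cells are represented as `None`. The `impute` helper and the sample rows are hypothetical; the article itself uses the Studio's clean missing data module.

```python
from statistics import median, mode


def impute(rows, numeric_cols, categorical_cols):
    """Fill None in numeric columns with the median and in categorical columns with the mode."""
    fills = {}
    for col in numeric_cols:
        fills[col] = median([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fills[col] = mode([r[col] for r in rows if r[col] is not None])
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical sample with one missing numeric and one missing categorical cell.
rows = [
    {"LotFrontage": 65.0, "GarageType": "Attchd"},
    {"LotFrontage": None, "GarageType": "Attchd"},
    {"LotFrontage": 80.0, "GarageType": None},
    {"LotFrontage": 68.0, "GarageType": "Detchd"},
]
filled = impute(rows, numeric_cols=["LotFrontage"], categorical_cols=["GarageType"])
```

A per-feature approach, as the paragraph above recommends, would replace the two loops with a mapping from column name to its own cleaning function.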

Machine learning – Model building

Statistical feature selection

Not every feature in its current form is expected to contain predictive value, and some may mislead or add noise to the model. To filter these out, we will compute the Pearson correlation of every feature against the response class (sales price) as a quick measure of predictive strength. Only the top X strongest features from this method are kept; the remaining features are left behind.

This number can be tuned for further model performance increases.
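
The selection rule can be written out in a few lines. This sketch assumes numeric features only and ranks by the absolute value of the Pearson coefficient; the helper names and the tiny sample are hypothetical, and `k` plays the role of the tunable "top X".

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def top_k_features(features, target, k):
    """Rank features by |r| against the target and keep the k strongest."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical values for three columns and the response.
features = {
    "GrLivArea": [1000, 1500, 2000, 2500],
    "YrSold": [2008, 2006, 2009, 2007],
    "OverallQual": [5, 6, 8, 9],
}
sale_price = [120000, 160000, 230000, 270000]
selected = top_k_features(features, sale_price, k=2)
```

Note that `pearson` divides by zero for a constant column, so such columns should be dropped beforehand.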

Select an algorithm

First, we must identify what kind of machine learning problem this is: classification, regression, clustering, etc. Since the response class (sales price) is a continuous numeric value, we can tell that it is a regression problem. We will use a linear regression model with regularization to reduce the over-fitting of the model.

To ensure a stable convergence of weight and biases, all features except the response class must be normalized to be placed into the same range.
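
One common way to place features into the same range is min-max scaling. The article does not say which normalization was used, so treat the choice below as an assumption; the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


lot_area = [8450, 9600, 11250, 14260]   # hypothetical square-foot values
overall_qual = [5, 6, 7, 9]             # hypothetical 1-10 ratings

scaled_area = min_max_scale(lot_area)
scaled_qual = min_max_scale(overall_qual)
```

After scaling, a feature measured in thousands of square feet and a rating from 1 to 10 contribute gradients of comparable size, which is what helps the weights converge.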

Model training and evaluation

The method of cross-validation will be used to evaluate the predictive performance of the model as well as that performance’s stability in regard to new data. Cross-validation will build ten different models on the same algorithm but with different and non-repeating subsets of the same dataset. The evaluation metrics on each of the ten models will be averaged and a standard deviation will infer the stability of the average performance.
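
The procedure can be sketched in plain Python. To stay self-contained, the example scores a trivial baseline that always predicts the training-fold mean instead of the article's regression model; the data and helper names are hypothetical, and a real run would usually shuffle the rows before splitting.

```python
from math import sqrt
from statistics import mean, stdev


def k_fold_indices(n, k):
    """Split range(n) into k contiguous, non-overlapping validation folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        yield list(range(start, start + size))
        start += size


def cross_validate(ys, k):
    """Score a baseline that predicts the training-fold mean; return per-fold RMSE."""
    scores = []
    for val_idx in k_fold_indices(len(ys), k):
        held_out = set(val_idx)
        train = [y for i, y in enumerate(ys) if i not in held_out]
        prediction = mean(train)
        rmse = sqrt(mean([(ys[i] - prediction) ** 2 for i in val_idx]))
        scores.append(rmse)
    return scores


prices = [float(100 + 5 * (i % 7)) for i in range(50)]  # hypothetical target values
fold_rmse = cross_validate(prices, k=10)
summary = (mean(fold_rmse), stdev(fold_rmse))
```

`summary` holds the two numbers the article reports for each model: the mean RMSE across folds and its standard deviation.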

Parameter tuning

This experiment will build a regression model that minimizes the mean RMSE of the cross-validation results with the lowest variance possible (but also considers bias-variance trade-offs).

The first regression model was built using default parameters and produced a very inaccurate model ($124,942 mean RMSE) that was also very unstable ($11,699 standard deviation).
The high bias and high variance of the previous model suggest the model is over-fitting to the outliers and is under-fitting the general population.
The L2 regularization weight will be decreased to lower the penalty of higher coefficients. After lowering the L2 regularization weight, the model is more accurate with an average cross-validation RMSE of $42,366.
The previous model is still quite unstable with a standard deviation of $8,121. Since this is a dataset with a small number of observations (1460), it may be better to increase the number of training epochs so that the algorithm has more passes to reach convergence.
This will increase training times but also increase stability. The third linear model had the number of training epochs increased and saw a better mean cross-validation RMSE of $36,684 and a much more stable standard deviation of $3,849.
The final model had a slight increase in the learning rate which improved both mean cross-validation RMSE and the standard deviation.
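
The three knobs tuned above (L2 regularization weight, number of training epochs, and learning rate) can be seen at work in a small gradient-descent sketch. This is a toy single-feature model on hypothetical data, not the Azure Machine Learning module used in the experiment, and the parameter values are chosen only to show the direction of each effect.

```python
def train_linear(xs, ys, l2_weight, epochs, learning_rate):
    """Fit y = w*x + b by batch gradient descent with an L2 penalty on w."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n + 2 * l2_weight * w
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def rmse(xs, ys, w, b):
    """Root mean squared error of the fitted line on the given points."""
    return (sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # a feature already scaled to [0, 1]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]     # hypothetical target: y = 2x + 1

heavy_penalty = train_linear(xs, ys, l2_weight=1.0, epochs=200, learning_rate=0.1)
light_penalty = train_linear(xs, ys, l2_weight=0.001, epochs=200, learning_rate=0.1)
more_epochs = train_linear(xs, ys, l2_weight=0.001, epochs=2000, learning_rate=0.1)
```

On this data, lowering the L2 weight reduces the error because the slope is no longer shrunk toward zero, and adding epochs reduces it further because the weights get more passes to converge.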

Deployment

The algorithm parameters that yielded the best results will be the ones that are shipped. The best algorithm (the last one) will be retrained using 100% of the data since cross-validation leaves 10% out each time for validation.

Further improvement of this Azure machine learning model

Feature engineering was entirely left out of this experiment. Try engineering more features from the existing dataset to see if the model will improve. Some columns that were originally dropped may become useful when combined with other features. For example, try bucketing the years in which the house was built by decade. Clustering the data may also yield some hidden insights.
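
As a small example of the decade bucketing suggested above, the following Python sketch turns a construction year into a categorical decade label. The function name and sample years are hypothetical.

```python
def decade_bucket(year: int) -> str:
    """Map a construction year to a decade label, e.g. 1976 -> '1970s'."""
    return f"{year // 10 * 10}s"


year_built = [1976, 2003, 1915, 2001]  # hypothetical construction years
decades = [decade_bucket(y) for y in year_built]
```

The resulting labels can then be cast to a categorical feature in the same way as the other nominal columns.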

Written by Phuc Duong

June 15, 2022

Machine Learning

Muhammad Sameer Hussain

Grafana – Taking over legacy systems to new heights

Data Science Dojo has launched a Grafana offering on the Azure Marketplace to help you harvest insights from your data. It leverages the power of Microsoft Azure services to visualize, query, and set alerts for your data while promoting teamwork and transparency.

Excel stopped working — Excel is Not Responding

Does the above visual seem familiar? How many times have you tried to meet a deadline only to be met by this? After all, spreadsheets can deal with complex calculations only up to a certain threshold.

Drawbacks of spreadsheets

Spreadsheets offer you a lot of cool features involving data entry, calculations, and manipulation. But dealing with all the cells and formulas can get overwhelming, making spreadsheets prone to errors that affect the integrity of the model.

There is a security and privacy issue when users store data in their individual spreadsheets and drives, and close collaboration becomes a hassle when data is stored on different platforms. It is impossible to keep track of where entries were altered or updated, which results in multiple versions of the same file and undermines overall confidence in the model.

Finally, it is not possible to present a stack of spreadsheets to your audience, because they need a story that cannot be conveyed via rows and columns of large data. All these problems can be overcome by using Grafana to generate insightful dashboards that summarize all your data into easy-to-read visuals and alerts, which makes generating actionable items much easier!

What is Grafana?

Grafana is built on the principle that data should be accessible to everyone. It allows visualizations to be shared, promoting teamwork and transparency. It enables its customers to take any of their existing data and visualize it however they want. It offers services for advanced querying and transformation and enables customers to create customized dashboards and panels, catering to their specific needs. We here at Data Science Dojo deliver data science education, consulting, and technical services to increase the power of data.

Thus, we are adding a Grafana instance to the Azure Marketplace to help you harvest insights from your data. It leverages the power of Microsoft Azure services to capture visits and events and to monitor user actions. Install our Grafana offering now and get started on your journey towards optimal analysis.

Why Grafana?

Unify your data from various platforms

Grafana offers you the option to integrate your data from various platforms, including both Azure and non-Azure services. That’s right! It doesn’t matter if your data is in Google Sheets or Azure Cosmos DB. You can connect to any of these sources at once!

Search & query through your data

Imagine having to go through a thousand spreadsheets just to find one single entry that satisfies your condition. Sounds impossible? Not with Grafana. In its collaborative environment, you can write custom data analytics queries to filter out the data that fits your requirements.

Customized visualization & dashboards

Grafana offers you the option to generate highly customized visualizations that help you gain tactical insight from data that is often ignored. Leverage the power of Azure to collaborate and share various Grafana Dashboards with different stakeholders within and outside your organization.

Alerts

It can be difficult to constantly monitor your crucial KPIs and metrics, and sometimes you may not realize a KPI has dipped below the margin until it is too late. Grafana lets you set up custom alerts to monitor these metrics and send notifications on platforms such as Slack and Microsoft Teams when it is the right time to act.

June 10, 2022

Machine Learning
