Programming Language

Mastering mutable and immutable objects in Python
Shehryar Mallick
| March 13, 2023

This blog explores the difference between mutable and immutable objects in Python. 

Python is a powerful programming language with a wide range of applications in various industries. Understanding how to use mutable and immutable objects is essential for efficient and effective Python programming. In this guide, we will take a deep dive into mastering mutable and immutable objects in Python.

Mutable objects

In Python, an object is considered mutable if its value can be changed after it has been created. This means that any operation that modifies a mutable object will modify the original object itself, and every reference to that object sees the change. The built-in mutable types in Python include lists, dictionaries, and sets.
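As a minimal sketch of in-place modification, the snippet below mutates a list, a dictionary, and a set. The variable names are illustrative only and do not come from the original post.

```python
# A list is mutable: in-place operations change the same object.
numbers = [1, 2, 3]
alias = numbers                # a second name bound to the same list object
original_id = id(numbers)

numbers.append(4)              # modifies the list in place
numbers[0] = 10                # item assignment also mutates it

same_object = id(numbers) == original_id   # no new list was created
print(numbers)                 # [10, 2, 3, 4]
print(alias)                   # [10, 2, 3, 4]; the alias sees the change too
print(same_object)             # True

# Dictionaries and sets can be changed in place in the same way.
settings = {"debug": False}
settings["debug"] = True
tags = {"python"}
tags.add("sql")
```

Because `alias` and `numbers` refer to one object, a change made through either name is visible through both.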







Advantages of mutable objects 

  • They can be modified in place, which can be more efficient than recreating an immutable object. 
  • They can be used for more complex and dynamic data structures, like lists and dictionaries. 

Disadvantages of mutable objects 

  • They can be modified by another thread, which can lead to race conditions and other concurrency issues. 
  • Built-in mutable types such as lists, dictionaries, and sets are unhashable, so they can’t be used as keys in a dictionary or elements in a set. 
  • They can be more difficult to reason about and debug because their state can change unexpectedly.
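The hashing restriction can be seen directly. This is a small sketch with a hypothetical `point_list` variable; the exact wording of the error message differs between Python versions, so it is only printed here, not relied on.

```python
point_list = [1, 2]

# Using a list as a dictionary key fails because lists are unhashable.
try:
    lookup = {point_list: "start"}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# The same restriction applies to set elements.
try:
    unique_points = {point_list}
    set_failed = False
except TypeError:
    set_failed = True

print(set_failed)   # True
```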


While mutable objects are a powerful feature of Python, they can also be tricky to work with, especially when dealing with multiple references to the same object. By following best practices and being mindful of the potential pitfalls of using mutable objects, you can write more efficient and reliable Python code.

Immutable objects 

In Python, an object is considered immutable if its value cannot be changed after it has been created. This means that any operation that appears to modify an immutable object actually returns a new object with the modified value, leaving the original untouched. Examples of immutable objects in Python include strings, tuples, and numbers.
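A short sketch of this behavior for a string, a tuple, and an integer follows. The names are illustrative and are not taken from the original post.

```python
# Strings: "+=" builds a new string and rebinds the name.
text = "hello"
original = text
text += " world"
print(original)            # hello
print(text)                # hello world
print(text is original)    # False

# Tuples: item assignment is rejected with a TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_error = None
except TypeError as exc:
    tuple_error = str(exc)
print(tuple_error is not None)   # True

# Numbers: arithmetic produces a new value; other names keep the old one.
count = 5
snapshot = count
count += 1
print(snapshot, count)     # 5 6
```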



Advantages of immutable objects 

  • They are safer to use in a multi-threaded environment as they cannot be modified by another thread once created, thus reducing the risk of race conditions. 
  • They can be used as keys in a dictionary because they are hashable and their hash value will not change. 
  • They can be used as elements of a set because they are hashable, and their value will not change. 
  • They are simpler to reason about and debug because their state cannot change unexpectedly. 
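The hashability points above can be illustrated with a brief sketch. The coordinates and names are made-up sample data, and `frozenset` is Python's built-in immutable counterpart of `set`.

```python
# Tuples are hashable as long as their elements are, so they work as keys.
city_by_coordinates = {
    (47.61, -122.33): "Seattle",
    (40.71, -74.01): "New York",
}
print(city_by_coordinates[(47.61, -122.33)])   # Seattle

# Strings are hashable, so a set can deduplicate them.
seen_words = {"python", "sql", "python"}
print(len(seen_words))                         # 2

# frozenset is immutable, so it can itself be an element of a set.
groups = {frozenset({"a", "b"}), frozenset({"b", "a"})}
print(len(groups))                             # 1
```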

Disadvantages of immutable objects

  • They need to be recreated if their value needs to be changed, which can be less efficient than modifying the state of a mutable object. 
  • They can use more memory when values change often, because each change creates a new object instead of updating an existing one. 

How to work with mutable and immutable objects?

To work with mutable and immutable objects in Python, it is important to understand their differences. Immutable objects cannot be modified after they are created, while mutable objects can. Use immutable objects for values that should not be modified, and mutable objects for when you need to modify the object’s state or contents. When working with mutable objects, be aware of side effects that can occur when passing them as function arguments. To avoid side effects, make a copy of the mutable object before modifying it or use immutable objects as function arguments.
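The side effect described above, and the copy-based way to avoid it, can be sketched as follows. The function names `add_item_in_place` and `add_item_safely` are hypothetical and chosen only for illustration.

```python
import copy


def add_item_in_place(items, item):
    """Mutates the caller's list: a side effect."""
    items.append(item)
    return items


def add_item_safely(items, item):
    """Works on a shallow copy, so the caller's list is left alone."""
    new_items = list(items)
    new_items.append(item)
    return new_items


cart = ["apple"]
add_item_in_place(cart, "pear")
print(cart)            # ['apple', 'pear']

other_cart = ["apple"]
result = add_item_safely(other_cart, "pear")
print(other_cart)      # ['apple']
print(result)          # ['apple', 'pear']

# For nested structures, a deep copy also protects the inner objects.
nested = {"fruits": ["apple"]}
clone = copy.deepcopy(nested)
clone["fruits"].append("pear")
print(nested)          # {'fruits': ['apple']}
```

A shallow copy is enough when the list holds only immutable items; `copy.deepcopy` is the safer choice when it holds other mutable objects.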

Wrapping up

In conclusion, mastering mutable and immutable objects is crucial to becoming an efficient Python programmer. By understanding the differences between mutable and immutable objects and implementing best practices when working with them, you can write better Python code and optimize your memory usage. We hope this guide has provided you with a comprehensive understanding of mutable and immutable objects in Python.


Mastering 10 essential SQL commands – A comprehensive guide to becoming an expert
Ruhma Khawaja
| March 10, 2023

As the amount of data being generated and stored by companies and organizations continues to grow, the ability to effectively manage and manipulate this data using databases has become increasingly important for developers. SQL, short for Structured Query Language, is a programming language widely used for managing data stored in relational databases.

SQL commands enable developers to perform a wide range of tasks such as creating tables, inserting and modifying data, retrieving data, searching databases, and much more. In this guide, we will highlight the basic SQL commands that every developer should be familiar with. 

What is SQL?

For the uninitiated, SQL is primarily used to manage and manipulate data in relational databases. Relational databases are a type of database that organizes data into tables with rows and columns, like a spreadsheet. SQL is used to create, modify, and query these tables and the data stored in them. 


With SQL commands, developers can create tables and other database objects, insert and update data, delete data, and retrieve data from the database using SELECT statements. Developers can also use SQL to create, modify and manage indexes, which are used to improve the performance of database queries.

The language is used by many popular relational database management systems such as MySQL, PostgreSQL, and Microsoft SQL Server. While the syntax of SQL commands may vary slightly between different database management systems, the basic concepts are consistent across most implementations. 

Types of SQL Commands 

There are several types of SQL commands that are commonly used in relational databases, each with a specific purpose and function. Some of the most used SQL commands include: 

  1. Data Definition Language (DDL) commands: These commands are used to define the structure of a database, including tables, columns, and constraints. Examples of DDL commands include CREATE, ALTER, and DROP.
  2. Data Manipulation Language (DML) commands: These commands are used to manipulate data within a database. Examples of DML commands include SELECT, INSERT, UPDATE, and DELETE.
  3. Data Control Language (DCL) commands: These commands are used to control access to the database. Examples of DCL commands include GRANT and REVOKE.
  4. Transaction Control Language (TCL) commands: These commands are used to control transactions in the database. Examples of TCL commands include COMMIT and ROLLBACK.

Essential SQL commands

There are several essential SQL commands that you should know in order to work effectively with databases. Here are some of the most important SQL commands to learn:


CREATE

The CREATE statement is used to create a new table, view, or other database object. The basic syntax of a CREATE TABLE statement is as follows: 

CREATE TABLE table_name (column1 datatype, column2 datatype, ...);

The statement starts with the keyword CREATE, followed by the type of object you want to create (in this case, TABLE), and the name of the new object you’re creating (in place of “table_name”). Then you specify the columns of the table and their data types.

For example, if you wanted to create a table called “customers” with columns for ID, first name, last name, and email address, the CREATE TABLE statement might look like this:

This statement would create a table called “customers” with columns for ID, first name, last name, and email address, with their respective data types specified. The ID column is also set as the primary key for the table.
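One way to try such a statement is through Python's built-in sqlite3 module with an in-memory database. This is only a sketch: the column data types below are assumptions, since the original statement is not shown, and SQLite is used purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# Assumed column types; the ID column is the primary key.
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        first_name VARCHAR(50),
        last_name VARCHAR(50),
        email VARCHAR(100)
    )
    """
)

# PRAGMA table_info lists one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
print(columns)   # ['id', 'first_name', 'last_name', 'email']
conn.close()
```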


SELECT

The SELECT statement is used to retrieve data from one or more tables. The basic syntax of a SELECT statement is as follows: 

SELECT column1, column2, ... FROM table_name;

The SELECT statement starts with the keyword SELECT, followed by a list of the columns you want to retrieve. You then specify the table or tables from which you want to retrieve the data, using the FROM clause. You can also use the JOIN clause to combine data from two or more tables based on a related column.

You can use the WHERE clause to filter the results of a query based on one or more conditions. You can also use GROUP BY to group the results by one or more columns. The HAVING clause is used to filter the groups based on a condition, while the ORDER BY clause can be used to sort the results by one or more columns.  
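The clauses described above can be combined in a single query. The sketch below runs one through Python's sqlite3 module; the `orders` table and its rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("alice", 70.0), ("bob", 20.0),
     ("carol", 5.0), ("carol", 200.0)],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10          -- filter individual rows first
    GROUP BY customer           -- then group the remaining rows
    HAVING SUM(amount) > 50     -- then filter the groups
    ORDER BY total DESC         -- finally sort the result
    """
).fetchall()

print(rows)   # [('carol', 200.0), ('alice', 100.0)]
conn.close()
```

WHERE removes carol's 5.0 order before grouping, and HAVING then drops bob, whose total of 20.0 is not above 50.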


INSERT

INSERT is used to add new data to a table in a database. The basic syntax of an INSERT statement is as follows: 

INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

The statement begins with the keywords INSERT INTO, followed by the name of the table where the data will be inserted. You then specify the names of the columns in which you want to insert the data, enclosed in parentheses and separated by commas. Finally, you specify the values you want to insert, also enclosed in parentheses and separated by commas. 
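As a rough illustration, the snippet below inserts one row using Python's sqlite3 module. The table layout and the sample customer are assumptions made for the example; the `?` placeholders let the driver pass the values safely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT)"
)

# Column names in the first parentheses, matching values in the second.
conn.execute(
    "INSERT INTO customers (first_name, last_name, email) VALUES (?, ?, ?)",
    ("Ada", "Lovelace", "ada@example.com"),
)
conn.commit()

rows = conn.execute(
    "SELECT id, first_name, last_name, email FROM customers"
).fetchall()
print(rows)   # [(1, 'Ada', 'Lovelace', 'ada@example.com')]
conn.close()
```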


UPDATE

Another common SQL command is the UPDATE statement. It is used to modify existing data in a table in a database. The basic syntax of an UPDATE statement is as follows: 

UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition;

The UPDATE statement starts with the keyword UPDATE, followed by the name of the table you want to update. You then specify the new values for one or more columns using the SET clause and use the WHERE clause to specify which rows to update. 
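A small sketch of an UPDATE with a WHERE clause, again using Python's sqlite3 module with made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers (first_name, email) VALUES (?, ?)",
    [("Ada", "ada@old.example"), ("Grace", "grace@example.com")],
)

# Only rows matching the WHERE clause are changed.
cursor = conn.execute(
    "UPDATE customers SET email = ? WHERE first_name = ?",
    ("ada@example.com", "Ada"),
)
updated = cursor.rowcount

rows = conn.execute(
    "SELECT first_name, email FROM customers ORDER BY id"
).fetchall()
print(updated)   # 1
print(rows)      # [('Ada', 'ada@example.com'), ('Grace', 'grace@example.com')]
conn.close()
```

Leaving out the WHERE clause would apply the new value to every row in the table.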


DELETE

Next up is DELETE, which is used to delete data from a table in a database. The basic syntax of a DELETE statement is as follows: 

DELETE FROM table_name WHERE condition;

In the above-mentioned code snippet, the statement begins with the keyword DELETE FROM. Then, we add the table name from which data must be deleted. You then use the WHERE clause to specify which rows to delete. 
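The same pattern can be tried from Python's sqlite3 module. The table and rows below are sample data invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO customers (first_name) VALUES (?)",
    [("Ada",), ("Grace",), ("Linus",)],
)

# The WHERE clause limits the deletion to matching rows.
cursor = conn.execute("DELETE FROM customers WHERE first_name = ?", ("Grace",))
deleted = cursor.rowcount

remaining = [
    row[0] for row in conn.execute("SELECT first_name FROM customers ORDER BY id")
]
print(deleted)     # 1
print(remaining)   # ['Ada', 'Linus']
conn.close()
```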


ALTER

The ALTER command in SQL is used to modify an existing table, database, or other database object. It can be used to add, modify, or delete columns, constraints, or indexes from a table, or to change the name or other properties of a table, database, or other object. Here is an example of using the ALTER command to add a new column to a table called “users”: 

ALTER TABLE users ADD email VARCHAR(50);

In this example, the ALTER TABLE command is used to modify the “users” table. The ADD keyword is used to indicate that a new column is being added, and the column is called “email” and has a data type of VARCHAR with a maximum length of 50 characters. 
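A sketch of the same change run through Python's sqlite3 module is shown below. The starting layout of the “users” table is an assumption, and the statement uses the ADD COLUMN spelling that SQLite documents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Add the new column to the existing table.
conn.execute("ALTER TABLE users ADD COLUMN email VARCHAR(50)")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(columns)   # ['id', 'name', 'email']
conn.close()
```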


DROP

The DROP command in SQL is used to delete a table, database, or other database object. When a table, database, or other object is dropped, all the data and structure associated with it are permanently removed and cannot be recovered. So, it is important to be careful when using this command. Here is an example of using the DROP command to delete a table called “tablename1”: 

DROP TABLE tablename1;

In this example, the DROP TABLE command is used to delete the “tablename1” table from the database. Once the table is dropped, all the data and structure associated with it are permanently removed and cannot be recovered. It is also possible to use the DROP command to delete a database, an index, a view, a trigger, or a sequence using similar syntax, replacing TABLE with the corresponding keyword. Constraints are removed differently, through ALTER TABLE ... DROP CONSTRAINT. 
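The effect of DROP TABLE can be checked with a short sqlite3 sketch. The `sqlite_master` catalog used to list tables is specific to SQLite; other systems expose this information differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename1 (id INTEGER PRIMARY KEY)")

list_tables = "SELECT name FROM sqlite_master WHERE type = 'table'"
tables_before = [row[0] for row in conn.execute(list_tables)]

conn.execute("DROP TABLE tablename1")   # removes the table and its data
tables_after = [row[0] for row in conn.execute(list_tables)]

print(tables_before)   # ['tablename1']
print(tables_after)    # []
conn.close()
```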


TRUNCATE

The SQL TRUNCATE command is used to delete all the data from a table. In many database systems it also resets the table's auto-incrementing counter. Because most systems treat it as a DDL operation, it is typically much faster than DELETE on large tables: it logs far less than a row-by-row DELETE and does not fire the DELETE triggers associated with the table. The exact behavior varies between database systems. Here is an example of using the TRUNCATE command to delete all data from a table called “customers”: 

TRUNCATE TABLE customers;

In this example, the TRUNCATE TABLE command is used to delete all data from the “customers” table. Once the command is executed, the table will be empty, and in most systems the auto-incrementing counter will be reset. Note that TRUNCATE is not a drop-in substitute for DELETE: it removes every row and cannot be restricted with a WHERE clause, and it can only be used on tables, not on views or other database objects. 


INDEX

The SQL INDEX command is used to create or drop indexes on one or more columns of a table. An index is a data structure that improves the speed of data retrieval operations on a table at the cost of slower data modification operations. Here is an example of using the CREATE INDEX command to create a new index on a table called “tablename1” on the column “first_name”: 

CREATE INDEX idx_first_name ON tablename1 (first_name);

In this example, the CREATE INDEX command is used to create a new index called “idx_first_name” on the column “first_name” of the “tablename1” table. This index will improve the performance of queries that filter or sort data based on the “first_name” column. 
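Creating the index can be sketched with Python's sqlite3 module as follows. The table layout is an assumption, and `PRAGMA index_list` is a SQLite-specific way to confirm the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename1 (id INTEGER PRIMARY KEY, first_name TEXT)"
)

# Build an index on the column used for filtering and sorting.
conn.execute("CREATE INDEX idx_first_name ON tablename1 (first_name)")

# Each row of PRAGMA index_list describes one index; index 1 is its name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list(tablename1)")]
print(index_names)   # ['idx_first_name']
conn.close()
```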


JOIN

Finally, we have the JOIN command, which is primarily used to combine rows from two or more tables based on a related column between them. It allows you to query data from multiple tables as if they were a single table. It is used for retrieving data that is spread across multiple tables, or for creating more complex reports and analyses.  

INNER JOIN – With an INNER JOIN, the database returns only the rows that have matching values in both tables. 

LEFT JOIN – The LEFT JOIN command returns all rows from the left table, along with any matching rows from the right table. If there is no match, NULL values will be returned for the right table’s columns. 

RIGHT JOIN – In the RIGHT JOIN, the database returns all rows from the right table and possible matching rows from the left table. In case there is no match, NULL values will be returned for the left table’s columns. 

FULL OUTER JOIN – This type of JOIN returns all rows from both tables, combining rows where a match exists. If there is no match, NULL values will be returned for the columns of the table without a matching row. 

CROSS JOIN – This type of JOIN returns the Cartesian product of both tables, meaning it returns all combinations of rows from both tables. This can be useful for creating a matrix of data but can be slow and resource-intensive with large tables. 

Furthermore, it is also possible to use JOINs with subqueries and add ON or USING clauses to specify the columns that one wants to join.
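The difference between INNER JOIN and LEFT JOIN can be seen in a small sqlite3 sketch. The `customers` and `orders` tables and their rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)", [(1, "keyboard")]
)

# INNER JOIN keeps only customers that have a matching order.
inner_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL.
left_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

print(inner_rows)   # [('Ada', 'keyboard')]
print(left_rows)    # [('Ada', 'keyboard'), ('Grace', None)]
conn.close()
```

SQL NULL is returned to Python as `None`, which is why Grace appears with `None` in the LEFT JOIN result.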

Bottom line 

In conclusion, SQL is a powerful tool for managing and retrieving data in a relational database. The commands covered in this blog, including SELECT, INSERT, UPDATE, and DELETE, are among the most used SQL commands and provide the foundation for performing a wide range of operations on a database. Understanding these commands is essential for anyone working with SQL and relational databases.

With practice and experience, you will become more proficient in using these commands and be able to create more complex queries to meet your specific needs. 



Creating a web app for Gradio application on Azure using Docker: A step-by-step guide
Syed Umair Hasan
| February 22, 2023

In this step-by-step guide, learn how to deploy a web app for Gradio on Azure with Docker. This blog covers everything from Azure Container Registry to Azure Web Apps, with a step-by-step tutorial for beginners.

I was searching for ways to deploy a Gradio application on Azure, but there wasn’t much information to be found online. After some digging, I realized that I could use Docker to deploy custom Python web applications, which was perfect since I had neither the time nor the expertise to go through the “code” option on Azure. 

The process of deploying a web app begins by creating a Docker image, which contains all of the application’s code and its dependencies. This allows the application to be packaged and pushed to the Azure Container Registry, where it can be stored until needed. From there, it can be deployed to the Azure App Service, where it is run as a container and can be managed from the Azure Portal. In this portal, users can adjust the settings of their app, as well as grant access to roles and services when needed. 

Once everything is set and the necessary permissions have been granted, the web app should be able to run properly on Azure. Deploying a web app on Azure using Docker is an easy and efficient way to create and deploy applications, and can be a great solution for those who lack the necessary coding skills to create a web app from scratch!

Comprehensive overview

Gradio application 

Gradio is a Python library that allows users to create interactive demos and share them with others. It provides a high-level abstraction through the Interface class, while the Blocks API is used for designing web applications.

Blocks provides features like multiple data flows and demos, control over where components appear on the page, handling complex data flows, and the ability to update properties and visibility of components based on user interaction. With Gradio, users can create a web application that allows their users to interact with their machine learning model, API, or data science workflow. 

The two primary files in a Gradio Application are:

  1. App.py: This file contains the source code for the application.
  2. Requirements.txt: This file lists the Python libraries required for the source code to function properly.


Docker

Docker is an open-source platform for automating the deployment, scaling, and management of applications as containers. It uses a container-based approach to package software, which enables applications to be isolated from each other, making it easier to deploy, run, and manage them in a variety of environments. 

A Docker container is a lightweight, standalone, and executable software package that includes everything needed to run a specific application, including the code, runtime, system tools, libraries, and settings. Containers are isolated from each other and from the host operating system, making them ideal for deploying microservices and applications that have multiple components or dependencies. 

Docker also provides a centralized way to manage containers and share images, making it easier to collaborate on application development, testing, and deployment. With its growing ecosystem and user-friendly tools, Docker has become a popular choice for developers, system administrators, and organizations of all sizes. 

Azure Container Registry 

Azure Container Registry (ACR) is a fully-managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. It allows you to store, manage, and deploy Docker containers in a secure and scalable way, making it an important tool for modern application development and deployment. 

With ACR, you can store your own custom images and use them in your applications, as well as manage and control access to them with role-based access control. Additionally, ACR integrates with other Azure services, such as Azure Kubernetes Service (AKS) and Azure DevOps, making it easy to deploy containers to production environments and manage the entire application lifecycle. 

ACR also provides features such as image signing and scanning, which helps ensure the security and compliance of your containers. You can also store multiple versions of images, allowing you to roll back to a previous version if necessary. 

Azure Web App 

Azure Web Apps is a fully-managed platform for building, deploying, and scaling web applications and services. It is part of the Azure App Service, which is a collection of integrated services for building, deploying, and scaling modern web and mobile applications. 

With Azure Web Apps, you can host web applications written in a variety of programming languages, such as .NET, Java, PHP, Node.js, and Python. The platform automatically manages the infrastructure, including server resources, security, and availability, so that you can focus on writing code and delivering value to your customers. 

Azure Web Apps supports a variety of deployment options, including direct Git deployment, continuous integration and deployment with Visual Studio Team Services or GitHub, and deployment from Docker containers. It also provides built-in features such as custom domains, SSL certificates, and automatic scaling, making it easy to deliver high-performing, secure, and scalable web applications. 

A step-by-step guide to deploying a Gradio application on Azure using Docker

This guide assumes a foundational understanding of Azure and that Docker is installed on your desktop. Refer to the Mac, Windows, or Linux getting started instructions for Docker. 

Step 1: Create an Azure Container Registry resource 

Go to Azure Marketplace and search ‘container registry’ and hit ‘Create’. 


Under the “Basics” tab, complete the required information and leave the other settings as the default. Then, click “Review + Create.” 



Step 2: Create a Web App resource in Azure 

In Azure Marketplace, search for “Web App”, select the appropriate resource as depicted in the image, and then click “Create”. 



Under the “Basics” tab, complete the required information, choose the appropriate pricing plan and leave the other settings as the default. Then, click “Review + Create.”  



Once all deployments are complete, three resources will appear in your resource group. 

Step 3: Create a folder containing “App.py” file and its corresponding “requirements.txt” file 

To begin, we will utilize an emotion detector application, the model for which can be found at https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion. 



Step 4: Launch Visual Studio Code and open the folder


Step 5: Launch Docker Desktop to start Docker. 


Step 6: Create a Dockerfile 

A Dockerfile is a script that contains instructions to build a Docker image. This file automates the process of setting up an environment, installing dependencies, copying files, and defining how to run the application. With a Dockerfile, developers can easily package their application and its dependencies into a Docker image, which can then be run as a container on any host with Docker installed. This makes it easy to distribute and run the application consistently in different environments. The following contents should be utilized in the Dockerfile: 



Step 7: Build and run a local Docker image 

Run the following commands in the VS Code terminal. 

1. docker build -t demo-gradio-app . 

  • The “docker build” command builds a Docker image from a Dockerfile. 
  • The “-t demo-gradio-app” option specifies the name and optionally a tag to the name of the image in the “name:tag” format. 
  • The final “.” specifies the build context, which is the current directory where the Dockerfile is located.


2. docker run -it -d --name my-app -p 7000:7000 demo-gradio-app 

  • The “docker run” command starts a new container based on a specified image. 
  • The “-it” option opens an interactive terminal in the container and keeps the standard input attached to the terminal. 
  • The “-d” option runs the container in the background as a daemon process. 
  • The “--name my-app” option assigns a name to the container for easier management. 
  • The “-p 7000:7000” option maps a port on the host to a port inside the container, in this case, mapping the host’s port 7000 to the container’s port 7000. 
  • The “demo-gradio-app” is the name of the image to be used for the container. 

This command will start a new container with the name “my-app” from the “demo-gradio-app” image in the background, with an interactive terminal attached, and port 7000 on the host mapped to port 7000 in the container. 



To view your local app, navigate to the Containers tab in Docker Desktop and click on the link under Port. 


Step 8: Tag & Push Image to Azure Container Registry 

First enable ‘Admin user’ from ‘Access Keys’ tab in Azure Container Registry. 



Log in to your container registry using the following command. The login server, username, and password can be found on the ‘Access Keys’ tab from the previous step. 

docker login gradioappdemos.azurecr.io



Tag the image for uploading to your registry using the following command. 


docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app 

  • The command “docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app” is used to tag a Docker image. 
  • “docker tag” is the command used to create a new tag for a Docker image. 
  • “demo-gradio-app” is the source image name that you want to tag. 
  • “gradioappdemos.azurecr.io/demo-gradio-app” is the new image name with a repository name and optionally a tag in the “repository:tag” format. 
  • This command will create a new tag “gradioappdemos.azurecr.io/demo-gradio-app” for the “demo-gradio-app” image. This new tag can be used to reference the image in future Docker commands. 

Push the image to your registry. 

docker push gradioappdemos.azurecr.io/demo-gradio-app 

  • “docker push” is the command used to upload a Docker image to a registry. 
  • “gradioappdemos.azurecr.io/demo-gradio-app” is the name of the image with the repository name and tag to be pushed. 
  • This command will push the Docker image “gradioappdemos.azurecr.io/demo-gradio-app” to the registry specified by the repository name. The registry is typically a place where Docker images are stored and distributed to others. 


In the Repository tab, you can observe the image that has been pushed. 


Step 9: Configure the Web App 

Under the ‘Deployment Center’ tab, fill in the registry settings then hit ‘Save’. 



In the Configuration tab, create a new application setting for the website port 7000, as specified in the app.py file, and then hit ‘Save’. 





After the image extraction is complete, you can then view the web app URL from the Overview page. 



Step 10: Pushing Image to Docker Hub (Optional) 

Here are the steps to push a local Docker image to Docker Hub: 

  • Log in to your Docker Hub account using the following command: 

docker login

  • Tag the local image using the following command, replacing [username] with your Docker Hub username and [image_name] with the desired image name: 

docker tag [image_name] [username]/[image_name]

  • Push the image to Docker Hub using the following command: 

docker push [username]/[image_name] 

  • Verify that the image is now available in your Docker Hub repository by visiting https://hub.docker.com/ and checking your repositories. 

Wrapping it up

In conclusion, deploying a web application using Docker on Azure is an easy and efficient way to create and deploy applications. This method is suitable for those who lack the necessary coding skills to create a web app from scratch. Docker is an open-source platform for automating the deployment, scaling, and management of applications, as containers.

Azure Container Registry is a fully-managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. Azure Web Apps is a fully-managed platform for building, deploying, and scaling web applications and services. By following the step-by-step guide provided in this article, users can deploy a Gradio application on Azure using Docker.


Discover your potential: 5 Data Science projects to help you stand out as a Python student
Nathan Piccini
| February 3, 2023

In this blog post, we’ll explore five project ideas that can help you build expertise in computer vision, natural language processing (NLP), sales forecasting, cancer detection, and predictive maintenance using Python. 

As a data science student, it is important to continually build and improve your skills by working on projects that are both challenging and relevant to the field. 


Computer vision with Python and OpenCV 

Computer vision is a field of artificial intelligence that focuses on the development of algorithms and models that can interpret and understand visual information. One project idea in this area could be to build a facial recognition system using Python and OpenCV.

The project would involve training a model to detect and recognize faces in images and video and comparing the performance of different algorithms. To get started, you’ll want to become familiar with the OpenCV library, which is a powerful tool for image and video processing in Python. 


NLP with Python and NLTK/spaCy 

NLP is a field of AI that deals with the interaction between computers and human language. A great project idea in this area would be to develop a text classification system to automatically categorize news articles into different topics.

This project could use Python libraries such as NLTK or spaCy to preprocess the text data, and then train a machine learning model to make predictions. The NLTK library has many useful functions for text preprocessing, such as tokenization, stemming and lemmatization, and the spaCy library is a modern library for performing complex NLP tasks. 




Sales forecasting with Python and Pandas 

Sales forecasting is an important part of business operations, and as a data science student, you should have a good understanding of how to build models that can predict future sales. A project idea in this area could be to create a sales forecasting model using Python and Pandas.

The project would involve using historical sales data to train a model that can predict future sales numbers for a particular product or market. To get started, you’ll want to become familiar with the Pandas library, which is a powerful tool for data manipulation and analysis in Python. 



Cancer detection with Python and scikit-learn 

Cancer detection is a critical area of healthcare, and machine learning can play an important role in this field. A project idea in this area could be to build a machine learning model to predict the likelihood of a patient having a certain type of cancer.

The project would use a dataset of patient medical records and explore the use of different features and algorithms for making predictions. The scikit-learn library is a powerful tool for building machine learning models in Python, and it provides an easy-to-use interface to train, test, and evaluate your model. 




Predictive maintenance with Python and scikit-learn 

Predictive maintenance is a field of industrial operations that focuses on using data and machine learning to predict when equipment is likely to fail, so that maintenance can be scheduled in advance. A project idea in this area could be to develop a system that can analyze sensor data from the equipment, and use machine learning to identify patterns that indicate an imminent failure.

To get started, you’ll want to become familiar with the scikit-learn library and the concepts of clustering, classification, and regression, as well as the Python libraries for working with sensor data. 


In a nutshell:

These are just a few project ideas to help you build your skills as a data science student. Each of these projects offers the opportunity to work with real-world data, use powerful Python libraries and tools, and develop models that can make predictions and solve complex problems. As you work on these projects, you’ll gain valuable experience that will help you advance your career. 

Top 5 Python project ideas to start a career in programming
Ruhma Khawaja
| February 2, 2023

Are you looking for some great Python project ideas? Here is a list of the top 5 Python project ideas for students and aspiring programmers to practice.

Want to start a career in programming? Here are the top 5 Python project ideas 

If you keep tabs on the latest technologies, you are aware of how powerful and versatile Python is. It is widely used in numerous fields, from data science and machine learning to web development and game development. Its features made it a popular choice among developers in 2022, and that trend is expected to continue.  

The demand for using Python in IT projects is on the rise, due to its user-friendly nature and versatility in creating various technology applications. A growing number of individuals in the tech industry are looking for ways to improve their skills by taking on projects, volunteering, and internships using Python. As a student, learning Python can open many opportunities for you and help you build a wide range of projects that can highlight your skills and capabilities.  

Are you looking for some great Python Project Ideas? Here is a list of the top 5 Python project ideas for engineering students and aspiring coders to practice. 

Python project ideas – Data Science Dojo

1. Game Development 

Game development is a fun and challenging way to learn about programming and Python is a great language for building games. Using the Pygame library, you can easily create 2D games with features such as animation, sound, and user input. It is built on top of the SDL library, which provides low-level access to audio, keyboard, mouse, and display functions.

To create a simple game using Pygame, you will need to understand the basics of game development such as game loop, event handling, and game mechanics. You can use Pygame’s built-in functions to create a game window and display 2D graphics. This project will help you learn how to use Python for game development and gain experience with 2D graphics, animation, sound, and game mechanics. It will also give you a chance to explore the possibilities of Pygame library and create your own game. 


2. Weather App 

Creating a weather app is a great project idea for those interested in building applications that interact with external APIs. An API, short for Application Programming Interface, is a set of rules and protocols that allow software systems to communicate. In this case, we will be using a weather API that provides current weather information for a given location. To build this weather app, you will first need to find a weather API that you can use.

To build a weather app with the requests library in Python, first choose a weather API and sign up for an API key. Next, install the requests library, fetch weather data with requests.get(), and parse the response with json.loads(). Then use pandas and matplotlib to analyze and visualize the data, and create a user interface with a library like tkinter or PyQt. Lastly, add try-except blocks for error handling, and deploy your project on a web server or cloud platform if desired. 
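The parsing and error-handling steps can be sketched as follows. Since every weather API has its own response format, the field names used here ("name", "main", "temp", "weather", "description") and the sample payload are assumptions for illustration, not the schema of any particular service. In a real app, the text would come from something like requests.get(url, params=...).text rather than a hard-coded string.

```python
import json

def summarize_weather(payload_text):
    """Parse a JSON weather payload and return a one-line summary."""
    try:
        data = json.loads(payload_text)
        # Hypothetical field names; check your API's documentation.
        city = data["name"]
        temp_c = data["main"]["temp"]
        description = data["weather"][0]["description"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError) as exc:
        # Malformed or unexpected responses end up here instead of crashing the app.
        return f"Could not read weather data: {exc!r}"
    return f"{city}: {temp_c:.1f} °C, {description}"

sample = '{"name": "Seattle", "main": {"temp": 11.5}, "weather": [{"description": "light rain"}]}'
print(summarize_weather(sample))
```

Keeping the parsing in its own function makes it easy to test without network access and to plug into a tkinter or PyQt interface later.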




3. Data Analysis 

Data analysis is an essential skill for many fields, and Python is an excellent language for working with data. The pandas and matplotlib libraries are commonly used in data analysis and visualization. Pandas is a powerful library for working with data in Python. Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python. It is used to create a wide variety of plots, including line plots, scatter plots, histograms, and heat maps. It also allows you to customize the appearance of the plots to match your needs. 

To start this project, select a dataset and use pandas to read the data into a DataFrame so that you can perform various operations on it. Then, clean and filter the data. Next, use matplotlib to create various visualizations of the data. This project will help you learn how to work with data in Python, gain experience with data analysis and visualization, and learn to use the pandas and matplotlib libraries.  


4. Chatbot 

Another hot topic is creating a chatbot. A chatbot is a computer program that simulates human conversation, and it can be used in a wide range of applications, such as customer service, e-commerce, and personal assistants. To build a chatbot using Python, you will need to use a combination of NLP and ML techniques.

For NLP, you can use Python libraries such as NLTK and spaCy, which provide tools for tokenizing, stemming, and lemmatizing text, as well as for performing part-of-speech tagging and named entity recognition. This project is a good way to learn how to apply natural language processing and machine learning techniques in Python. 




5. Web Scraper 

Web scraping is the process of extracting data from websites and a web scraper is a tool that automates this process. Creating a web scraper using Python’s Beautiful Soup library is a great project idea for those interested in web development and data mining. To build a web scraper, you will first need to install the Beautiful Soup library and the requests library. Another way is Selenium, a tool used for automating web browsers to do several tasks. 

The requests library is used to send an HTTP request to a website and retrieve the HTML source code, while Beautiful Soup is used to parse the HTML and extract the data. Beautiful Soup’s methods and selectors are used to extract the data required. 
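To show what the parse-and-extract step looks like, here is a small sketch that uses Python's built-in html.parser module as a stand-in for Beautiful Soup, so it runs without installing anything. The HTML string and the LinkCollector class are made up for this example; in a real scraper the HTML would come from the requests library and Beautiful Soup's selectors would replace the hand-written handler methods.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag and the text inside <h1> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Only keep text that appears between <h1> and </h1>.
        if self._in_h1 and data.strip():
            self.headings.append(data.strip())

# Made-up page content standing in for a downloaded HTML document.
html = "<html><body><h1>Python News</h1><a href='/a'>A</a><a href='/b'>B</a></body></html>"
collector = LinkCollector()
collector.feed(html)
print(collector.headings, collector.links)
```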


Bottom Line 

In conclusion, there are countless possibilities for Python projects, and these are just a small selection of ideas to spark inspiration. The key to success is to find a project that aligns with your interests and start experimenting with the vast array of libraries and frameworks that Python has to offer. With a bit of creativity and persistence, you can create something truly remarkable and elevate your skills to new heights. 


Optimizing healthcare operations with Google OR-tools: A detailed case study in nurse scheduling
Umair Hasan
| January 25, 2023

Google OR-Tools is a software suite for optimization and constraint programming. It includes several optimization algorithms such as linear programming, mixed-integer programming, and constraint programming. These algorithms can be used to solve a wide range of problems, including scheduling problems, such as nurse scheduling.


Introducing the trio of software development, project management, and data science
Seif Sekalala
| January 24, 2023

In this blog post, the author introduces the new blog series about the titular three main disciplines or knowledge domains of software development, project management, and data science. Amidst the mercurial evolving global digital economy, how can job-seekers harness the lucrative value of those fields–esp. data science, vis-a-vis improving their employability?



To help us launch this blog series, I will gladly divulge two embarrassing truths. These are: 

  1. Despite my marked love of LinkedIn, and despite my decent / above-average levels of general knowledge, I cannot keep up with the ever-changing statistics or news reports vis-a-vis whether–at any given time, the global economy is favorable to job-seekers, or to employers, or is at equilibrium for all parties–i.e., governments, employers, and workers.
  2. Despite having rightfully earned those fancy three letters after my name, as well as a post-graduate certificate from the U. New Mexico & DS-Dojo, I (used to think I) hate math, or I (used to think I) cannot learn math; not even if my life depended on it!



Following my undergraduate years of college algebra and basic discrete math–and despite my hatred of mathematics since 2nd grade (chief culprit: multiplication tables!), I had fallen in love (head-over-heels indeed!) with the interdisciplinary field of research methods. And sure, I had lucked out in my Masters (of Arts in Communication Studies) program, as I only had to take the qualitative methods course.


Data Science Blog Series
A Venn-diagram depicting the disciplines/knowledge-domains of the new blog series.


But our instructor couldn’t really teach us about interpretive methods, ethnography, and qualitative interviewing etc., without at least “touching” on quantitative interviewing/surveys, quantitative data-analysis–e.g. via word counts, content-analysis, etc.

Fast-forward; year: 2012. Place: Drexel University–in Philadelphia, for my Ph.D. program (in Communication, Culture, and Media). This time, I had to face the dreaded mathematics/statistics monster. And I did, but grudgingly.

Let’s just get this over with, I naively thought; after all, besides passing this pesky required pre-qualifying exam course, who needs stats?!


About software development:

Fast-forward again; year: 2020. Place(s): Union, NJ and Wenzhou, Zhejiang Province; Hays, KS; and Philadelphia all over again. Five years after earning the Ph.D., I had to reckon with an unfair job loss, and chaotic seesaw-moves between China and the USA, and Philadelphia and Kansas, etc. 

Thus, one thing led to another, and soon enough, I was practicing algorithms and data-structures, learning about the basic “trouble-trio” of web-development–i.e., HTML, CSS, and JavaScript, etc.! 




But like many other folks who try this route, I soon came face-to-face with that oh-so-debilitative monster: self-doubt! No way, I thought. I’m NOT cut out to be a software-engineer! I thus dropped out of the bootcamp I had enrolled in and continued my search for a suitable “plan-B” career.


About project management:

Eventually (around mid/late-2021), I discovered the interdisciplinary field of project management. Simply defined (e.g. by Te Wu, 2020; link), project management is

“A time-limited, purpose-driven, and often unique endeavor to create an outcome, service, product, or deliverable.”

One can also break down the constituent conceptual parts of the field (e.g. as defined by Belinda Goodrich, 2021; link) as: 

  • Project life cycle, 
  • Integration, 
  • Scope, 
  • Schedule, 
  • Cost, 
  • Quality, 
  • Resources, 
  • Communications, 
  • Risk, 
  • Procurement, 
  • Stakeholders, and 
  • Professional responsibility / ethics. 


Ah…yes! I had found my sweet spot, indeed. Or so I thought. 


Hard truths:

Eventually, I experienced a series of events that can be termed “slow-motion epiphanies” and hard truths. Among many, below are three prime examples.


Hard Truth 1: The quantifiability of life:

For instance, among other “random” models: one can generally presume–with about 95% certainty (ahem!)–that most of the phenomena we experience in life can be categorized under three broad classes:


  1. Phenomena we can easily describe and order, using names (nominal variables);
  2. Phenomena we can easily group or measure in discrete and evenly-spaced amounts (ordinal variables);
  3. And phenomena that we can measure more accurately, and which: i) are characterized by trait number two above, and ii) have a true 0 (e.g., Wrench et al.; link).


Hard Truth 2: The probabilistic essence of life:

Regardless of our spiritual beliefs, or whether or not we hate math/science, etc., we can safely presume that the universe we live in is more or less a result of probabilistic processes (e.g., Feynman, 2013). 


Hard truth 3: What was that? “Show you the money (!),” you demanded? Sure! But first, show me your quantitative literacy, and critical-thinking skills!

And finally, related to both the above realizations: while it is true indeed that there are no guarantees in life, we can nonetheless safely presume that professionals can improve their marketability by demonstrating their critical-thinking-, as well as quantitative literacy skills.


Bottom line: The value of data science

Overall, the above three hard truths are prototypical examples of the underlying rationale(s) for this blog series. Each week, DS-Dojo will present our readers with some “food for thought” vis-a-vis how to harness the priceless value of data science and various other software-development and project-management skills / (sub-)topics. 


No, dear reader; please do not be fooled by that “OmG, AI is replacing us (!)” fallacy. Regardless of how “awesome” all these new fancy AI tools are, the human touch is indispensable!

How programming languages assist data analysts in reducing analysis bugs 
Erik Brooks
| December 23, 2022

In this blog, we are going to discuss the value addition provided by programming languages for data analysts.

Data analysts have one simple goal – to provide organizations with insights that inform better business decisions. And, to do this, the analytical process has to be successful. Unfortunately, as many data analysts would agree, encountering different types of analysis bugs when analyzing data is part of the data analytical process.

However, these bugs don’t have to be many if only preventive measures are taken every step of the way. This is where programming languages prove valuable for data analysts. Programming languages are one such valuable tool that helps data analysts to prevent and solve a number of data problems. These languages contain different bug-preventing attributes that make this possible. Here are some of these characteristics. 


Programming languages – Data Analysts

Type safety/strong typing 

When there is an inconsistency between varying data types for the variables, methods, and constants, the program behaves undesirably. In other words, type errors occur. For instance, this error can occur when a programmer treats a string as an integer or vice versa. 

Type safety is an attribute of programming languages that discourages type errors in a program. In languages that enforce it through declarations, type safety (or type soundness) requires programmers to define the type of each variable. If you think of a variable as a box, the programmer must declare the data type that is meant to go in the box as well as give the box a variable name. This ensures that values are only interpreted according to the rules of the declared data type, which prevents confusion about the data type. 
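The string-versus-integer mix-up mentioned above is easy to reproduce in Python. The sketch below uses made-up order quantities and a hypothetical total_quantity function: Python refuses to add an integer to a string and raises a TypeError instead of silently producing a wrong number, and an explicit conversion fixes the problem.

```python
def total_quantity(quantities):
    """Sum a list of order quantities, which are expected to be integers."""
    return sum(quantities)

raw_rows = ["3", "5", "2"]  # values read from a text file arrive as strings

try:
    total_quantity(raw_rows)  # sum() starts at 0, so this attempts 0 + "3"
    outcome = "no error"
except TypeError:
    outcome = "type error"

# Converting each value to int makes the intent explicit and the sum valid.
clean_total = total_quantity([int(value) for value in raw_rows])
print(outcome, clean_total)
```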


Immutability

If an object is immutable, then its value or state can’t be changed. Immutability in programming languages allows developers to use variables that can’t be mutated or changed after they are created. In a strictly immutable style, programs are built from values that behave like constants. How does this prevent problems? Immutable objects are easier to keep thread-safe than mutable objects. In a multithreaded application, a thread doesn’t have to worry about the other threads as it acts on an immutable object.

The reason here is that the thread knows that the object can’t be modified by anyone. The immutable approach in data analysis ensures that the original data set is not modified. In case a bug is identified in the code, the original data helps find a solution faster. In addition, immutability is valuable in creating safer data backups. In immutable data storage, data is safe from data corruption, deletion, and tampering.
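As a small sketch of the immutable approach in Python, the example below keeps the original readings in a tuple of frozen dataclass instances and derives a new tuple for the transformed values. The Reading class and the sensor values are assumptions invented for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Reading:
    sensor: str
    value: float

# A tuple cannot have items added, removed, or replaced after creation.
original = (Reading("s1", 20.5), Reading("s2", 19.0))

# Derive a new tuple rather than changing the original data.
rescaled = tuple(Reading(r.sensor, r.value * 2) for r in original)

try:
    original[0].value = 0.0  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

print(original[0].value, rescaled[0].value, mutated)
```

Because the original tuple is never modified, it remains available for comparison if a bug turns up in the transformation step.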



Expressiveness

Expressiveness in a programming language can be defined as the extent of ideas that can be communicated and represented in that language. If a language allows users to communicate their intent easily and detect errors early, that language can be termed expressive. Programming languages that are expressive allow programmers to write shorter code.

Moreover, shorter code has less incidental complexity and boilerplate, which makes it easier to identify errors. Talking of expressiveness, it is important to know that most programming languages are English-based.

When working with multilingual websites, it would be important to translate the languages to English for successful data analysis. However, there is the risk of distortion or meaning loss when applying analysis techniques to translated data. Working with professional translation companies eliminates these risks.

In addition, working in a language that they can understand makes it easy to spot errors. 

Static and dynamic typing 

These attributes of programming languages are used for error detection. They allow programmers to catch bugs and solve them before they cause havoc. The type-checking process in static typing happens at compile time.

If there is an error in the code, such as invalid type arguments, missing functions, or a discrepancy between the type of a variable and the data value assigned to it, static typing catches these bugs before the program runs. This means that code containing such type errors never gets to run. 

On the other hand, in dynamic typing, type-checking occurs at runtime. The error is reported when the faulty line executes, which still gives the programmer a chance to correct the code before the worst happens. 
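Python is dynamically typed, so it makes a convenient illustration of the runtime behavior described above. In this sketch (the describe function is hypothetical), the bug sits in a branch that is rarely taken, so the program runs fine until that branch is actually executed. The type annotations are ignored by Python at runtime, but a static checker such as mypy could use them to flag the faulty line before the program runs.

```python
def describe(count: int, verbose: bool = False) -> str:
    if verbose:
        # Bug: str + int. Python only notices when this line executes.
        return "Total: " + count
    return f"Total: {count}"

first = describe(3)  # runs fine; the faulty branch is never reached

try:
    describe(3, verbose=True)
    second = "no error"
except TypeError:
    second = "type error at runtime"

print(first, "|", second)
```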


Programming learning – Data analysts

Among the tools that data analysts require in their line of work are programming languages. Ideally, programming languages are every programmer’s defense against different types of bugs. This is because they come with characteristics that reduce the chances of writing error-prone code. These attributes include those listed above and are available in different programming languages such as Java, Python, and Scala, which are well suited for data analysts.   



Quickly learn drone programming in 10 minutes
Ebad Ullah Khan
| October 19, 2022

In this blog, we will learn how to program some basic movements in a drone with the help of Python. The drone we will use is the DJI Tello. The Tello can be programmed with Scratch, Swift, and Python; this guide uses Python.  

 A step-by-step guide to learning drone programming

We will go step by step through how to issue commands through the Wi-Fi network 

Drone – Data Science Dojo


Installing Python libraries 

First, we will need some Python libraries installed onto our laptop. Let’s install them with the following two commands: 


pip install djitellopy
pip install opencv-python


djitellopy is a Python library that makes use of the official Tello SDK. The second command installs OpenCV, which lets us view the drone’s camera feed. The program also uses the ‘keyboard’ and ‘time’ libraries. After installation, we import them into our project:


import keyboard as kp
from djitellopy import tello
import time
import cv2




We must first instantiate the Tello class so we can use it afterward. For the following commands to work, we must switch the drone on, then find and connect to the Wi-Fi network it generates from our laptop. The tel.connect() command connects our program to the drone. Once the connection succeeds, the remaining commands can be executed. 

tel = tello.Tello()
tel.connect()



Sending commands to the drone 

We will build a function which will send movement commands to the drone.  

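def getKeyboardInput(img):
    # kp.getKey(name) is assumed to return True while that key is held down
    # Velocities: left/right, forward/backward, up/down, yaw
    lr, fb, ud, yv = 0, 0, 0, 0
    speed = 50

    if kp.getKey("LEFT"):
        lr = -speed
    elif kp.getKey("RIGHT"):
        lr = speed

    if kp.getKey("UP"):
        fb = speed
    elif kp.getKey("DOWN"):
        fb = -speed

    if kp.getKey("w"):
        ud = speed
    elif kp.getKey("s"):
        ud = -speed

    if kp.getKey("a"):
        yv = speed
    elif kp.getKey("d"):
        yv = -speed

    if kp.getKey("l"):
        tel.land()      # land the drone
    if kp.getKey("t"):
        tel.takeoff()   # take off

    if kp.getKey("z"):
        # Save the current frame; the f-string inserts the current timestamp
        cv2.imwrite(f"Resources/images/{time.time()}.jpg", img)

    return [lr, fb, ud, yv]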




The drone takes four inputs to move (left/right, forward/backward, up/down, and yaw velocity), so we first create four variables and set them to 0. We also set speed to an initial value, here 50. Now we map the keyboard keys to our desired values and assign those values to the four variables. For example, if the “LEFT” key is pressed, lr is assigned -speed, that is, -50. If the “RIGHT” key is pressed, lr is assigned 50, and so on. The code block below shows how the keyboard keys are mapped to the variables: 

    if kp.getKey("LEFT"):
        lr = -speed
    elif kp.getKey("RIGHT"):
        lr = speed



This program also takes two extra keys for landing and taking off (l and t). The “z” key is assigned to taking a picture from the drone. As the drone’s video will be on, whenever we press the “z” key, OpenCV will save the image in a folder specified by us. After handling all the key combinations, we return the four values in a list. Also, don’t forget to run tel.streamon() to turn on the video streaming.     
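If you want to check the key mapping without a drone or a keyboard listener, the same logic can be sketched as a pure function. The keys_to_rc function and the KEY_MAP table below are hypothetical helpers written for this illustration and are not part of djitellopy; they take a set of pressed key names and return the four velocities in the same order as above.

```python
# (key name, index into [lr, fb, ud, yv], direction sign)
# The order mirrors the if/elif pairs: the first key of each pair wins.
KEY_MAP = [
    ("LEFT", 0, -1), ("RIGHT", 0, 1),
    ("UP", 1, 1), ("DOWN", 1, -1),
    ("w", 2, 1), ("s", 2, -1),
    ("a", 3, 1), ("d", 3, -1),
]

def keys_to_rc(pressed, speed=50):
    """Map a set of pressed key names to [lr, fb, ud, yv] velocities."""
    rc = [0, 0, 0, 0]
    for key, axis, sign in KEY_MAP:
        # Skip an axis that an earlier key in the table has already set.
        if key in pressed and rc[axis] == 0:
            rc[axis] = sign * speed
    return rc

print(keys_to_rc({"LEFT"}))
print(keys_to_rc({"RIGHT", "w"}, speed=30))
```

Separating the mapping from the hardware calls like this makes it possible to test the control logic on a laptop before flying.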

We must make the drone take commands until and unless we press the “l” key for landing. So, we have a while True loop in the following code segment: 


Calling the function


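while True:
    img = tel.get_frame_read().frame      # latest frame from the video stream
    img = cv2.resize(img, (360, 360))
    vals = getKeyboardInput(img)
    cv2.imshow("Image", img)              # show the frame on the laptop screen
    cv2.waitKey(1)                        # give OpenCV 1 ms to refresh the window
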

The get_frame_read() function reads the video frame by frame (just like an image) so we can resize it and show it on the laptop screen. The process will be so fast that it will completely look like a video being displayed.  

The last thing we must do is to call the function we created above. Remember, we have a list being returned from it. Each value of the list is passed as a separate argument to the send_rc_control method of the tel object, inside the while loop:

    tel.send_rc_control(vals[0], vals[1], vals[2], vals[3])




Before running the code, confirm that the laptop is connected to the drone via Wi-Fi. 

Now, execute the Python file and then press “t” for the drone to take off. From there, you can press the keyboard keys to move it in your desired direction. When you want the drone to take pictures, press “z”, and when you want it to land, press “l”. 




In this blog, we learned how to issue basic keyboard commands for the drone to move. Furthermore, we can also add more keys for inbuilt Tello functions like “flip” and “move away”. Videos can also be captured from the drone and stored locally on our laptop. 

Harikrishna Kundariya
| September 27, 2022

Most people think of JavaScript (JS) as just a web programming language; however, JavaScript and its frameworks have many applications besides web applications. These include mobile applications, desktop applications, backend development, and embedded systems.

Looking around, you might also discover that a growing number of developers are leveraging JavaScript frameworks to build machine learning (ML) applications. JS environments, like NodeJS, are capable of developing and running various machine learning models. 


JavaScript – Programming language – Data Science Dojo


Best NodeJS libraries and tools for machine learning

To help you understand better, let’s discuss some of the best NodeJS libraries and tools for machine learning.


1. BrainJS:

BrainJS is a fast JavaScript library for neural networks and machine learning. Developers can use this library in both NodeJS and the web browser. BrainJS offers various kinds of networks for various tasks. It is fast and easy to use, as it performs computations with the help of the GPU.

If a GPU isn’t available, BrainJS falls back to pure JS and continues computation. It offers several neural network implementations and encourages building these neural nets on the server side with NodeJS. That is a major reason why development agencies use this library for the simple execution of their machine learning projects. 


Pros:

  • BrainJS helps create interesting functionality using fewer code lines and a reliable dataset.
  • The library can also operate on client-side JavaScript.
  • It’s a great library for quick development of a simple NN (Neural Network) wherein you can reap the benefits of accessing the wide variety of open-source libraries. 

Cons:

  • There is not much possibility for a softmax layer or other such structures.
  • It restricts the developer’s network architecture and only allows simple applications. 

Cracking captcha with neural networks is a good example of a machine learning application that uses BrainJS. 


2. TensorflowJS:

TensorflowJS is a hardware-accelerated, open-source, cross-platform library for developing and deploying deep learning and machine learning models. It offers flexible APIs for building models, using either the high-level layers API or the low-level JS linear algebra API. That is what makes TensorflowJS a popular library for JavaScript projects based on ML.

There is an array of guides and tutorials on this library on its official website. It even offers model converters for running pre-existing Tensorflow models under JavaScript or directly in the web browser. The developers also get the option to convert default Tensorflow models into certain Python models.


Pros:

  • TensorflowJS can be implemented on several hardware machines, from computers to cellular devices with complicated setups
  • It offers quick updates, frequent new features, releases, and seamless performance
  • It has a better computational graph visualization

Cons:

  • TensorflowJS does not support Windows OS
  • It has no GPU support besides Nvidia

NodeJS: Pitch Prediction is one of the best use cases for TensorflowJS.


3. Synaptic:

Released under the MIT license, Synaptic is another popular JavaScript-based library for machine learning. It is known for its pre-built structures and generalized, architecture-free algorithm. This makes it convenient for developers to build and train any kind of first-order or second-order neural net architecture.

Developers can use this library even if they don’t know comprehensive details about machine learning techniques and neural networks. Synaptic also helps import and export ML models using JSON format. Besides, it comes with a few interesting pre-defined networks such as multi-layer perceptrons, Hopfield networks, and LSTMs (long short-term memory networks).


Pros:

  • Synaptic can develop recurrent and second-order networks.
  • It features pre-defined networks.
  • There’s documentation available for layers, networks, neurons, architects, and trainers. 

Cons:

  • Synaptic isn’t maintained actively anymore.
  • It has a slow runtime compared to the other libraries. 

Painting a Picture and Solving an XOR are some of the common Synaptic use cases.


4. MLJS:

MLJS is a general-purpose, comprehensive JavaScript machine learning library that makes ML approachable for all target audiences. The library provides access to machine learning models and algorithms in web browsers. However, the developers who want to work with MLJS in the JS environment can add their dependencies. 

MLJS offers mission-critical and straightforward utilities and models for unsupervised and supervised issues. It’s an easy-to-use, open-source library that can handle memory management in ML algorithms and GPU-based mathematical operations. The library supports other routines, too, like hash tables, arrays, statistics, cross-validation, linear algebra, etc. 


Pros:

  • MLJS provides routines for array manipulation, optimization, and algebra
  • It facilitates bit operations on hash tables and arrays, as well as sorting
  • MLJS extends support to cross-validation

Cons:

  • MLJS doesn’t offer default file system access in the host environment of the web browser
  • It has restricted hardware acceleration support

Naïve-Bayes Classification is a good example that uses utilities from the MLJS library.


5. NeuroJS:

NeuroJS is another good JavaScript-based library for developing and training deep learning models, mainly used in creating chatbots and AI technologies. Several developers leverage NeuroJS to create and train ML models and implement them in NodeJS or the web application. 

A major advantage of the NeuroJS library is that it provides support for real-time classification, online learning, and classification of multi-label forms while developing machine learning projects. The simple and performance-driven nature of this library makes machine learning practical and accessible to those using it. 


Pros:

  • NeuroJS offers support for online learning and reinforcement learning
  • It delivers high performance
  • It also supports the classification of multi-label forms

Cons:

  • NeuroJS does not support LSTM or backpropagation through time

A good example of NeuroJS being used along with React can be discovered here.


6. Stdlib:

Stdlib is a large JavaScript-based library used to create advanced mathematical models and ML libraries. Developers can also use this library to conduct graphics and plotting functionalities for data analysis and data visualization.

You can use this library to develop scalable, and modular APIs for other developers and yourself within minutes, sans having to tackle gateways, servers, domains, build SDKs, or write documentation.


Pros:

  • Stdlib offers robust and rigorous statistical and mathematical functions
  • It comes with auto-generated documentation
  • The library offers easy API access control and sharing

Cons:

  • Stdlib doesn’t support developing project builds that don’t feature runtime assertions.
  • It does not support computing inverse hyperbolic secant.

Main, mk-stack, and From the Farmer, are three companies that reportedly use Stdlib in their technology stack.


7. KerasJS:

KerasJS is a well-known JavaScript neural network library used to run deep learning and machine learning models. Models developed with Keras are mostly run in a web browser. They can also be run in NodeJS, but only in CPU mode, without GPU acceleration.

KerasJS is known as a JavaScript counterpart of the Keras AI (Artificial Intelligence) library. Besides, as Keras supports several backend frameworks, it allows you to train the models in TensorFlow, CNTK, and a few other frameworks.


Pros:

  • Using Keras, models can be trained in any backend
  • It can exploit the GPU support offered by the WebGL 3D graphics API
  • The library is capable of running Keras models in programs

Cons:

  • Keras is not that useful if you wish to create your own abstract layer for research purposes
  • In NodeJS, it can only run in CPU mode

A few well-known scientific organizations, like CERN, and NASA, are using this library for their AI-related projects.


Wrapping up:

This article covers the top five NodeJS libraries you can leverage when exploring machine learning. JavaScript may not be that popular in machine learning and deep learning yet; however, the libraries listed in the article prove that it is not behind the times when it comes to progressing in the machine learning space.

Moreover, having the right libraries and tools for machine learning work helps developers build algorithms and solutions that tap the strengths of their machine learning project.

We hope this article helps you learn and use the different libraries listed above in your project. 

Umair Hasan
| September 26, 2022

In this tutorial, you will learn how to create an attractive voice-controlled chatbot application with a small amount of Python code. To build the application, we’ll first create a good-looking user interface with Python’s built-in Tkinter library, and then write a few small functions to do the work.


Here is a sneak peek of what we are going to create. 


Voice controlled chatbot using coding in Python – Data Science Dojo

Before kicking off, I hope you already have a basic idea of web scraping. If not, read the following article on Python web scraping.


PRO-TIP: Join our 5-day instructor-led Python for Data Science training to enhance your deep learning skills.


Pre-requirements for building a voice chatbot

Make sure that you are using Python 3.8+ and the following libraries are installed on it 

  • Pyttsx3 (pyttsx3 is a text-to-speech conversion library in Python) 
  • SpeechRecognition (Library for performing speech recognition) 
  • Requests (The requests module allows you to send HTTP requests using Python) 
  • Bs4 (Beautiful Soup is a library that is used to scrape information from web pages) 
  • pyAudio (With PyAudio, you can easily use Python to play and record audio) 


If you are still facing installation errors or incompatibility errors, then you can try downloading specific versions of the above libraries as they are tested and working currently in the application. 


  • Python 3.10 
  • pyttsx3==2.90 
  • SpeechRecognition==3.8.1 
  • requests==2.28.1
  • beautifulsoup4==4.11.1 


Now that everything is set up, it is time to get started. Open a new .py file and name it VoiceChatbot.py. Import the following libraries at the top of the file.


from tkinter import *
import time
import datetime
import pyttsx3
import speech_recognition as sr
from threading import Thread
import requests
from bs4 import BeautifulSoup


The code is divided into the GUI section, which uses Python’s Tkinter library, and seven functions. We will start by declaring some global variables and initializing instances for text-to-speech and Tkinter. Then we start creating the windows and frames of the user interface.


The user interface 

This part of the code initializes the global variables and instances, loads the images, and then creates a root window that displays different frames. The program starts when the user clicks the first window bearing the background image.


if __name__ == "__main__":

    # Global variables
    loading = None
    query = None
    flag = True
    flag2 = True

    # Initializing text to speech and setting properties
    engine = pyttsx3.init()  # Windows
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    rate = engine.getProperty('rate')
    engine.setProperty('rate', rate - 10)

    # Creating root window; Tk() must run before any PhotoImage is loaded
    # (the Tk() call is not shown in the original listing and is assumed here)
    root = Tk()
    root.title("Intelligent Chatbot")

    # Loading images
    img1 = PhotoImage(file='chatbot-image.png')
    img2 = PhotoImage(file='button-green.png')
    img3 = PhotoImage(file='icon.png')
    img4 = PhotoImage(file='terminal.png')
    front_image = PhotoImage(file="front2.png")


#Placing frame on root window and placing widgets on the frame 

    f = Frame(root,width = 1360, height = 690) 


#first window which acts as a button containing the background image 

    okVar = IntVar() 
    btnOK = Button(f, image=front_image,command=lambda: okVar.set(1)) 
    background_label = Label(root, image=background_image) 
    background_label.place(x=0, y=0) 


#Frame that displays gif image 

    frames = [PhotoImage(file='chatgif.gif',format = 'gif -index %i' %(i)) for i in range(20)] 
    canvas = Canvas(root, width = 800, height = 596) 
    canvas.create_image(0, 0, image=img1, anchor=NW) 


#Question button which calls ‘takecommand’ function 

    question_button = Button(root,image=img2, bd=0, command=takecommand) 


#Right Terminal with vertical scroll 

    canvas2.config(width=500,height=596, background="black") 
    canvas2.create_image(0,0, image=img4, anchor="nw") 
    task = Thread(target=main_window) 


The main window function

This is the first function that is called inside a thread. It first calls the wishme function to greet the user. Then it checks whether the query variable is empty or not. If the query variable is not empty, it checks the contents of the query variable. If the query contains the word shutdown, quit, stop, or goodbye, it calls the shut_down function, and the program exits. Otherwise, it calls the web_scraping function.


def main_window():
    global query
    wishme()
    while True:
        if query is not None:
            if 'shutdown' in query or 'quit' in query or 'stop' in query or 'goodbye' in query:
                shut_down()
                break
            else:
                web_scraping(query)
                query = None


The wish me function 

This function checks the current time and greets users according to the hour of the day and it also updates the canvas. The contents in the text variable are passed to the ‘speak’ function. The ‘transition’ function is also invoked at the same time in order to show the movement effect of the bot image, while the bot is speaking. This synchronization is achieved through threads, which is why these functions are called inside threads. 


def wishme():
    hour = datetime.datetime.now().hour
    if 0 <= hour < 12:
        text = "Good Morning sir. I am Jarvis. How can I Serve you?"
    elif 12 <= hour < 18:
        text = "Good Afternoon sir. I am Jarvis. How can I Serve you?"
    else:
        text = "Good Evening sir. I am Jarvis. How can I Serve you?"
    canvas2.create_text(10,10,anchor =NW , text=text,font=('Candara Light', -25,'bold italic'), fill="white",width=350)
    # Speak and animate at the same time, each in its own thread
    p1 = Thread(target=speak, args=(text,))
    p1.start()
    p2 = Thread(target=transition)
    p2.start()
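
The greeting logic can be checked without a GUI or an audio device by isolating it in a small pure function. The sketch below is only an illustration: greeting_for_hour is a hypothetical helper, not part of the application code, and it reuses the hour boundaries and wording from wishme().

```python
import datetime


def greeting_for_hour(hour):
    """Return the greeting for an hour of the day (0-23).

    Hypothetical helper for illustration; mirrors the branches in wishme().
    """
    if 0 <= hour < 12:
        part_of_day = "Morning"
    elif 12 <= hour < 18:
        part_of_day = "Afternoon"
    else:
        part_of_day = "Evening"
    return f"Good {part_of_day} sir. I am Jarvis. How can I Serve you?"


if __name__ == "__main__":
    # Greet according to the current local time
    print(greeting_for_hour(datetime.datetime.now().hour))
```

Keeping the time check separate from the canvas and speech calls makes the boundary hours (12 and 18) easy to verify.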


The speak function 

This function converts text to speech using pyttsx3 engine. 

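def speak(text):
    global flag
    engine.say(text)
    engine.runAndWait()  # blocks until the speech has finished
    flag = False  # assumed: signals transition() to stop animating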


The transition function

The transition function is used to create the GIF image effect by looping over images and updating them on the canvas. The frames variable contains the frames of the GIF, in order.


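def transition():
    global img1
    global flag
    global flag2
    global frames
    global canvas
    local_flag = False
    for k in range(0,5000):
        for frame in frames:
            if flag == False:
                # flag is False once speaking is over (assumed): show the still image and stop
                canvas.create_image(0, 0, image=img1, anchor=NW)
                flag = True
                return
            else:
                canvas.create_image(0, 0, image=frame, anchor=NW)
                canvas.update()
                time.sleep(0.1)  # frame delay; the value is an assumption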
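
Sharing plain global flags between threads works for a small demo, but the same stop signal can also be expressed with threading.Event from the standard library. The sketch below is an alternative to the flag-based approach above, not the application’s own code: animate, stop_event, and shown are hypothetical names, and appending to a list stands in for drawing a frame on the canvas.

```python
import threading
import time


def animate(frames, stop_event, shown, delay=0.01):
    """Cycle through frames until stop_event is set.

    'shown' is a list standing in for the canvas: each displayed frame
    is appended to it. All names here are illustrative assumptions.
    """
    while not stop_event.is_set():
        for frame in frames:
            if stop_event.is_set():
                break
            shown.append(frame)
            time.sleep(delay)
    # Restore the idle picture once the speaker signals that it is done
    shown.append("idle")


if __name__ == "__main__":
    stop_event = threading.Event()
    shown = []
    worker = threading.Thread(
        target=animate, args=(["f0", "f1", "f2"], stop_event, shown)
    )
    worker.start()
    time.sleep(0.05)   # stands in for the time spent speaking
    stop_event.set()   # tell the animation to stop
    worker.join()
    print(shown[-1])
```

An Event avoids resetting a shared boolean from two threads, and worker.join() gives the caller a clear point at which the animation has stopped.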


The web scraping function 

This function is the heart of this application. The question asked by the user is searched on Google using Python’s ‘requests’ library. The ‘BeautifulSoup’ library parses the HTML content of the page and checks for answers in four particular divs. If the webpage does not contain any of the four divs, the function looks for an answer in a Wikipedia link from the results. If that is also unsuccessful, the bot apologizes.


def web_scraping(qs): 
    global flag2 
    global loading 
    URL = 'https://www.google.com/search?q=' + qs 
    page = requests.get(URL) 
    soup = BeautifulSoup(page.content, 'html.parser') 
    div0 = soup.find_all('div',class_="kvKEAb") 
    div1 = soup.find_all("div", class_="Ap5OSd") 
    div2 = soup.find_all("div", class_="nGphre") 
    div3  = soup.find_all("div", class_="BNeawe iBp4i AP7Wnd") 

    links = soup.find_all("a")
    all_links = []
    for link in links:
        link_href = link.get('href')
        # Skip anchors without an href and keep only Google redirect links
        if link_href and "url?q=" in link_href and "webcache" not in link_href:
            # Strip the redirect wrapper to recover the target URL
            # (assumes the '/url?q=<target>&sa=U...' link format)
            all_links.append(link_href.split("url?q=")[1].split("&sa=U")[0])

    flag = False
    for link in all_links:
        if 'https://en.wikipedia.org/wiki/' in link:
            wiki = link
            flag = True
            break
    if len(div0) != 0:
        answer = div0[0].text
    elif len(div1) != 0:
        answer = div1[0].text + "\n" + div1[0].find_next_sibling("div").text
    elif len(div2) != 0:
        answer = div2[0].find_next("span").text + "\n" + div2[0].find_next("div", class_="kCrYT").text
    elif len(div3) > 1:
        answer = div3[1].text
    elif flag == True:
        page2 = requests.get(wiki)
        soup = BeautifulSoup(page2.text, 'html.parser')
        title = soup.select("#firstHeading")[0].text
        paragraphs = soup.select("p")
        answer = title
        # Use the first non-empty paragraph of the article
        for para in paragraphs:
            if bool(para.text.strip()):
                answer = title + "\n" + para.text
                break
    else:
        answer = "Sorry. I could not find the desired results"
    canvas2.create_text(10, 225, anchor=NW, text=answer, font=('Candara Light', -25,'bold italic'),fill="white", width=350)
    flag2 = False
    # Speak the answer and animate the bot at the same time
    p1 = Thread(target=speak, args=(answer,))
    p1.start()
    p2 = Thread(target=transition)
    p2.start()
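
The result links in the page returned by Google are redirects, so the target address has to be unwrapped before it can be requested. The sketch below shows that step with only the standard library. extract_target is a hypothetical helper, and the redirect format (/url?q=<target>&sa=U...) is an assumption based on the ‘url?q=’ check in the code above; Google may change it at any time.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the URL wrapped in a '/url?q=...' redirect link, or None.

    Hypothetical helper; assumes the target is carried in the 'q'
    query parameter, as the 'url?q=' test in web_scraping() implies.
    """
    if not href or "url?q=" not in href or "webcache" in href:
        return None
    query = parse_qs(urlparse(href).query)
    targets = query.get("q")
    return targets[0] if targets else None


if __name__ == "__main__":
    sample = "/url?q=https://en.wikipedia.org/wiki/Alan_Turing&sa=U&ved=abc"
    print(extract_target(sample))
```

Parsing the query string instead of splitting on fixed substrings also decodes percent-encoded characters in the target address.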


The take command function 

This function is invoked when the user clicks the green button to ask a question. The speech recognition library listens for up to 5 seconds and converts the audio input to text using Google’s speech recognition API.


def takecommand():
    global loading
    global flag
    global flag2
    global canvas2
    global query
    global img3
    global img4
    if flag2 == False:
        canvas2.create_image(0,0, image=img4, anchor="nw")
    speak("I am listening.")
    flag = True
    r = sr.Recognizer()
    r.dynamic_energy_threshold = True
    r.dynamic_energy_adjustment_ratio = 1.5
    #r.energy_threshold = 4000
    try:
        with sr.Microphone() as source:
            #r.pause_threshold = 1
            audio = r.listen(source, timeout=5, phrase_time_limit=5)
            #audio = r.listen(source)
        query = r.recognize_google(audio, language='en-in')
        print(f"user Said :{query}\n")
        query = query.lower()
        canvas2.create_text(490, 120, anchor=NE, justify = RIGHT ,text=query, font=('fixedsys', -30),fill="white", width=350)
        loading = Label(root, image=img3, bd=0)
        loading.place(x=900, y=622)
    except Exception as e:
        # Covers listen timeouts as well as recognition failures
        speak("Say that again please")
        return "None"


The shutdown function 

This function says goodbye to the user and destroys the root window in order to exit the program.

def shut_down():
    p1 = Thread(target=speak, args=("Shutting down. Thank you for using our service. Take care, goodbye.",))
    p1.start()
    Thread(target=transition).start()
    p1.join()  # let the farewell finish before closing the window
    root.destroy()



It is time to wrap up. I hope you enjoyed our little application. This is the power of Python: you can create small, attractive applications in no time with very little code. Keep following us for more cool Python projects!




Hands-on deep learning using Python in Cloud
Ali Mohsin
| August 3, 2022

Data Science Dojo has launched its Jupyter Hub for Deep Learning using Python offer on the Azure Marketplace. It comes with pre-installed deep learning libraries and pre-cloned GitHub repositories of well-known deep learning books and collections, so learners can run the example code provided.

What is Deep Learning?

Deep learning is a subfield of machine learning and artificial intelligence (AI) that mimics how people gain specific types of knowledge. Deep learning algorithms are incredibly complex and the structure of these algorithms, where each neuron is connected to the other and transmits information, is quite similar to that of the nervous system.

Also, there are different types of neural networks to address specific problems or datasets, for example, Convolutional neural networks (CNNs) and Recurrent neural networks (RNNs).

Deep learning is a key component of data science, a field that also encompasses statistics and predictive modeling. It makes the work quicker and easier for data scientists who are tasked with gathering, processing, and interpreting vast amounts of data.

Deep Learning using Python

Python, a high-level programming language first released in 1991, has seen a steady rise in popularity, and its good fit with deep learning has contributed to that growth. While several languages, including C++, Java, and LISP, can be used for deep learning, Python continues to be the preferred option for millions of developers worldwide.

Additionally, data is the essential component in all deep learning algorithms and applications, both as training data and as input. Because Python is widely used for data management, processing, and forecasting, it is a great tool for handling the large volumes of data needed to train a deep learning system, feed it input, or make sense of its output.

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your deep learning skills.


Challenges for individuals

Individuals who want to move from Machine Learning to Deep Learning usually lack the resources to gain hands-on experience with it. A beginner in Deep Learning also faces compatibility issues while installing libraries.

What we provide

Jupyter Hub for Deep Learning using Python solves these challenges by providing an effortless coding environment in the cloud with pre-installed Deep Learning Python libraries. This removes the burden of installation and maintenance and resolves the compatibility issues for an individual.

Moreover, this offer provides the user with repositories of famous authors and books on Deep Learning which contain chapter-wise notebooks with some exercises which serve as a learning resource for a user in gaining hands-on experience with Deep Learning.

The heavy computations required for Deep Learning applications are not performed on the user’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.

Listed below are the pre-installed python libraries related to Deep learning and the sources of repositories of Deep Learning books provided by this offer:

Python libraries:

  • NumPy
  • Matplotlib
  • Pandas
  • Seaborn
  • TensorFlow
  • Tflearn
  • PyTorch
  • Keras
  • Scikit Learn
  • Lasagne
  • Leather
  • Theano
  • D2L
  • OpenCV


  • GitHub repository of book Deep Learning with Python 2nd Edition, by author François Chollet.
  • GitHub repository of book Hands-on Deep Learning Algorithms with Python, by author Sudharsan Ravichandran.
  • GitHub repository of book Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, by author Aurélien Géron.
  • GitHub repository of collection on Deep Learning Models, by author Sebastian Raschka.


Jupyter Hub for Deep Learning using Python provides an in-browser coding environment with just a single click, hence providing ease of installation. Through this offer, a user can work on a variety of Deep Learning applications, such as self-driving cars, healthcare, fraud detection, language translation, auto-completion of sentences, photo descriptions, image coloring and captioning, and object detection and localization.

This Jupyter Hub for Deep Learning instance is ideal to learn more about Deep Learning without the need to worry about configurations and computing resources.

The heavy resource requirement to deal with large datasets and perform the extensive model training and analysis for these applications is no longer an issue as heavy computations are now performed on Microsoft Azure which increases processing speed.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data.

We are therefore adding a free Jupyter Notebook Environment dedicated specifically to Deep Learning using Python. Install the Jupyter Hub offer now from the Azure Marketplace, your ideal companion in your journey to learn data science!

Try Now!

Hands-on ethical web scraping using Python
Syed Saad Peerzada
| August 4, 2022

What is web scraping?

Web scraping is the act of extracting the content and data from a website. The vast amount of data available on the internet is not open and available to download. As a result, ethical web scraping is the most effective technique to collect this data. There is also a debate about the legality of web scraping as the content may get stolen or the website can crash as a result of web scraping.

Ethical Web Scraping is the act of harvesting data legally by following ethical rules about web scraping. There are certain rules in ethical web scraping that when followed ensure trust between the website owner and web scraper.

Web scraping using Python

In Python, a learner can write a small piece of code to do large tasks. Since web scraping is used to save time, a short Python script can save a lot of it. Also, Python is simple and easy to understand, and it provides an extensive set of libraries for web scraping and for further manipulation of the extracted data.

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your web scraping skills.

Challenges for individuals

Individuals who are new to web scraping and wish to flourish in their field usually lack the necessary computing and learning resources to obtain hands-on expertise. Also, they may face compatibility issues when installing libraries.

What we provide

With just a single click, Jupyter Hub for Ethical Web Scraping using Python comes with pre-installed Web Scraping python libraries, which gives the learner an effortless coding environment in the Azure cloud and reduces the burden of installation. Moreover, this offer provides the learner with a repository of the famous book on web scraping which contains chapter-wise notebooks which serve as a learning resource for a user in gaining hands-on experience with web scraping.

Through this offer, a learner can collect data from various sources legally by following the best practices for ethical web scraping described in a later section of this blog. Once the data is collected, it can be further analyzed to get valuable insights, while all the heavy computations are performed on Microsoft Azure, saving the user the trouble of running heavy computations on the local machine.

Python libraries:

Listed below are the pre-installed web scraping Python libraries and the source repository of the web scraping book provided by this offer:

  • Pandas
  • NumPy
  • Scikit-learn
  • Beautifulsoup4
  • lxml
  • MechanicalSoup
  • Requests
  • Scrapy
  • Selenium
  • urllib


  • GitHub repository of book Web Scraping with Python 2nd Edition, by author Ryan Mitchell.

Best practices for ethical web scraping

Globally, there is a debate about whether web scraping is ethical or not. The main concern is that when a website is queried repeatedly by the same user (in this case, a bot), too many requests land on the server simultaneously, and all of the server’s resources may be consumed in generating responses for each request, preventing it from responding to other legitimate users.

In this way, the server denies responses to any further users, commonly known as a Denial of Service (DoS) attack.

Below are the best practices for ethical web scraping, and compliance with these will allow a web scraper to work ethically.

1.   Check robots.txt

The robots.txt file, which implements the Robots Exclusion Standard, tells web scrapers whether the website can be crawled and, if so, how to index it. A legitimate web scraper is expected to respect the instructions in this file and stay within what the website owner allows.
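
Python’s standard library can read these rules through urllib.robotparser. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules, the bot name, and the example.com URLs are assumptions for illustration. In a real scraper, the parser’s set_url() and read() methods would download the site’s actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Example rules, parsed from a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    agent = "example-research-bot"  # hypothetical scraper name
    # Ask before fetching: is this URL allowed for our user agent?
    print(parser.can_fetch(agent, "https://example.com/articles/1"))
    print(parser.can_fetch(agent, "https://example.com/private/data"))
    # Requested pause between requests, if the site states one
    print(parser.crawl_delay(agent))
```

Calling can_fetch() before every request keeps the scraper inside the paths the owner has opened up.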

2.   Check for website APIs

An ethical web scraper is expected to first look for a public API of the website in question instead of scraping it altogether. Many website owners provide public API access, which can be used by anyone looking to benefit from the information available on the website. A public API works in the best interests of both the ethical scraper and the website owner, because it avoids web scraping altogether.

3.   Avoid repeated requests

Vigorous scraping can occasionally cause functionality issues, resulting in a poor experience for human users. It is therefore advisable to scrape during off-peak hours. An ethical web scraper is also expected to space out recurrent requests so that the scraping does not amount to a DoS attack.
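
One simple way to delay recurrent requests is to enforce a minimum pause between them. The sketch below is an illustration under stated assumptions: PoliteThrottle is a hypothetical helper, the 0.2 second interval is an arbitrary example value, and the print call stands in for the actual request.

```python
import time


class PoliteThrottle:
    """Enforce a minimum pause between consecutive requests.

    Hypothetical helper for illustration. 'clock' and 'sleep' can be
    swapped out, which makes the behaviour easy to test.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()


if __name__ == "__main__":
    throttle = PoliteThrottle(min_interval=0.2)  # interval value is an assumption
    for page in range(3):
        throttle.wait()
        print("fetching page", page)  # a real scraper would send its request here
```

If the website’s robots.txt states a crawl delay, that value is a natural choice for min_interval.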

4.   Provide your identity

It is always a good idea to take responsibility for one’s actions. An ethical web scraper never hides his or her identity and provides it in a user-agent string. Not only does this make the intentions of the scraper clear but also provides a means of contact for any questions or concerns of the website owner.
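
As a minimal sketch of this practice, the request below carries a descriptive User-Agent string and a contact address. It uses urllib.request from the standard library; build_request, the bot name, and the e-mail address are placeholders invented for the example, and nothing is sent over the network.

```python
from urllib.request import Request


def build_request(url, bot_name, contact):
    """Create a request that states who is scraping and how to reach them.

    The bot name and contact address are placeholders to be replaced
    with real details; nothing is sent until the request is opened.
    """
    headers = {
        "User-Agent": f"{bot_name} (contact: {contact})",
        "From": contact,
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    request = build_request(
        "https://example.com/articles",
        bot_name="example-research-bot/1.0",
        contact="scraper-admin@example.com",
    )
    # urllib stores header names capitalized, hence "User-agent"
    print(request.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen() would send the request with these headers attached.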

5.   Avoid fake ownership

Scraped content should always be respected and never passed off with the scraper falsely named as its author. Doing so is highly unethical and may also be illegal, since the website owner may file a copyright claim. It also damages the reputation of genuine web scrapers and hurts the trust of the website owner.

6.  Ask for permission

Since the website information belongs to the owner, one should never presume it to be free and should ask politely before using it. An ethical web scraper always seeks permission from the website owner to avoid any future problems. The website owner should be given the choice of whether to allow the data to be scraped.

7.  Give due credit

As a token of thanks to the website owner, the web scraper should give due credit wherever possible. This can be done in many ways, such as linking to the original website in any blog, article, or social media post, which also generates traffic for the original website.

Ethical web scraping


Ethical web scraping is a two-way street: the website owner should be mindful of the global availability of the data, and the scraper should not harm the website in any way and should first seek permission from the website owner. If a web scraper abides by the above-mentioned practices, i.e., works ethically, the website owner may not only allow scraping of the website but also provide helpful means to the scraper in the form of metadata or a public API.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for Ethical Web Scraping using Python. Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!


R programming has an extremely vast package ecosystem. It provides robust tools to master all the core skill sets of data science.

For someone like me, who has only some programming experience in Python, the syntax of R programming felt alienating at first. However, I believe it’s just a matter of time before you adapt to the unique logic of a new language. The grammar of R flows more naturally to me after practicing for a while. I began to grasp its remarkable beauty, a beauty that has captivated the hearts of countless statisticians throughout the years.

If you don’t know what R programming is, it’s essentially a programming language created for statisticians by statisticians. Hence, it easily becomes one of the most fluid and powerful tools in the field of data science.

Here I’d like to walk through my study notes with the most explicit step-by-step directions to introduce you to the world of R.

Why learn R for data science?

Before diving in, you might want to know why you should learn R for Data Science. There are two major reasons:

1. Powerful analytic packages for data science

Firstly, R programming has an extremely vast package ecosystem. It provides robust tools to master all the core skill sets of Data Science, from data manipulation and data visualization to machine learning. The vibrant community keeps the R language’s functionalities growing and improving.

2. High industry popularity and demand

With its great analytical power, R programming is becoming the lingua franca for data science. It is widely used in the industry and is in heavy use at several of the best companies that are hiring Data Scientists including Google and Facebook. It is one of the highly sought-after skills for a Data Science job.

You can also learn Python for data science.

Quickstart installation guide

To start programming with R on your computer, you need two things: R and RStudio.

Install R language

You have to first install the R language itself on your computer (it doesn’t come by default). To download R, go to CRAN (the Comprehensive R Archive Network) at https://cloud.r-project.org/. Choose your system and select the latest version to install.

Install RStudio

You also need a capable tool to write and run R code. RStudio is the most robust and popular IDE (integrated development environment) for R. It is available at http://www.rstudio.com/download (open source and free!).

Overview of RStudio

Now you have everything ready. Let’s take a brief look at RStudio. Fire up RStudio; the interface looks like this:



Go to File > New File > R Script to open a new script file. You’ll see a new section appear at the top left side of your interface. A typical RStudio workspace is composed of the 4 panels you’re seeing right now:


RStudio interface

Here’s a brief explanation of the use of the 4 panels in the RStudio interface:

Script panel (top left): this is where your main R script is located.

Console panel (beneath the script): this area shows the output of the code you run from the script. You can also write code directly in the console.

Environment panel (upper right): this space displays the set of external elements added, including datasets, variables, vectors, functions, etc.

Output panel: this space displays the graphs created during exploratory data analysis. You can also look up R’s built-in documentation here.

Running R code

After getting to know your IDE, the first thing you want to do is write some code.

Using the console panel

You can use the console panel directly to write your code. Hit Enter and the output of your code will be returned and displayed immediately. However, code entered in the console cannot be traced later (i.e., you can’t save it). This is where the script comes into use. Still, the console is good for a quick experiment before formatting your code in the script.

Using the script panel

To write proper R code, you start with a new script by going to File > New File > R Script, or by hitting Shift + Ctrl + N. You can then write your code in the script panel. Select the line(s) to run and press Ctrl + Enter. The output will be shown in the console section beneath. You can also click on the little Run button located at the top right corner of this panel. Code written in a script can be saved for later review (File > Save or Ctrl + S).


Basics of R programming

Finally, with everything set up, you can write your first piece of R script. The following paragraphs introduce you to the basics of R.

A quick tip before going on: everything after the symbol # on a line is treated as a comment and will not be executed.


Let’s start with some basic arithmetic. You can do some simple calculations with the arithmetic operators:


Arithmetic operators


Addition +, subtraction -, multiplication *, division / should be intuitive.

# Addition
1 + 1
#[1] 2

# Subtraction
2 - 2
#[1] 0

# Multiplication
3 * 2
#[1] 6

# Division
4 / 2
#[1] 2

The exponentiation operator ^ raises the number to its left to the power of the number to its right: for example 3 ^ 2 is 9.

# Exponentiation
2 ^ 4
#[1] 16

The modulo operator %% returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or  5 %% 3 is 2.

# Modulo
5 %% 2
#[1] 1

Lastly, the integer division operator %/% returns the maximum times the number on the left can be divided by the number on its right, the fractional part is discarded, for example, 9 %/% 4 is 2.

# Integer division
5 %/% 2
#[1] 2

You can also add brackets () to change the order of operations. The order of operations is the same as in mathematics (from highest to lowest precedence):

  • Brackets
  • Exponentiation
  • Division and multiplication (equal precedence, evaluated left to right)
  • Addition and subtraction (equal precedence, evaluated left to right)

# Brackets
(3 + 5) * 2
#[1] 16

Variable assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

Create new variables

Create a new object with the assignment operator <-. All R statements that create objects (assignment statements) have the same form: object_name <- value.

num_var <- 10

chr_var <- "Ten"

To access the value of the variable, simply type the name of the variable in the console.

num_var
#[1] 10

chr_var
#[1] "Ten"

You can access the value of the variable anywhere you call it in the R script, and perform further operations on them.

first_var <- 1
second_var <- 2

first_var + second_var
#[1] 3

sum_var <- first_var + second_var
sum_var
#[1] 3

Naming variables

Not all kinds of names are accepted in R programming. Variable names must start with a letter, and can only contain letters, numbers, . and _. Also, bear in mind that R is case-sensitive, i.e. Cat would not be identical to cat.

Your object names should be descriptive, so you’ll need a convention for multiple words. It is recommended to use snake case, where you separate lowercase words with _.


Assignment operators

If you’ve been programming in other languages before, you’ll notice that the assignment operator in R programming is quite strange. It uses <- instead of the commonly used equal sign = to assign objects.

Indeed, using = will still work in R, but it will cause confusion later. So you should always follow the convention and use <- for assignment.

<- is a pain to type as you’ll have to make lots of assignments. To make life easier, you should remember RStudio’s awesome keyboard shortcut Alt + - (the minus sign) and incorporate it into your regular workflow.


Look at the environment panel in the upper right corner, you’ll find all of the objects that you’ve created.


environment panel - R programming


Basic data types

You’ll work with numerous data types in R. Here are some of the most basic ones:


Data type in R programming

Knowing the data type of an object is important, as different data types work with different functions, and you perform different operations on them. For example, adding a numeric and a character together will throw an error.

To check an object’s data type, you can use the class() function.

# usage
class(x)

# description
Prints the vector of names of classes an object inherits from.

# arguments
x : An R object.

Here is an example:

int_var <- 10
class(int_var)
#[1] "numeric"

dbl_var <- 10.11
class(dbl_var)
#[1] "numeric"

lgl_var <- TRUE
class(lgl_var)
#[1] "logical"

chr_var <- "Hello"
class(chr_var)
#[1] "character"


Functions are the fundamental building blocks of R. In programming, a named section of a program that performs a specific task is a function. In this sense, a function is a type of procedure or routine.

R comes with a prewritten set of functions that are kept in a library. (class() as demonstrated in the previous section is a built-in function.) You can use additional functions in other libraries by installing packages. You can also write your own functions to perform specialized tasks.

Here is the typical form of an R function:

function_name(arg1 = val1, arg2 = val2, ...)

function_name is the name of the function. arg1 and arg2 are arguments. They’re variables to be passed into the function. The type and number of arguments depend on the definition of the function. val1 and val2 are the values of the corresponding arguments.

Passing arguments

R can match arguments both by position and by name. So you don’t necessarily have to supply the names of the arguments if you place the arguments in the correct positions.

class(x = 1)
#[1] "numeric"

class(1)
#[1] "numeric"

Functions are always accompanied with loads of arguments for configurations. However, you don’t have to supply all of the arguments for a function to work.

Here is documentation of the sum() function.

# usage
sum(..., na.rm = FALSE)

# description
Returns the sum of all the values present in its arguments.

# arguments
... : Numeric or complex or logical vectors.
na.rm : Logical. Should missing values (including NaN) be removed?

From the documentation, we learn that there are two arguments for the sum() function: ... and na.rm. Notice that na.rm has a default value of FALSE. This makes it an optional argument. If you don’t supply a value for an optional argument, the function will automatically use the default value.

sum(2, 10)
#[1] 12

sum(2, 10, NaN)
#[1] NaN

sum(2, 10, NaN, na.rm = TRUE)
#[1] 12

Getting help

There is a large collection of  functions in R and you’ll never remember all of them. Hence, knowing how to get help is important.

RStudio has a handy tool ? to help you in recalling the use of the functions:


Look how magical it is to show the R documentation directly at the output panel for quick reference.


output panel


Last but not least, if you get stuck, Google it! For beginners like us, the problems we run into have almost certainly been encountered by numerous R learners before, and there will always be something helpful and insightful on the web.

Contributors: Cecilia Lee

Cecilia Lee is a junior data scientist based in Hong Kong

Guest Writer
| July 22, 2018

The dplyr package in R is a powerful tool to do data munging and data manipulation, perhaps more so than many people would initially realize, making it extremely useful in data science.
Shortly after I embarked on the data science journey earlier this year, I came to increasingly appreciate the handy utilities of dplyr, particularly the mighty combo functions of group_by() and summarize(). Below, I will go through the first project I completed as a budding data scientist using the package along with ggplot. I will demonstrate some convenient features of both.

I obtained my dataset from Kaggle. It has 150,930 observations containing wine ratings from across the world. The data had been scraped from Wine Enthusiast during the week of June 15th, 2017. Right off the bat, we should recognize one caveat when deriving any insight from this data: the magazine only posted reviews on wines receiving a grade of 80 or more (out of 100).

As a best practice, any data analysis should be done with limitations and constraints of the data in mind. The analyst should bear in mind the conclusions he or she draws from the data will be impacted by the inherent limitations in breadth and depth of the data at hand.

After reading the dataset in RStudio and naming it “wine,” we’ll get started by installing and loading the packages.

Install and load packages (dplyr, ggplot)

# Please do install.packages() for these two libraries if you don't have them
library(dplyr)
library(ggplot2)


Data preparation

First, we want to clean the data. As I will leave textual data out of this analysis and not touch on NLP techniques in this post, I will drop the “description” column using the select() function from dplyr, which lets us select columns by name. As you’ve probably guessed, the minus sign in front of it indicates we want to exclude this column.

As select() is a non-mutating function, don’t forget to reassign the data frame to overwrite it (or you could create a new name for the new data frame if you want to keep the original one for reference). A convenient way to pass functions with dplyr is the pipe operator, %>%, which allows us to call multiple functions on an object sequentially and will take the immediately preceding output as the object of each function.

wine = wine %>% select(-c(description))

There is quite a range of producer countries in the list, and I want to find out which countries are most represented in the dataset. This is the first instance where we encounter one of my favorite uses of R: the group-by aggregation using “group_by” followed by “summarize”:

wine %>% group_by(country) %>% summarize(count=n()) %>% arrange(desc(count))
## # A tibble: 49 x 2
##    country     count
##  1 US          62397
##  2 Italy       23478
##  3 France      21098
##  4 Spain        8268
##  5 Chile        5816
##  6 Argentina    5631
##  7 Portugal     5322
##  8 Australia    4957
##  9 New Zealand  3320
## 10 Austria      3057
## # ... with 39 more rows

We want to focus only on the top producers; say we want to select only the top ten countries. We’ll again turn to the powerful group_by() and summarize() functions for group-by aggregation, followed by another select() command to choose the column we want from the newly created data frame.

Note that after the group-by aggregation, we only retain the relevant portion of the original data frame. In this case, since we grouped by country and summarized the count per country, the result will only be a two-column data frame consisting of “country” and the newly named variable “count.” All other variables in the original set, such as “designation” and “points,” were removed.

Furthermore, the new data frame only has as many rows as there were unique values in the variable grouped by, which in our case is “country.” There were 49 unique countries in this column when we started out, so this new data frame has 49 rows and 2 columns. From there, we use arrange() to sort the entries by count. Passing desc(count) as an argument ensures we’re sorting from the largest to the smallest value, as the default is the opposite.

The next step, top_n(10), selects the top ten producers. Finally, select() retains only the “country” column, and our final object “selected_countries” becomes a one-column data frame. We transform it into a character vector using as.character(), as it will come in handy later on.

selected_countries = wine %>% group_by(country) %>% summarize(count=n()) %>% arrange(desc(count)) %>% top_n(10) %>% select(country)
selected_countries = as.character(selected_countries$country)

So far we’ve already learned one of the most powerful tools from dplyr, group-by aggregation, and a method to select columns. Now we’ll see how we can select rows.

# creating a country and points data frame containing only the 10 selected countries' data
select_points = wine %>% filter(country %in% selected_countries) %>% select(country, points) %>% arrange(country)

In the above code, filter(country %in% selected_countries) ensures we’re only selecting rows where the “country” variable has a value that’s in the “selected_countries” vector we created just a moment ago. After subsetting these rows, we use select() to keep the two columns we want and arrange() to sort the values. Note that the argument passed into the latter makes the sort key explicit: we’re sorting by the “country” variable rather than by “points.”

Data exploration and visualization

At a high level, we want to know if higher-priced wines are really better, or at least as judged by Wine Enthusiast. To achieve this goal we create a scatterplot of “points” and “price” and add a smoothed line to see the general trajectory.

ggplot(wine, aes(points,price)) + geom_point() + geom_smooth()

Data exploration of Wine enthusiasts

It seems overall expensive wines tend to be rated higher, and the most expensive wines tend to be among the highest-rated as well.

Let’s further explore possible visualizations with ggplot, and create a panel of boxplots sorted by the national median points received. Passing x=reorder(country,points,median) creates a reordered vector for the x-axis, ranked by the median “points” value by country. aes(fill=country) fills each boxplot with a distinct color for the country represented. xlab() and ylab() give labels to the axes, and ggtitle() gives the whole plot a title.

Finally, passing element_text(hjust = 0.5) to the theme() function centers the plot title horizontally, as “hjust” controls the horizontal justification of the text on the graph.

ggplot(select_points, aes(x=reorder(country,points,median),y=points)) + geom_boxplot(aes(fill=country)) + xlab("Country") +
  ylab("Points") + ggtitle("Distribution of Top 10 Wine Producing Countries") + theme(plot.title = element_text(hjust = 0.5))

Distribution | Data Science Dojo
When we ask the question “which countries may be hidden dream destinations for an oenophile?” we can subset rows of countries that aren’t in the top ten producer list. When we pass a new parameter into summarize() and assign it a new value based on a function of another variable, we create a new feature – “median” in our case. Using arrange(desc()) ensures we’re sorting by descending order of this new feature.

As we grouped by country and created one new variable, we end up with a new data frame containing two columns and however many rows there were that had values for “country” not listed in “selected_countries.”

wine %>% filter(!(country %in% selected_countries)) %>% group_by(country) %>% summarize(median=median(points)) %>%
  arrange(desc(median))

## # A tibble: 39 x 2
## country median
## 1 England 94.0
## 2 India 89.5
## 3 Germany 89.0
## 4 Slovenia 89.0
## 5 Canada 88.5
## 6 Morocco 88.5
## 7 Albania 88.0
## 8 Serbia 88.0
## 9 Switzerland 88.0
## 10 Turkey 88.0
## # ... with 29 more rows

We find England, India, Germany, Slovenia, and Canada as top-quality producers, despite not being the most prolific ones. If you’re an oenophile like me, this may shed light on some ideas for hidden treasures when we think about where to find our next favorite wines. Beyond the usual suspects like France and Italy, maybe our next bottle will come from Slovenia or even India.

Which countries produce a large quantity of wine but also offer high-quality wines? We’ll create a new data frame called “top” that contains the countries with the highest median “points” values. Using the intersect() function and subsetting the observations that appear in both the “selected_countries” and “top” data frames, we can find out the answer to that question.

top = wine %>% group_by(country) %>% summarize(median=median(points)) %>% arrange(desc(median))
top = as.character(top$country)
intersect(top, selected_countries)
##  [1] "Austria"     "France"      "Australia"   "Italy"       "Portugal"
##  [6] "US"          "New Zealand" "Spain"       "Argentina"   "Chile"

We see there are ten countries that appear in both lists. These are the real deal: they are highly represented not merely because of mass production. Note that we transformed “top” from a data frame structure to a vector one, just like we had done for “selected_countries,” prior to intersecting the two.

Next, let’s turn from the country to the grape, and find the top ten most represented grape varietals in this set:

topwine = wine %>% group_by(variety) %>% summarize(number=n()) %>% arrange(desc(number)) %>% top_n(10)
topwine = as.character(topwine$variety)
topwine
##  [1] "Chardonnay"               "Pinot Noir"
##  [3] "Cabernet Sauvignon"       "Red Blend"
##  [5] "Bordeaux-style Red Blend" "Sauvignon Blanc"
##  [7] "Syrah"                    "Riesling"
##  [9] "Merlot"                   "Zinfandel"

The pipe operator doesn’t work just with dplyr functions. Below we’ll examine graphs with ggplot functions that work seamlessly with dplyr syntax.

wine %>% filter(variety %in% topwine) %>% group_by(variety) %>% summarize(median=median(points)) %>% ggplot(aes(reorder(variety,median),median)) +
  geom_col(aes(fill=variety)) + xlab('Variety') + ylab('Median Point') + scale_x_discrete(labels=abbreviate)

dplyr functions with ggplot

Finally, we’d be interested in learning which wines provide the best value, meaning priced toward the bottom rung but ranked in the top rung:

top15percent=wine %>% arrange(desc(points)) %>% filter(points > quantile(points, prob = 0.85))
cheapest15percent=wine %>% arrange(price) %>% head(nrow(top15percent))
goodvalue = intersect(top15percent,cheapest15percent)
## 2  Portugal Picos do Couto Reserva     92    11     Dão
## 3        US                            92    11       Washington
## 4        US                            92    11       Washington
## 5    France                            92    12         Bordeaux
## 6        US                            92    12           Oregon
## 7    France        Aydie l'Origine     93    12 Southwest France
## 8        US       Moscato d'Andrea     92    12       California
## 9        US                            92    12       California
## 10       US                            93    12       Washington
## 11    Italy             Villachigi     92    13          Tuscany
## 12 Portugal            Dona Sophia     92    13             Tejo
## 13   France       Château Labrande     92    13 Southwest France
## 14 Portugal              Alvarinho     92    13            Minho
## 15  Austria                  Andau     92    13       Burgenland
## 16 Portugal             Grand'Arte     92    13           Lisboa
##                region_1          region_2                  variety
## 1                                                   Portuguese Red
## 2                                                   Portuguese Red
## 3  Columbia Valley (WA)   Columbia Valley                 Riesling
## 4  Columbia Valley (WA)   Columbia Valley                 Riesling
## 5            Haut-Médoc                   Bordeaux-style Red Blend
## 6     Willamette Valley Willamette Valley               Pinot Gris
## 7               Madiran                      Tannat-Cabernet Franc
## 8           Napa Valley              Napa           Muscat Canelli
## 9           Napa Valley              Napa          Sauvignon Blanc
## 10 Columbia Valley (WA)   Columbia Valley    Johannisberg Riesling
## 11              Chianti                                 Sangiovese
## 12                                                  Portuguese Red
## 13               Cahors                                     Malbec
## 14                                                       Alvarinho
## 15                                                        Zweigelt
## 16                                                Touriga Nacional
##                       winery
## 1              Pedra Cancela
## 2          Quinta do Serrado
## 3                Pacific Rim
## 4                   Bridgman
## 5  Château Devise d'Ardilley
## 6                      Lujon
## 7            Château d'Aydie
## 8              Robert Pecota
## 9               Honker Blanc
## 10             J. Bookwalter
## 11            Chigi Saracini
## 12    Quinta do Casal Branco
## 13           Jean-Luc Baldès
## 14                   Aveleda
## 15              Scheiblhofer
## 16                DFJ Vinhos

Now that you’ve learned some handy tools you can use with dplyr, I hope you can go off into the world and explore something of interest to you. Feel free to make a comment below and share what other dplyr features you find helpful or interesting.


Contributor: Ningxi Xu

Ningxi holds a MS in Finance with honors from Georgetown McDonough School of Business, and graduated magna cum laude with a BA from the George Washington University.

Graphs play a very important role in the data science workflow. Learn how to create dynamic professional-looking plots with Plotly.py.

We use plots to understand the distribution and nature of variables in the data and use visualizations to describe our findings in reports or presentations to both colleagues and clients. The importance of plotting in a data scientist’s work cannot be overstated.

Learn more about visualizing your data at Data Science Dojo’s Introduction to Python for Data Science!

Plotting with Matplotlib

If you have worked on any kind of data analysis problem in Python you will probably have encountered matplotlib, the default (sort of) plotting library. I personally have a love-hate relationship with it — the simplest plots require quite a bit of extra code but the library does offer flexibility once you get used to its quirks. The library is also used by pandas for its built-in plotting feature. So even if you haven’t heard of matplotlib, if you’ve used df.plot(), then you’ve unknowingly used matplotlib.

Plotting with Seaborn

Another popular library is seaborn, which is essentially a high-level wrapper around matplotlib and provides functions for some custom visualizations that require quite a bit of code to create in standard matplotlib. Another nice feature seaborn provides is sensible defaults for most options like axis labels, color schemes, and sizes of shapes.

Introducing Plotly

Plotly might sound like the new kid on the block, but in reality, it’s nothing like that. Plotly originally provided functionality in the form of a JavaScript library built on top of D3.js and later branched out into frontends for other languages like R, MATLAB and, of course, Python. plotly.py is the Python interface to the library.

As for usability, in my experience Plotly falls in between matplotlib and seaborn. It provides a lot of the same high-level plots as seaborn but, like matplotlib, also has extra options right there for you to tweak. It also has generally much better defaults than matplotlib.

Plotly’s interactivity

The most fascinating feature of Plotly is the interactivity. Plotly is fundamentally different from both matplotlib and seaborn because they render plots as static images, while Plotly uses the full power of JavaScript to provide interactive controls like zooming and panning across the plot. This functionality can also be extended to create powerful dashboards and responsive visualizations that can convey much more information than a static picture ever could.

First, let’s see how the three libraries differ in their output and complexity of code. I’ll use common statistical plots as examples.

To have a relatively even playing field, I’ll use the built-in seaborn theme that matplotlib comes with so that we don’t have to deduct points because of the plot’s looks.

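import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset('iris')  # assumption: seaborn's bundled iris dataset

fig, ax = plt.subplots(figsize=(8,6))

for species, species_df in iris.groupby('species'):
    ax.scatter(species_df['sepal_length'], species_df['sepal_width'], label=species);

ax.set(xlabel='Sepal Length', ylabel='Sepal Width', title='A Wild Scatterplot appears');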


Wild Scatterplot


fig, ax = plt.subplots(figsize=(8,6))

sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', hue='species', ax=ax);

ax.set(xlabel='Sepal Length', ylabel='Sepal Width', title='A Wild Scatterplot appears');


statistical plot


fig = go.FigureWidget()

for species, species_df in iris.groupby('species'):
    fig.add_scatter(x=species_df['sepal_length'], y=species_df['sepal_width'],
                    mode='markers', name=species);

fig.layout.hovermode = 'closest'
fig.layout.xaxis.title = 'Sepal Length'
fig.layout.yaxis.title = 'Sepal Width'
fig.layout.title = 'A Wild Scatterplot appears'


Looking at the plots, the matplotlib and seaborn plots are basically identical; the only difference is in the amount of code. The seaborn library has a nice interface to generate a colored scatter plot based on the hue argument, but in matplotlib we are basically creating three scatter plots on the same axis. The different colors are automatically assigned in both (from the default color cycle, though they can also be specified for customization). Other relatively minor differences are in the labels and legend, which seaborn creates automatically. This, in my experience, is less useful than it seems because datasets very rarely have nicely formatted column names. Usually they contain abbreviations or symbols, so you still have to assign ‘proper’ labels.

But we really want to see what Plotly has done, don’t we? This time I’ll start with the code. It’s eerily similar to matplotlib, apart from not sharing the exact syntax, of course, and the hovermode option. Hovering? Does that mean…? Yes, yes it does. Moving the cursor over a point reveals a tooltip showing the coordinates of the point and the class label. The tooltip can also be customized to show other information about the particular point. To the top right of the panel, there are controls to zoom, select, and pan across the plot. The legend is also interactive; it acts sort of like a set of checkboxes. You can click on a class to hide or show all the points of that class.

Since the amount and complexity of code aren’t drastically different from the other two options and we get all these interactivity options, I’d argue these are basically free benefits.

fig, ax = plt.subplots(figsize=(8,6))

grouped_df = iris.groupby('species').mean()
ax.bar(grouped_df.index, grouped_df['sepal_length']);

ax.set(xlabel='Species', ylabel='Average Sepal Length', title='A Wild Barchart appears');


wild bar chart


fig, ax = plt.subplots(figsize=(8,6))

sns.barplot(data=iris, x='species', y='sepal_length', estimator=np.mean, ax=ax);

ax.set(xlabel='Species', ylabel='Average Sepal Length', title='A Wild Barchart appears');


bar chart - Python plots


fig = go.FigureWidget()

grouped_df = iris.groupby('species').mean()
fig.add_bar(x=grouped_df.index, y=grouped_df['sepal_length']);

fig.layout.xaxis.title = 'Species'
fig.layout.yaxis.title = 'Average Sepal Length'
fig.layout.title = 'A Wild Barchart appears'


bar chart - python plot


The bar chart story is similar to the scatter plots. In this case, again, seaborn provides the option within the function call to specify the metric to be shown on the y axis using the x variable as the grouping variable. For the other two, we have to do this ourselves using pandas. Plotly still provides interactivity out of the box.
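To make the do-it-yourself aggregation concrete, here is a plain-Python sketch of what a call like iris.groupby('species').mean() computes for a single column. The rows are made-up illustrative values rather than the real iris data, and group_mean is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative rows: (species, sepal_length)
rows = [
    ("setosa", 5.0), ("setosa", 5.2),
    ("versicolor", 6.0), ("versicolor", 6.4),
    ("virginica", 6.5), ("virginica", 7.1),
]

def group_mean(pairs):
    """Return {group: mean of the values seen for that group}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}

# One bar per species; the dict values are the bar heights
bar_heights = group_mean(rows)
print(bar_heights)
```

The keys play the role of grouped_df.index and the values the role of grouped_df['sepal_length'] in the bar chart calls.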

Now that we’ve seen that Plotly can hold its own against our usual plotting options, let’s see what other benefits it can bring to the table. I will showcase some trace types in Plotly that are useful in a data science workflow, and how interactivity can make them more informative.


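fig = go.FigureWidget()

cor_mat = car_crashes.corr()
# Assumed call: add the correlation matrix as a heatmap trace
fig.add_heatmap(z=cor_mat.values, x=cor_mat.columns, y=cor_mat.index);

fig.layout.width = 500
fig.layout.height = 500
fig.layout.yaxis.automargin = True
fig.layout.title = 'A Wild Heatmap appears'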


heatmap - python

Heatmaps are commonly used to plot correlation or confusion matrices. As expected, we can hover over the squares to get more information about the variables. I’ll paint a picture for you. Suppose you have trained a linear regression model to predict something from this dataset. You can then show the appropriate coefficients in the hover tooltips to get a better idea of which correlations in the data the model has captured.
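For readers who want to see what each square of such a heatmap holds, the following is a minimal sketch of a Pearson correlation matrix (the default method of pandas' corr()) in plain Python. The column names and values are made up for illustration and are not the car_crashes data; pearson and correlation_matrix are hypothetical helper names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns: {name: list of numbers}. Returns a nested dict of coefficients."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up illustrative columns
data = {
    "speeding": [1.0, 2.0, 3.0, 4.0],
    "alcohol": [2.0, 4.0, 6.0, 8.0],   # rises in step with speeding
    "premium": [8.0, 6.0, 4.0, 2.0],   # falls as speeding rises
}
cor = correlation_matrix(data)
for name, row in cor.items():
    print(name, [round(value, 2) for value in row.values()])
```

Each inner value is what the tooltip would show for one square: values near 1 or -1 indicate a strong linear relationship, values near 0 a weak one.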

Parallel coordinates plot

fig = go.FigureWidget()

parcords = fig.add_parcoords(dimensions=[{'label':n.title(),
                                          'values':iris[n],
                                          'range':[0,8]} for n in iris.columns[:-2]])

fig.data[0].dimensions[0].constraintrange = [4,8]
parcords.line.color = iris['species_id']
parcords.line.colorscale = make_plotly(cl.scales['3']['qual']['Set2'], repeat=True)

parcords.line.colorbar.title = ''
parcords.line.colorbar.tickvals = np.unique(iris['species_id']).tolist()
parcords.line.colorbar.ticktext = np.unique(iris['species']).tolist()
fig.layout.title = 'A Wild Parallel Coordinates Plot appears'


parralel coordinates plot .gif


I suspect some of you might not yet be familiar with this visualization, as I wasn’t a few months ago. This is a parallel coordinates plot of four variables. Each variable is shown on a separate vertical axis. Each line corresponds to a row in the dataset and the color obviously shows which class that row belongs to. A thing that should jump out at you is that the class separation in each variable axis is clearly visible. For instance, the Petal_Length variable can be used to classify all the Setosa flowers very well.

Since the plot is interactive, the axes can be reordered by dragging to explore interconnectedness between the classes and how it affects the class separations. Another interesting interaction is the constrained range widget (the bright pink object on the Sepal_Length axis). It can be dragged up or down to decolor the plot. Imagine having these on all axes and finding a sweet spot where only one class is visible. As a side note, the decolored plot has a transparency effect on the lines so the density of values can be seen.

A version of this type of visualization also exists for categorical variables in Plotly. It is called Parallel Categories.

Choropleth plot

fig = go.FigureWidget()

choro = fig.add_choropleth(locations=gdp['CODE'],
                           z=gdp['GDP (BILLIONS)'],
                           text = gdp['COUNTRY'])

choro.marker.line.width = 0.1
choro.colorbar.tickprefix = '$'
choro.colorbar.title = 'GDP<br>Billions US$'
fig.layout.geo.showframe = False
fig.layout.geo.showcoastlines = False
fig.layout.title = 'A Wild Choropleth appears<br>Source:\
                    <a href="https://www.cia.gov/library/publications/the-world-factbook/fields/2195.html">\
                    CIA World Factbook</a>'


Choropleth | Data Science Dojo


A choropleth is a very commonly used geographical plot. The benefit of the interactivity should be clear in this one. We can only show a single variable using the color but the tooltip can be used for extra information. Zooming in is also very useful in this case, allowing us to look at the smaller countries. The plot title contains HTML which is being rendered properly. This can be used to create fancier labels.

Interactive scatter plot

fig = go.FigureWidget()

scatter_trace = fig.add_scattergl(x=diamonds['carat'], y=diamonds['price'],
                                  mode='markers', marker={'opacity':0.2});

fig.layout.hovermode = 'closest'
fig.layout.xaxis.title = 'Carat'
fig.layout.yaxis.title = 'Price'
fig.layout.title = 'A Wild Scatterplot appears'




I’m using the scattergl trace type here. This is a version of the scatter plot which uses WebGL in the background so that the interactions don’t get laggy even with larger datasets.

There is quite a bit of over-plotting here even with the aggressive transparency, so let’s zoom into the densest part to take a closer look. Zooming in reveals that the carat variable is quantized and there are clean vertical lines.

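def selection_handler(trace, points, selector):
    data_mean = np.mean(points.ys)
    fig.data[0].figure.layout.title.text = f'A Wild Scatterplot appears - mean price: ${data_mean:.1f}'

# attach the handler so it runs whenever points are selected
fig.data[0].on_selection(selection_handler)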



scatter plot


Selecting a bunch of points in this scatter plot will change the title of the plot to show the mean price of the selected points. This could prove to be very useful in a plot where there are groups and you want to visually see some statistics of a cluster.

This behavior is easily implemented using callback functions attached to predefined event handlers for each trace.
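As a rough sketch of the logic such a callback might contain, here it is written without Plotly so it runs anywhere. selection_title is a hypothetical helper: in a notebook it would be called from the handler with points.ys, and its result assigned to the figure title. Unlike a bare mean, it also guards against an empty selection.

```python
from statistics import mean

BASE_TITLE = "A Wild Scatterplot appears"

def selection_title(selected_prices):
    """Build the plot title for the currently selected points."""
    if not selected_prices:          # nothing selected: keep the plain title
        return BASE_TITLE
    return f"{BASE_TITLE} - mean price: ${mean(selected_prices):.1f}"

print(selection_title([300, 450, 600]))
print(selection_title([]))
```

Keeping the computation in a small pure function like this makes it easy to check on its own, separately from the widget wiring.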

More interactivity

Let’s do something fancier now.

fig1 = go.FigureWidget()
fig1.add_scattergl(x=exports['beef'], y=exports['total exports'],
                   mode='markers');
fig1.layout.hovermode = 'closest'
fig1.layout.xaxis.title = 'Beef Exports in Million US$'
fig1.layout.yaxis.title = 'Total Exports in Million US$'
fig1.layout.title = 'A Wild Scatterplot appears'

fig2 = go.FigureWidget()
# Assumption: state abbreviations are stored in a column named 'code'
fig2.add_choropleth(locations=exports['code'],
                    z=exports['total exports'].astype('float64'),
                    locationmode='USA-states');
fig2.data[0].marker.line.width = 0.1
fig2.data[0].marker.line.color = 'white'
fig2.data[0].marker.line.width = 2
fig2.data[0].colorbar.title = 'Exports Millions USD'
fig2.layout.geo.showframe = False
fig2.layout.geo.scope = 'usa'
fig2.layout.geo.showcoastlines = False
fig2.layout.title = 'A Wild Choropleth appears'

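def do_selection(trace, points, selector):
    # mirror a selection made in one figure onto the other
    if trace is fig2.data[0]:
        fig1.data[0].selectedpoints = points.point_inds
    else:
        fig2.data[0].selectedpoints = points.point_inds

# register the callback on both traces
fig1.data[0].on_selection(do_selection)
fig2.data[0].on_selection(do_selection)

HBox([fig1, fig2])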


scatterplot choropleth linked


We have already seen how to make scatter and choropleth plots so let’s put them to use and plot the same data-frame. Then, using the event handlers we also saw before, we can link both plots together and interactively explore which states produce which kinds of goods.

This kind of interactive exploration of different slices of the dataset is far more intuitive and natural than transforming the data in pandas and then plotting it again.

fig = go.FigureWidget()
fig.add_histogram(x=iris['sepal_length'],
                  histnorm='probability density');
fig.layout.xaxis.title = 'Sepal Length'
fig.layout.yaxis.title = 'Probability Density'
fig.layout.title = 'A Wild Histogram appears'

def change_binsize(s):
    fig.data[0].xbins.size = s
slider = interactive(change_binsize, s=(0.1,1,0.1))
label = Label('Bin Size: ')

VBox([HBox([label, slider]),
      fig])




Using the ipywidgets module’s interactive controls, different aspects of the plot can be changed to gain a better understanding of the data. Here the bin size of the histogram is being controlled.
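To see why the bin size matters for a histogram normalized as a probability density, here is a small plain-Python sketch: each bar's height is count / (n × bin size), so the bar areas always sum to 1 whatever bin size the slider picks. It is only an illustration of the normalization, not Plotly's exact binning algorithm, and density_histogram and the sample values are made up.

```python
import math

def density_histogram(values, bin_size):
    """Return (bin_starts, densities) such that sum(density * bin_size) == 1."""
    start = math.floor(min(values) / bin_size) * bin_size
    n_bins = int((max(values) - start) // bin_size) + 1
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - start) // bin_size), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    densities = [c / (total * bin_size) for c in counts]
    bin_starts = [start + i * bin_size for i in range(n_bins)]
    return bin_starts, densities

# Made-up sepal lengths
lengths = [4.6, 5.0, 5.1, 5.4, 5.8, 6.3, 6.4, 7.0]

for size in (0.5, 1.0):
    starts, dens = density_histogram(lengths, size)
    area = sum(d * size for d in dens)
    print(f"bin size {size}: {len(starts)} bins, total area {area:.2f}")
```

Smaller bins give more, narrower bars with noisier heights; larger bins smooth the shape but hide detail, which is the trade-off the slider lets you explore.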

fig = go.FigureWidget()

scatter_trace = fig.add_scattergl(x=diamonds['carat'], y=diamonds['price'],
                                  mode='markers', marker={'opacity':0.2});

fig.layout.hovermode = 'closest'
fig.layout.xaxis.title = 'Carat'
fig.layout.yaxis.title = 'Price'
fig.layout.title = 'A Wild Scatterplot appears'

def change_opacity(x):
    fig.data[0].marker.opacity = x
slider = interactive(change_opacity, x=(0.1,1,0.1))
label = Label('Marker Opacity: ')

VBox([HBox([label, slider]),
      fig])


scatter opacity


The opacity of the markers in this scatter plot is controlled by the slider. These examples only control the visual or layout aspects of the plot. We can also change the actual data which is being shown using dropdowns. I’ll leave you to explore that on your own.

What have we learned about Python plots

Let’s take a step back and sum up what we have learned. We saw that Plotly can reveal more information about our data using interactive controls, which we get for free and with no extra code. We saw a few interesting, slightly more complex visualizations available to us. We then combined the plots with custom widgets to create custom interactive workflows.

All this is just scratching the surface of what Plotly is capable of. There are many more trace types, an animations framework, and integration with Dash to create professional dashboards and probably a few other things that I don’t even know of.


Kickstart web scraping in 30-minutes with Python and BeautifulSoup
Data Science Dojo Staff
| February 7, 2023

Use Python and BeautifulSoup to web scrape. Web scraping is a very powerful tool to learn for any data professional. Make the entire internet your database.

Web scraping tutorial

With web scraping, the entire internet becomes your database. In this tutorial, we show you how to parse a web page into a data file (csv) using a Python package called BeautifulSoup.

web scraping

There are many services out there that augment their business data or even build out their entire business by using web scraping. For example, there is a Steam sales website that tracks and ranks Steam sales, updated hourly. Companies can also scrape product reviews from places like Amazon to stay up to date with what customers are saying about their products.

The code

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to web scrape from.
# In this example we web scrape graphics cards from Newegg.com
page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})

# name the output file to write to local disk
out_filename = "graphics_cards.csv"

# header of csv file to be written
headers = "brand,product_name,shipping\n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)

# loops over each product and grabs attributes about
# each product
for container in containers:
    # Finds all link tags "a" from within the first div.
    make_rating_sp = container.div.select("a")

    # Grabs the title from the image title attribute
    # Then does proper casing using .title()
    brand = make_rating_sp[0].img["title"].title()

    # Grabs the text within the "a" tag at index 2 from within
    # the list of queries.
    product_name = container.div.select("a")[2].text

    # Grabs the product shipping information by searching
    # all lists with the class "price-ship".
    # Then cleans the text of white space with strip()
    # Cleans the strip of "Shipping $" if it exists to just get number
    shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")

    # prints the dataset to console
    print("brand: " + brand + "\n")
    print("product_name: " + product_name + "\n")
    print("shipping: " + shipping + "\n")

    # writes the dataset to file
    f.write(brand + ", " + product_name.replace(",", "|") + ", " + shipping + "\n")

f.close()  # Close the file
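
The script above replaces commas in product names with "|" so they do not break the CSV columns. One possible alternative, sketched here with Python's standard csv module, is to let the writer quote such fields instead. The rows and the rows_to_csv helper are made up for illustration and are not part of the original script.

```python
import csv
import io

# Hypothetical rows standing in for scraped (brand, product_name, shipping) values.
rows = [
    ("Evga", "GeForce GTX 1070, 8GB GDDR5", "Free"),
    ("Msi", "GeForce GTX 1060 6GB", "4.99"),
]


def rows_to_csv(rows):
    """Return CSV text with a header row; csv.writer quotes fields containing commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["brand", "product_name", "shipping"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)

# Reading the text back recovers the original product name, comma included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

To write to disk instead of an in-memory buffer, the same writer can wrap a file opened with open(out_filename, "w", newline="").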

For more info, there’s a script that does the same thing in R

Want to learn more data science techniques in Python? Take a look at this introduction to Python for Data Science

Ali Mohsin
| July 18, 2022

Data Science Dojo has launched its Jupyter Hub for Computer Vision using Python offering on the Azure Marketplace. It comes with pre-installed libraries and pre-cloned GitHub repositories of famous computer vision books and courses, enabling learners to run the example code provided.

What is computer vision?

It is a field of artificial intelligence that enables machines to derive meaningful information from visual inputs.

Computer vision using Python

In the world of computer vision, Python is a mainstay. Even if you are a beginner, or the application you are reviewing was written by one, Python code is straightforward to understand. Because most of the code is easy to read, developers can devote more time to the areas that need it.


Challenges for individuals

Individuals who want to start working with digital images usually lack the resources to gain hands-on experience with computer vision. A beginner also faces compatibility issues while installing libraries, along with the following challenges:

  1. Image noise and variability: Images can be noisy or low quality, which can make it difficult for algorithms to accurately interpret them.
  2. Scale and resolution: Objects in an image can be at different scales and resolutions, which can make it difficult for algorithms to recognize them.
  3. Occlusion and clutter: Objects in an image can be occluded or cluttered, which can make it difficult for algorithms to distinguish them.
  4. Illumination and lighting: Changes in lighting conditions can significantly affect the appearance of objects in an image, making it difficult for algorithms to recognize them.
  5. Viewpoint and pose: The orientation of objects in an image can vary, which can make it difficult for algorithms to recognize them.
  6. Background distractions: Background distractions can make it difficult for algorithms to focus on the relevant objects in an image.
  7. Real-time performance: Many applications require real-time performance, which can be a challenge for algorithms to achieve.


What we provide

Jupyter Hub for Computer Vision using Python addresses the setup challenges by providing an effortless coding environment in the cloud with pre-installed computer vision Python libraries. This reduces the burden of installation and maintenance and resolves the compatibility issues an individual would otherwise face.

Moreover, this offer provides the learner with repositories of famous books and courses on computer vision. These contain helpful notebooks that serve as a learning resource for gaining hands-on experience.

The heavy computations required for computer vision applications are not performed on the learner’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.

Listed below are the pre-installed python libraries and the sources of repositories of Computer Vision books provided by this offer:

Python libraries

  • Numpy
  • Matplotlib
  • Pandas
  • Seaborn
  • OpenCV
  • Scikit Image
  • Simple CV
  • PyTorch
  • Torchvision
  • Pillow
  • Tesseract
  • Pytorchcv
  • Fastai
  • Keras
  • TensorFlow
  • Imutils
  • Albumentations


  • GitHub repository of book Modern Computer Vision with PyTorch, by authors V Kishore Ayyadevara and Yeshwanth Reddy.
  • GitHub repository of Computer Vision Nanodegree Program, by Udacity.
  • GitHub repository of book OpenCV 3 Computer Vision with Python Cookbook, by author Aleksandr Rybnikov.
  • GitHub repository of book Hands-On Computer Vision with TensorFlow 2, by authors Benjamin Planche and Eliot Andres.


Jupyter Hub for Computer Vision using Python provides an in-browser coding environment with just a single click, which makes installation easy. Through this offer, a learner can dive into the world of computer vision and work with its various applications, including automotive safety, self-driving cars, medical imaging, fraud detection, surveillance, intelligent video analytics, image segmentation, and optical character recognition (OCR).

Jupyter Hub for Computer Vision using Python offered by Data Science Dojo is ideal for learning more about the subject without having to worry about configurations and computing resources. The heavy resource requirements of processing and analyzing large images are no longer an issue, as data-intensive computations are now performed on Microsoft Azure, which increases processing speed.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook environment dedicated specifically to computer vision using Python. Install the Jupyter Hub offer now from the Azure Marketplace, your ideal companion in your journey to learn data science!

Waasif Nadeem
| July 21, 2022

This blog explains what transfer learning is, its benefits in image processing and how it’s applied. We will implement a model and train it for transfer learning using Keras.

What is transfer learning?

What if I told you that a network that classifies 10 different types of vehicles can provide useful knowledge for a classification problem with 3 different types of cars? This is called transfer learning – a method that uses pre-trained neural networks to solve a new, similar problem.

Over the years, people have been trying to produce different methods to train neural networks with small amounts of data. Those methods are used to generate more data for training. However, transfer learning provides an alternative by learning from existing architectures (trained on large datasets) and further training them for our new problem. This method reduces the training time and gives us a high accuracy in results for small datasets.

In image processing, the initial layers of the convolutional neural network (CNN) tend to learn basic features like the edges and boundaries in the image, while the deeper layers learn more complex features like tires of a vehicle, eyes of an animal, and various others describing the image in minute detail. The features learned by the initial layers are almost the same for different problems.

This is why, when using transfer learning, we only train the latter layers of the network. Since we only have to train the network for a few layers now, the learning is much faster, and we can achieve high accuracy even with a smaller dataset.

An image from the second hidden layer of VGG19 model showing that the filters in the layer detect the edges in the image

Getting started with the implementation

Now, let’s look at an example of applying transfer learning using the Keras library in Python. Keras provides us with many pre-trained networks by default that we can simply load and use for our tasks. For our implementation, we will use VGG19 as our reference pre-trained model. We want to train a model, so it can predict if the input image is a rickshaw (3-wheeler closed vehicle), tanga (horse/ donkey cart), or qingqi (3-wheeler open vehicle).

Data preparation

First, we will clone the dataset from GitHub, with images of rickshaw, qingqi, and tanga in our working directory using the following command.

!git clone https://github.com/MMFa666/VehicleDataset.git

Now, we will import all the libraries that we need for our task:

cv2 : Library to read an image as a matrix
numpy : Provides a mathematical toolkit
os : Helps read files from their given paths
matplotlib : Used for plotting
keras : Provides us with the toolkit to build and train the model

import cv2
import numpy as np
import os
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras import Model
from tensorflow import keras

At this point, the images in the dataset are labeled as strings mentioning the names of the vehicle types. We need to convert the labels into quantitative formats by vectorizing each vehicle type.

In a vector with 3 values, each value can be either 1 or 0. The first value corresponds to the vehicle type ‘qingqi’, the second corresponds to the vehicle type ‘rickshaw’, and the third to ‘tanga’. If the label is ‘qingqi’, the first value of the vector would be 1 and the rest would be 0, representing the unique vehicle type in the vector. These vectors are also called one-hot vectors because they have only one entry as ‘1’. It is an essential step for evaluating our model in the training process.
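
As a minimal sketch of this encoding, the helper below builds a one-hot vector for each vehicle type. The CLASSES list and the one_hot function are illustrative names and are not part of the original notebook; the class order follows the description above.

```python
# Class order follows the text: qingqi, rickshaw, tanga.
CLASSES = ["qingqi", "rickshaw", "tanga"]


def one_hot(label, classes=CLASSES):
    """Return a list with 1 at the position of `label` and 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown label: {label}")
    return [1 if name == label else 0 for name in classes]


for name in CLASSES:
    print(name, one_hot(name))
```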


Once we have vectorized, we will store the complete paths of all the images in a list so that they can be used to read the image as a matrix.

list_qingqi = [('/content/VehicleDataset/train/qingqi/' + i) for i in os.listdir('/content/VehicleDataset/train/qingqi')]
list_rickshaw =[('/content/VehicleDataset/train/rickshaw/' + i) for i in os.listdir('/content/VehicleDataset/train/rickshaw')]
list_tanga = [('/content/VehicleDataset/train/tanga/' + i) for i in os.listdir('/content/VehicleDataset/train/tanga')]
paths = list_qingqi + list_rickshaw + list_tanga 

Using the CV2 library, we will now read each image in the form of a matrix from its path. Moreover, each image is then re-sized to 224 x 224 which is the input shape for our network. These matrices are stored in a list named X.

Similarly, using the one-hot vectors described above, we will make a list of Y labels that correspond to each image in X. The two lists, X and Y, are then randomly shuffled while keeping the correspondence of X and Y unchanged. This prevents the model training from being biased towards any specific output label.

Finally, we use the train_test_split function from sklearn to split our dataset into training and testing data.

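from sklearn.model_selection import train_test_split

# Shuffle images and labels with the same permutation so they stay aligned
perm = np.random.permutation(len(paths))
X = np.array([cv2.resize(cv2.imread(j) / 255, (224,224)) for j in paths])[perm]
Y = []
for i in range(len(paths)):
    if i < len(list_qingqi):
        Y.append([1, 0, 0])  # qingqi
    elif i < len(list_qingqi) + len(list_rickshaw):
        Y.append([0, 1, 0])  # rickshaw
    else:
        Y.append([0, 0, 1])  # tanga
Y = np.array(Y)[perm]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=42)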

Model set up

Step 1:

Keras library in Python provides us with various pre-trained networks by default. For our case, we would simply load the VGG19 model.

pretrained_model = keras.applications.vgg19.VGG19()

Step 2:

VGG19 is a 19-layer network trained on ImageNet to classify 1,000 different object categories. However, we need the model to predict only 3 different objects. So, we initialize a new model and copy all the layers from VGG19 except its 1,000-node output layer, then add our own output layer with 3 outputs.

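model = keras.Sequential()
# Copy every layer except the original output layer
for layer in pretrained_model.layers[:-1]:
    model.add(layer)
# New output layer with one node per vehicle type
model.add(Dense(3, activation = 'softmax'))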

Step 3:

Since we do not want to train all the layers of the network but only the latter ones, we freeze the weights for the first 15 layers of the network and train only the last 10 layers.

# Freeze the first 15 layers; the remaining layers stay trainable
for layer in model.layers[:15]:
    layer.trainable = False


Model training

Step 1

Define the hyper parameters that we need for the training.

learning_rate = 0.0001
batch_size = 32
epochs = 100
input_shape = (224, 224, 3)

Note: Keep the learning rate small when fine-tuning. A large learning rate can overwrite the pre-trained weights and make training unstable.

Step 2

Compile and fit the model to the training data using the hyperparameters and the loss function as ‘binary_crossentropy’.

model.compile(optimizer = RMSprop(learning_rate = learning_rate), loss = 'binary_crossentropy', metrics = ['accuracy'])
hist = model.fit(x = X_train, y = Y_train, batch_size = batch_size, epochs = epochs)

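Note: You might need to experiment with the loss function and the hyperparameters depending upon your problem. For a multi-class softmax output with one-hot labels like this one, ‘categorical_crossentropy’ is the more conventional choice.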

Step 3 (Optional)

Visualize the loss and accuracy corresponding to the iteration number while training the network.

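# Draw each curve in its own figure so the titles do not overwrite each other
plt.plot(np.arange(epochs), hist.history['loss'])
plt.title('Loss plot')
plt.show()
plt.plot(np.arange(epochs), hist.history['accuracy'])
plt.title('Accuracy plot')
plt.show()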
Loss plot


We have now successfully trained a network using transfer learning to identify images of tanga, qingqi and rickshaw. Let’s test the network to see if it works well with the unseen data.

  1. Use the testing data we separated earlier to make predictions.
  2. The network predicts the probability of the image belonging to a class.
  3. Convert the probabilities into one-hot vectors to assign a vehicle type to the image.
  4. Calculate the accuracy of the network on unseen data.

predictions = model.predict(X_test)
# Turn each row of probabilities into a one-hot vector
predict_vec = np.zeros_like(predictions)
predict_vec[np.arange(len(predictions)), predictions.argmax(1)] = 1
total = 0
true = 0
for i in range(len(predict_vec)):
    # A prediction is correct when its one-hot vector matches the label exactly
    if np.sum(abs(np.array(predict_vec[i]) - np.array(Y_test[i]))) == 0:
        true += 1
    total += 1
accuracy_test = true / total
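
The same accuracy can also be computed without an explicit loop by comparing class indices. The sketch below uses small made-up arrays in place of the real model output and test labels, so the numbers are purely illustrative.

```python
import numpy as np

# Made-up softmax outputs for four images and their one-hot labels.
predictions = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# argmax picks the most probable class per row; comparing the indices
# is equivalent to comparing the one-hot vectors element by element.
predicted_classes = predictions.argmax(axis=1)
true_classes = y_true.argmax(axis=1)
accuracy = float(np.mean(predicted_classes == true_classes))
print(accuracy)
```

Here three of the four made-up predictions match their labels, so the printed accuracy is 0.75.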

We get an accuracy of 91% on the unseen data, which is quite promising, as we only used 216 images for our training. Moreover, we were able to train the network within one minute using a GPU accelerator, while originally the VGG19 model takes a couple of hours to train.


Transfer learning models focus on storing knowledge gained while solving one problem and applying it to a different but related problem. Nowadays, many industries like gaming, healthcare, and autonomous driving are using transfer learning. It would be too early to comment on whether transfer learning is the ultimate solution to the classification problems with small datasets. However, it has surely shown us a direction to move ahead.

Upgrade your data science skillset with our Python for Data Science and Data Science Bootcamp training!

Top Python packages for data science and how to best use them

Finding the top python packages and libraries that aren’t only popular, but get the job done isn’t easy. Here’s a list to help you out.

Out of all the Python scientific libraries and packages available, which ones are not only popular but the most useful in getting the job done?

Python packages and libraries

To help you filter down a list of libraries and packages worth adding to your data science toolbox, we have compiled our top picks for aspiring and practicing data scientists. But you’ll also want to know how to best use these tools for tricky, real-world data problems. So instead of leaving you with yet another top choice list among a quintillion list, we explain how to make the most of these libraries using real-world examples.

You can learn more about how these packages fit into data science with Data Science Dojo’s introduction to Python course.

Data manipulation


There’s a reason why pandas consistently tops published ranks on data science related libraries in Python. The library can help you with a variety of tasks, but it is particularly useful for data manipulation or data wrangling. It can save you a lot of leg work in not only your typical rudimentary data manipulation tasks, but in handling some pretty tricky problems you might encounter when slicing and filtering.

Multi-indexed data can be one of these tricky tasks. The library pandas takes care of advanced indexing, including multi-indexing, where you might need to work with higher-dimensional data or multiple index levels. For example, number of user interactions might be indexed by 1) product category, 2) time of day user interacted with the product, and 3) location of the user.

Instead of your typical table of rows and columns to represent the data, you might find it better to organize the number of user interactions into all cases that fall under x product category, with y time of day, and z location. This way you can easily see user interactions across each condition of product category, time of day, and user location. This saves you from having to apply a filter or group for all combinations of conditions in your traditional row-and-table structure.

Here is one way to multi-index data in pandas. With less than a few lines of code, pandas makes this easy to implement in Python:

import pandas as pd

data_multi_indx = table_data.set_index(['Product', 'Day of Week'])
                      Location  Num User Interactions
Product   Day of Week
Product 1 Morning            A                      3
          Morning            B                     90
          Morning            C                      7
          Afternoon          A                     17
          Afternoon          B                      1
          Afternoon          C                     82
Product 2 Morning            A                     27
          Morning            B                     70
          Morning            C                      3
          Afternoon          A                      1
          Afternoon          B                      1
          Afternoon          C                     98
Product 3 Morning            A                     94
          Morning            B                      5
          Morning            C                      1
          Afternoon          A                      0
          Afternoon          B                      7
          Afternoon          C                     93
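
The snippet above assumes an existing table_data frame. As a self-contained sketch, the example below builds a small hypothetical table with the same column names, sets the multi-index, and then selects and aggregates by index level. The values are a made-up subset for illustration only.

```python
import pandas as pd

# Hypothetical flat table; column names mirror the output shown above.
table_data = pd.DataFrame({
    "Product": ["Product 1", "Product 1", "Product 1", "Product 1"],
    "Day of Week": ["Morning", "Morning", "Afternoon", "Afternoon"],
    "Location": ["A", "B", "A", "B"],
    "Num User Interactions": [3, 90, 17, 1],
})

# Sorting the index after setting it keeps label-based lookups efficient.
data_multi_indx = table_data.set_index(["Product", "Day of Week"]).sort_index()

# Select every row for one (product, time of day) combination.
morning = data_multi_indx.loc[("Product 1", "Morning"), :]
print(morning)

# Total interactions per index combination.
totals = data_multi_indx.groupby(level=["Product", "Day of Week"])["Num User Interactions"].sum()
print(totals)
```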

For the more rudimentary data manipulation tasks, pandas doesn’t require much effort on your part. You can simply use the functions available for imputing missing values, one-hot encoding, dropping columns and rows, and so on.

Here are a few example classes and functions in pandas that make rudimentary data manipulation easy in a few lines of code, at most.

For more lessons with Pandas, visit Data Independent.

Feature | Description
fillna(value) | Fill in missing values on a column or the whole data frame with a value such as the mean, median, or mode.
isna(data) / isnull(data) | Check for missing values.
get_dummies(data_frame['Column']) | Apply one-hot encoding on a column.
to_numeric(data_frame['Column']) | Convert a column of values from strings to numeric values.
data_frame['Column'].astype(str) | Convert a column of values from numeric values to strings.
to_datetime(data_frame['Column']) | Convert a column of datetimes in string format to standard datetime format.
drop(columns=['Column0','Column1']) | Drop specific columns or useless columns in your data frame.
drop(data_frame.index[[rownum0,rownum1]]) | Drop specific rows or useless rows in your data frame.
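
A short sketch of how several of these functions might be combined on one data frame follows. The frame and its column names (score, color, count, day) are made up for illustration.

```python
import pandas as pd

# Small made-up frame with a missing value, a categorical column,
# numbers stored as strings, and dates stored as strings.
df = pd.DataFrame({
    "score": [1.0, None, 3.0],
    "color": ["red", "blue", "red"],
    "count": ["10", "20", "30"],
    "day": ["2023-01-01", "2023-01-02", "2023-01-03"],
})

df["score"] = df["score"].fillna(df["score"].mean())  # impute with the column mean
df["count"] = pd.to_numeric(df["count"])              # strings -> numbers
df["day"] = pd.to_datetime(df["day"])                 # strings -> datetimes
dummies = pd.get_dummies(df["color"])                 # one-hot encode the column
df = df.drop(columns=["color"]).join(dummies)         # swap in the encoded columns
df = df.drop(df.index[[0]])                           # drop the first row
print(df)
```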


Another library that keeps topping the ranks is numpy. This library can handle many tasks, but it is particularly useful when working with multi-dimensional arrays and performing calculations on these arrays. This can be tricky to do in more conventional ways, where you need to find the index of a value or certain values inside another index, with multiple indices.


This is where numpy shows its strength. Its array() function means standard arrays can be simply added and nicely bundled into a multi-dimensional array. Calculations on these arrays can also be easily implemented using numpy’s vast array (pun intended) of mathematical functions.

Let’s picture an example where numpy’s multi-dimensional arrays are useful. A company tracks or records if a user was/was not shown a mobile product in the morning, afternoon, and night, delivered through a mobile notification. Based on the level of user interaction with the shown product, the company also records a user engagement score.

Data points on each user’s shown product and engagement score are stored inside an array; each array stores these values for each user. The company would like to quickly and simply bundle all user arrays.

In addition to this, using engagement score and purchase history, the company would like to calculate and identify the minimum distance (or difference) across all users’ data points so that users who follow a similar pattern can be categorized and targeted accordingly.

numpy’s array() makes it easy to bundle user arrays into a multi-dimensional array and argmin() and linalg.norm() find the min Euclidean distance between users, as an example of the kinds of calculations that can be done on a multi-dimensional array:

import numpy as np
import pandas as pd

# Records tracking whether user was/was not shown product during
# morning, afternoon, and night, and user engagement score
user_0 = [0,0,1,0.7]
user_1 = [0,1,0,0.4]
user_2 = [1,0,0,0.0]
user_3 = [0,0,1,0.9]
user_4 = [0,1,0,0.3]
user_5 = [1,0,0,0.0]
# Create a multi-dimensional array to bundle all users
# Can use arrays with mixed data types by specifying 
# the object data type in numpy multi-dimensional arrays
users_multi_dim = np.array([user_0,user_1,user_2,user_3,user_4,user_5],dtype=object)
print(users_multi_dim)
[[0 0 1 0.7]
 [0 1 0 0.4]
 [1 0 0 0.0]
 [0 0 1 0.9]
 [0 1 0 0.3]
 [1 0 0 0.0]]
# To view which user was/was not shown the product
# either morning, afternoon or night, pandas easily 
# allows you to index and label the data
row_names = ['User 0','User 1','User 2','User 3','User 4','User 5']
col_names = ['Product Shown Morning','Product Shown Afternoon',
             'Product Shown Night','User Engagement Score']
users_df_indexed = pd.DataFrame(users_multi_dim,index=row_names,columns=col_names)
print(users_df_indexed)
       Product Shown Morning Product Shown Afternoon Product Shown Night User Engagement Score
User 0                     0                       0                   1                   0.7
User 1                     0                       1                   0                   0.4
User 2                     1                       0                   0                     0
User 3                     0                       0                   1                   0.9
User 4                     0                       1                   0                   0.3
User 5                     1                       0                   0                     0
# Find which existing user is closest to the engagement 
# and purchase behavior of a new user by calculating the 
# min Euclidean distance on a numpy multi-dimensional array
user_0 = [0.7,51.90,2]
user_1 = [0.4,25.95,1]
user_2 = [0.0,0.00,0]
user_3 = [0.9,77.85,3]
user_4 = [0.3,25.95,1]
user_5 = [0.0,0.00,0]
users_multi_dim = np.array([user_0,user_1,user_2,user_3,user_4,user_5])
new_user = np.array([0.8,77.85,3])
closest_to_new = np.argmin(np.linalg.norm(users_multi_dim-new_user,axis=1))
print('User', closest_to_new, 'is closest to the new user')
User 3 is closest to the new user

Data modeling


The main strength of statsmodels is its focus on statistics, going beyond the ‘machine learning out-of-the-box’ approach. This makes it a popular choice for data scientists. Conducting statistical tests to find significantly different variables, checking for normality in your data, checking the standard errors, and so on, should not be underestimated when trying to build the most effective model you can. Your model is only as good as your inputs, and statsmodels is designed to help you better understand and customize your inputs.

The library also covers an exhaustive list of predictive models to choose from, depending on your predictors and outcome variable(s). It covers your classic Linear Regression models (including ordinary least squares, weighted least squares, recursive least squares, and more), Generalized Linear models, Linear Mixed Effects models, Binomial and Poisson Bayesian models, Logit and Probit models, Time Series models (including autoregressive integrated moving average, dynamic factor, unobserved component, and more), Hidden Markov models, Principal Components and other techniques for Multivariate models, Kernel Density estimators, and lots more.

Here are the classes and functions in statsmodels that cover the main modeling techniques useful for many prediction tasks.

Classes and functions in statsmodel - Python packages


Any library that makes machine learning more accessible and easier to implement is bound to make the top choice list among aspiring and practicing data scientists. The library scikit-learn not only allows models to be easily implemented out-of-the-box but also offers some auto fine tuning.

Finding the best possible combination of model parameters is a key example of fine tuning. The library offers a few good ways to search for the optimal set of parameters, given the algorithm and problem to solve. The grid search and random search algorithms in scikit-learn evaluate different combinations of parameters until they find the best combo that results in the best outcome, or a better performing model.

The grid search goes through every possible combination, whereas the random search randomly samples the parameters over a fixed number of times/iterations. Cross validating your model on many subsets of data is also easy to implement using scikit-learn. With this kind of automation, the library offers data scientists a massive time saver when building models.
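
As a rough sketch of the two search strategies, the example below tunes a Support Vector Machine on scikit-learn's bundled iris dataset with GridSearchCV and RandomizedSearchCV. The dataset, the candidate parameter values, and n_iter=4 are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values are illustrative, not tuned recommendations.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Grid search: tries all 3 x 2 = 6 combinations with 5-fold cross-validation.
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print("grid search:", grid.best_params_, round(grid.best_score_, 3))

# Random search: samples a fixed number of combinations (n_iter) instead.
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=4, cv=5, random_state=0)
rand.fit(X, y)
print("random search:", rand.best_params_, round(rand.best_score_, 3))
```

Because the random search only evaluates a subset of the same grid, its best cross-validated score can match but not exceed the grid search result here. Its advantage is the lower cost when the grid is large.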

The library also covers all the essential machine learning models from classification (including Support Vector Machine, Random Forest, etc), to regression (including Ridge Regression, Lasso Regression, etc), and clustering (including k-Means, Mean Shift, etc).

Here are the classes and functions in scikit-learn that cover the main modeling techniques useful for many prediction tasks.

Feature Description


Classification models: Support Vector Machine, Gaussian Naïve Bayes, Logistic Regression, Decision Tree, Random Forest, Stochastic Gradient Descent, Multi-Layer Perceptron


Regression models: Ridge Regression, Lasso Regression, Support Vector Machine, Decision Tree, Random Forest, Stochastic Gradient Descent, Multi-Layer Perceptron
KMeans(), AffinityPropagation(), MeanShift(), AgglomerativeClustering() Clustering models: k-Means, Affinity Propagation, Mean Shift, Agglomerative Hierarchical Clustering

Data visualization


The libraries matplotlib and seaborn will easily take care of your basic static plot functions, which are important for your own internal exploration or understanding of the data. But when presenting visual insights to business folks or users, interactivity is where we are headed these days.

Using JavaScript functionality, plotly renders interactive graphs in the form of zooming in and panning out of the graph panel, hovering over objects for more information, and dragging objects into position to further explore relationships in the data. Graphs can be customized to your heart’s content.

Here are just a few of many tricks that plotly offers:

Feature | Description
hovermode, hoverinfo | Controls the mode and text when a user hovers over an object.
on_selection(), on_click() | Allows a user to select or click on an object and have that selected object change color, for example.
update | Modifies a graph’s layout and data such as titles and annotations.
animate | Creates an animated graph.


Much like plotly, bokeh also offers interactive graphs. But one feature that stands out in bokeh is linked interactions. This is useful when keeping separate graphs in unison, where the user interacts with one graph and needs to compare it with the other while they are in sync. For example, a user zooms into a graph, effectively changing the range of the graph, and then would like to compare it with the second graph. The second graph would need to automatically update its range so that both graphs can be easily compared like-for-like.

Here are some key tricks that bokeh offers:

Feature | Description
figure() | Creates a new plot and allows linking to the range of another plot.
HoverTool(), hover_glyph | Allows user to hover over an object for more information.
selection_glyph | Selects a particular glyph object for styling.
Slider() | Creates a slider to dynamically update the plot based on the slide range.
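
A minimal sketch of linked ranges, assuming bokeh is installed, follows. The data and plot titles are made up, and the show() call is left commented out so the snippet runs without opening a browser.

```python
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

x = [1, 2, 3, 4, 5]
y_sales = [3, 7, 8, 5, 1]         # made-up data for the first plot
y_visits = [30, 60, 90, 50, 20]   # made-up data for the second plot

left = figure(title="Sales")
left.line(x, y_sales)

# Passing the first plot's x_range shares the same range object,
# so panning or zooming one plot moves the other along the x-axis.
right = figure(title="Visits", x_range=left.x_range)
right.line(x, y_visits)

layout = gridplot([[left, right]])
# show(layout)  # uncomment to open the linked plots in a browser
```

Sharing y_range in the same way would link the vertical axis as well. It is left unlinked here because the two made-up series are on different scales.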
