For a hands-on learning experience to develop LLM applications, join our LLM Bootcamp today.
Early Bird Discount Ending Soon!

matplotlib

Data Science Dojo Staff

15 Python Packages You Must Know to Maximize Your Coding Productivity

Python is a versatile and powerful programming language! Whether you’re a seasoned developer or just stepping into coding, Python’s simplicity and readability make it a favorite among programmers.

One of the main reasons for its popularity is the vast array of libraries and packages available for data manipulation, analysis, and visualization. But what truly sets it apart is the vast ecosystem of Python packages. It makes Python the go-to language for countless applications.

Learn the top 6 Popular Python libraries for Data Science

While its clean syntax and dynamic nature allow developers to bring their ideas to life with ease, the true magic it offers is in the form of Python packages. It is similar to having a toolbox filled with pre-built solutions for all of your problems.

In this blog, we’ll explore the top 15 Python packages that every developer should know about. So, buckle up and enhance your Python journey with these incredible tools! However, before looking at the list, let’s understand what Python packages are.

What are Python Packages?

Python packages are a fundamental aspect of the Python programming language. These packages are designed to organize and distribute code efficiently. These are collections of modules that are bundled together to provide a particular functionality or feature to the user.

Understand the difference between Java and Python

Common examples of widely used Python packages include pandaswhich groups modules for data manipulation and analysis, while matplotlib organizes modules for creating visualizations.

The Structure of a Python Package

A Python package refers to a directory that contains multiple modules and a special file named __init__.py. This file is crucial as it signals Python that the directory should be treated as a package. These packages enable you to logically group and distribute functionality, making your projects modular, scalable, and easier to maintain.

Here’s a simple breakdown of a typical package structure:

1. Package Directory: This is the main folder that holds all the components of the package.

2. `__init__.py` File: This file can be empty or contain an initialization code for the package. Its presence is what makes the directory a package.

3. Modules: These are individual Python files within the package directory. Each module can contain functions, classes, and variables that contribute to the package’s overall functionality.

4. Sub-packages: Packages can also contain sub-packages, which are directories within the main package directory. These sub-packages follow the same structure, with their own `__init__.py` files and modules.

The above structure is useful for developers to:

Reuse code: Write once and use it across multiple projects
Organize projects: Keep related functionality grouped together
Prevent conflicts: Use namespaces to avoid naming collisions between modules

Thus, the modular approach not only enhances code readability but also simplifies the process of managing large projects. It makes Python packages the building blocks that empower developers to create robust and scalable applications.

Top 15 Python Packages You Must Explore

Let’s navigate through a list of some of the top Python packages that you should consider adding to your toolbox. For 2025, here are some essential Python packages to know across different domains, reflecting the evolving trends in data science, machine learning, and general development:

Core Libraries for Data Analysis

1. NumPy

Numerical Python, or NumPy, is a fundamental package for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices. It is a core library widely used in data analysis, scientific computing, and machine learning.

NumPy introduces the ndarray object for efficient storage and manipulation of large datasets, outperforming Python’s built-in lists in numerical operations. It also offers a comprehensive suite of mathematical functions, including arithmetic operations, statistical functions, and linear algebra operations for complex numerical computations.

NumPy’s key features include broadcasting for arithmetic operations on arrays of different shapes. It can also interface with C/C++ and Fortran, integrating high-performance code with Python and optimizing performance.

NumPy arrays are stored in contiguous memory blocks, ensuring efficient data access and manipulation. It also supports random number generation for simulations and statistical sampling. As the foundation for many other data analysis libraries like Pandas, SciPy, and Matplotlib, NumPy ensures seamless integration and enhances the capabilities of these libraries.

2. Pandas

Pandas is a widely-used open-source library in Python that provides powerful data structures and tools for data analysis. Built on top of NumPy, it simplifies data manipulation and analysis with its two primary data structures: Series and DataFrame.

A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional table-like structure with labeled axes. These structures allow for efficient data alignment, indexing, and manipulation, making it easy to clean, prepare, and transform data.

Pandas also excels in handling time series data, performing group by operations, and integrating with other libraries like NumPy and Matplotlib. The package is essential for tasks such as data wrangling, exploratory data analysis (EDA), statistical analysis, and data visualization.

It offers robust input and output tools to read and write data from various formats, including CSV, Excel, and SQL databases. This versatility makes it a go-to tool for data scientists and analysts across various fields, enabling them to efficiently organize, analyze, and visualize data trends and patterns.

Learn to use Pandas agent of time-series analysis

3. Dask

Dask is a robust Python library designed to enhance parallel computing and efficient data analysis. It extends the capabilities of popular libraries like NumPy and Pandas, allowing users to handle larger-than-memory datasets and perform complex computations with ease.

Dask’s key features include parallel and distributed computing, which utilizes multiple cores on a single machine or across a distributed cluster to speed up data processing tasks. It also offers scalable data structures, such as arrays and dataframes, that manage datasets too large to fit into memory, enabling out-of-core computation.

Dask integrates seamlessly with existing Python libraries like NumPy, Pandas, and Scikit-learn, allowing users to scale their workflows with minimal code changes. Its dynamic task scheduler optimizes task execution based on available resources.

With an API that mirrors familiar libraries, Dask is easy to learn and use. It supports advanced analytics and machine learning workflows for training models on big data. Dask also offers interactive computing, enabling real-time exploration and manipulation of large datasets, making it ideal for data exploration and iterative analysis.

Visualization Tools

4. Matplotlib

Matplotlib is a plotting library for Python to create static, interactive, and animated visualizations. It is a foundational tool for data visualization in Python, enabling users to transform data into insightful graphs and charts.

It enables the creation of a wide range of plots, including line graphs, bar charts, histograms, scatter plots, and more. Its design is inspired by MATLAB, making it familiar to users, and it integrates seamlessly with other Python libraries like NumPy and Pandas, enhancing its utility in data analysis workflows.

Learn about easily building AI-based chatbots in Python

Key features of Matplotlib include its ability to produce high-quality, publication-ready figures in various formats such as PNG, PDF, and SVG. It also offers extensive customization options, allowing users to adjust plot elements like colors, labels, and line styles to suit their needs.

Matplotlib supports interactive plots, enabling users to zoom, pan, and update plots in real time. It provides a comprehensive set of tools for creating complex visualizations, such as subplots and 3D plots, and supports integration with graphical user interface (GUI) toolkits, making it a powerful tool for developing interactive applications.

Master the creation of a rule-based chatbot in Python

5. Seaborn

Seaborn is a Python data visualization library built on top of Matplotlib for aesthetically pleasing and informative statistical graphics. It provides a high-level interface for drawing attractive and informative statistical graphics. It simplifies the process of creating complex visualizations by offering built-in themes and color palettes.

The Python package is well-suited for visualizing data frames and arrays, integrating seamlessly with Pandas to handle data efficiently. Its key features include the ability to create a variety of plot types, such as heatmaps, violin plots, and pair plots, which are useful for exploring relationships in data.

Seaborn also supports complex visualizations like multi-plot grids, allowing users to create intricate layouts with minimal code. Its integration with Matplotlib ensures that users can customize plots extensively, combining the simplicity of Seaborn with the flexibility of Matplotlib to produce detailed and customized visualizations.

Also, read about Large Language Models and their Applications

6. Plotly

Plotly is a useful Python library for data analysis and presentation through interactive and dynamic visualizations. It allows users to create interactive plots that can be embedded in web applications, shared online, or used in Jupyter notebooks.

It supports diverse chart types, including line plots, scatter plots, bar charts, and more complex visualizations like 3D plots and geographic maps. Plotly’s interactivity enables users to hover over data points to see details, zoom in and out, and even update plots in real time, enhancing the user experience and making data exploration more intuitive.

Build a Recommendation System using Python easily

It enables users to produce high-quality, publication-ready graphics with minimal code with a user-friendly interface. It also integrates well with other Python libraries such as Pandas and NumPy.

Plotly also supports a wide array of customization options, enabling users to tailor the appearance of their plots to meet specific needs. Its integration with Dash, a web application framework, allows users to build interactive web applications with ease, making it a versatile tool for both data visualization and application development.

Machine Learning and Deep Learning

7. Scikit-learn

Scikit-learn is a Python library for machine learning with simple and efficient tools for data mining and analysis. Built on top of NumPy, SciPy, and Matplotlib, it provides a robust framework for implementing a wide range of machine-learning algorithms.

It is known for ease of use and clean API, making it accessible for both beginners and experienced practitioners. It supports various supervised and unsupervised learning algorithms, including classification, regression, clustering, and dimensionality reduction, allowing users to tackle diverse ML tasks.

Understand Machine Learning using Python in Cloud

Its comprehensive suite of tools for model selection, evaluation, and validation, such as cross-validation and grid search helps in optimizing model performance. It also offers utilities for data preprocessing, feature extraction, and transformation, ensuring that data is ready for analysis.

While Scikit-learn is primarily focused on traditional ML techniques, it can be integrated with deep learning frameworks like TensorFlow and PyTorch for more advanced applications. This makes Scikit-learn a versatile tool in the ML ecosystem, suitable for a range of projects from academic research to industry applications.

8. TensorFlow

TensorFlow is an open-source software library developed by Google dataflow and differentiable programming across various tasks. It is designed to be highly scalable, allowing it to run efficiently on multiple CPUs and GPUs, making it suitable for both small-scale and large-scale machine learning tasks.

It supports a wide array of neural network architectures and offers high-level APIs, such as Keras, to simplify the process of building and training models. This flexibility and robust performance make TensorFlow a popular choice for both academic research and industrial applications.

One of the key strengths of TensorFlow is its ability to handle complex computations and its support for distributed computing. It also provides tools for deploying models on various platforms, including mobile and edge devices, through TensorFlow Lite.

Moreover, TensorFlow’s community and extensive documentation offer valuable resources for developers and researchers, fostering innovation and collaboration. Its versatility and comprehensive features make TensorFlow an essential tool in the machine learning and deep learning landscape.

9. PyTorch

PyTorch is an open-source library developed by Facebook’s AI Research lab. It is known for dynamic computation graphs that allow developers to modify the network architecture, making it highly flexible for experimentation. This feature is especially beneficial for researchers who need to test new ideas and algorithms quickly.

It integrates seamlessly with Python for a natural and easy-to-use interface that appeals to developers familiar with the language. PyTorch also offers robust support for distributed training, enabling the efficient training of large models across multiple GPUs.

Through frameworks like TorchScript, it enables users to deploy models on various platforms like mobile devices. Its strong community support and extensive documentation make it accessible for both beginners and experienced developers.

Explore more about Retrieval Augmented Generation

Natural Language Processing (NLP)

10. NLTK

NLTK, or the Natural Language Toolkit, is a comprehensive Python library designed for working with human language data. It provides a range of tools and resources, including text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.

It also includes a vast collection of corpora and lexical resources, such as WordNet, which are essential for linguistic research and development. Its modular design allows users to easily access and implement various NLP techniques, making it an excellent choice for both educational and research purposes.

Explore Natural Language Processing and its Applications

Beyond its extensive functionality, NLTK is known for its ease of use and well-documented tutorials, helping newcomers to grasp the basics of NLP. The library’s interactive features, such as graphical demonstrations and sample datasets, provide a hands-on learning experience.

11. SpaCy

SpaCy is a powerful Python library designed for production use, offering fast and accurate processing of large volumes of text. It offers features like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more.

Unlike some other NLP libraries, SpaCy is optimized for performance, making it ideal for real-time applications and large-scale data processing. Its pre-trained models support multiple languages, allowing developers to easily implement multilingual NLP solutions.

One of SpaCy’s standout features is its focus on providing a seamless and intuitive user experience. It offers a straightforward API that simplifies the integration of NLP capabilities into applications. It also supports deep learning workflows, enabling users to train custom models using frameworks like TensorFlow and PyTorch.

SpaCy includes tools for visualizing linguistic annotations and dependencies, which can be invaluable for understanding and debugging NLP models. With its robust architecture and active community, it is a popular choice for both academic research and commercial projects in the field of NLP.

Web Scraping

12. BeautifulSoup

BeautifulSoup is a Python library designed for web scraping purposes, allowing developers to extract data from HTML and XML files with ease. It provides simple methods to navigate, search, and modify the parse tree, making it an excellent tool for handling web page data.

It is useful for parsing poorly-formed or complex HTML documents, as it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. This flexibility ensures that developers can work with a wide range of web content without worrying about encoding issues.

BeautifulSoup integrates seamlessly with other Python libraries like requests, which are used to fetch web pages. This combination allows developers to efficiently scrape and process web data in a streamlined workflow.

The library’s syntax and comprehensive documentation make it accessible to both beginners and experienced programmers. Its ability to handle various parsing tasks, such as extracting specific tags, attributes, or text, makes it a versatile tool for projects ranging from data mining to web data analysis.

Bonus Additions to the List!

13. SQLAlchemy

SQLAlchemy is a Python library that provides a set of tools for working with databases using an Object Relational Mapping (ORM) approach. It allows developers to interact with databases using Python objects, making database operations more intuitive and reducing the need for writing raw SQL queries.

SQLAlchemy supports a wide range of database backends, including SQLite, PostgreSQL, MySQL, and Oracle, among others. Its ORM layer enables developers to define database schemas as Python classes, facilitating seamless integration between the application code and the database.

It offers a powerful Core system for those who prefer to work with SQL directly. This system provides a high-level SQL expression language for developers to construct complex queries. Its flexibility and extensive feature set make it suitable for both small-scale applications and large enterprise systems.

Learn how to evaluate time series in Python model predictions

14. OpenCV

OpenCV, short for Open Source Computer Vision Library, is a Python package for computer vision and image processing tasks. Originally developed by Intel, it was later supported by Willow Garage and is now maintained by Itseez. OpenCV is available for C++, Python, and Java.

It enables developers to perform operations on images and videos, such as filtering, transformation, and feature detection.

It supports a variety of image formats and is capable of handling real-time video capture and processing, making it an essential tool for applications in robotics, surveillance, and augmented reality. Its extensive functionality allows developers to implement complex algorithms for tasks like object detection, facial recognition, and motion tracking.

OpenCV also integrates well with other libraries and frameworks, such as NumPy, enhancing its performance and flexibility. This allows for efficient manipulation of image data using array operations.

Moreover, its open-source nature and active community support ensure continuous updates and improvements, making it a reliable choice for both academic research and industrial applications.

15. urllib

Urllib is a module in the standard Python library that provides a set of simple, high-level functions for working with URLs and web protocols. It allows users to open and read URLs, download data from the web, and interact with web services.

It supports various protocols, including HTTP, HTTPS, and FTP, enabling seamless communication with web servers. The library is particularly useful for tasks such as web scraping, data retrieval, and interacting with RESTful APIs.

The urllib package is divided into several modules, each serving a specific purpose. For instance:

urllib.request is used for opening and reading URLs
urllib.parse provides functions for parsing and manipulating URL strings
urllib.error handles exceptions related to URL operations
urllib.robotparser helps in parsing robots.txt files to determine if a web crawler can access a particular site

With its comprehensive functionality and ease of use, urllib is a valuable tool for developers looking to perform network-related tasks in Python, whether for simple data fetching or more complex web interactions.

Explore the top 6 Python libraries for data science

What is the Standard vs Third-Party Packages Debate?

In the Python ecosystem, packages are categorized into two main types: standard and third-party. Each serves a unique purpose and offers distinct advantages to developers. Before we dig deeper into the debate, let’s understand what is meant by these two types of packages.

What are Standard Packages?

These are the packages found in Python’s standard library and maintained by the Python Software Foundation. These are also included with every Python installation, providing essential functionalities like file I/O, system calls, and data manipulation. These are reliable, well-documented, and ensure compatibility across different versions.

What are Third-Party Packages?

These refer to packages developed by the Python community and are not a part of the standard library. They are often available through package managers like pip or repositories like Python Package Index (PyPI). These packages cover a wide range of functionalities.

Key Points of the Debate

While we understand the main difference between standard and third-party packages, their comparison can be analyzed from three main aspects.

Scope vs. Stability: Standard library packages excel in providing stable, reliable, and broadly applicable functionality for common tasks (e.g., file handling, basic math). However, for highly specialized requirements, third-party packages provide superior solutions, but at the cost of additional risk.
Innovation vs. Trust: Third-party packages are the backbone of innovation in Python, especially in fast-moving fields like AI and web development. They provide developers with the latest features and tools. However, this innovation comes with the downside of requiring extra caution for security and quality.
Ease of Use: For beginners, Python’s standard library is the most straightforward way to start, providing everything needed for basic projects. For more complex or specialized applications, developers tend to rely on third-party packages with additional setup but greater flexibility and power.

It is crucial to understand these differences as you choose a package for your project. As for the choice you make, it often depends on the project’s requirements, but in many cases, a combination of both is used to access the full potential of Python.

Wrapping up

In conclusion, these Python packages are some of the most popular and widely used libraries in the Python data science ecosystem. They provide powerful and flexible tools for data manipulation, analysis, and visualization, and are essential for aspiring and practicing data scientists.

With the help of these Python packages, data scientists can easily perform complex data analysis and machine learning tasks, and create beautiful and informative visualizations.

Learn how to build AI-based chatbots in Python

If you want to learn more about data science and how to use these Python packages, we recommend checking out Data Science Dojo’s Python for Data Science course, which provides a comprehensive introduction to Python and its data science ecosystem.

December 13, 2024

Programming

Data Science Dojo Staff

Demystifying heatmaps: A comprehensive beginner’s guide

Heatmaps are a type of data visualization that uses color to represent data values. For the unversed,
data visualization is the process of representing data in a visual format. This can be done through charts, graphs, maps, and other visual representations.

What are heatmaps?

A heatmap is a graphical representation of data in which values are represented as colors on a two-dimensional plane. Typically, heatmaps are used to visualize data in a way that makes it easy to identify patterns and trends. 

Heatmaps are often used in fields such as data analysis, biology, and finance. In data analysis, heatmaps are used to visualize patterns in large datasets, such as website traffic or user behavior.

In biology, heatmaps are used to visualize gene expression data or protein-protein interaction networks. In finance, heatmaps are used to visualize stock market trends and performance.  This diagram shows a random 10×10 heatmap using `NumPy` and `Matplotlib`.  

Advantages of heatmaps

Visual representation: Heatmaps provide an easily understandable visual representation of data, enabling quick interpretation of patterns and trends through color-coded values.
Large data visualization: They excel at visualizing large datasets, simplifying complex information and facilitating analysis.
Comparative analysis: They allow for easy comparison of different data sets, highlighting differences and similarities between, for example, website traffic across pages or time periods.
Customizability: They can be tailored to emphasize specific values or ranges, enabling focused examination of critical information.
User-friendly: They are intuitive and accessible, making them valuable across various fields, from scientific research to business analytics.
Interactivity: Interactive features like zooming, hover-over details, and data filtering enhance the usability of heatmaps.
Effective communication: They offer a concise and clear means of presenting complex information, enabling effective communication of insights to stakeholders.

Creating heatmaps using “Matplotlib”  

We can create heatmaps using Matplotlib by following the aforementioned steps: 

To begin, we import the necessary libraries, namely Matplotlib and NumPy.
Following that, we define our data as a 3×3 NumPy array.
Afterward, we utilize Matplotlib’s imshow function to create a heatmap, specifying the color map as ‘coolwarm’.
To enhance the visualization, we incorporate a color bar by employing Matplotlib’s colorbar function.
Subsequently, we set the title and axis labels using Matplotlib’s set_title, set_xlabel, and set_ylabel functions.
Lastly, we display the plot using the show function.

Bottom line: This will create a simple 3×3 heatmap with a color bar, title, and axis labels. 

Customizations available in Matplotlib for heatmaps  

Following is a list of the customizations available for Heatmaps in Matplotlib: 

Changing the color map  
Changing the axis labels  
Changing the title  
Adding a color bar  
Adjusting the size and aspect ratio  
Setting the minimum and maximum values 
Adding annotations  
Adjusting the cell size 
Masking certain cells  
Adding borders

 These are just a few examples of the many customizations that can be done in heatmaps using Matplotlib. Now, let’s see all the customizations being implemented in a single example code snippet: 

In this example, the heatmap is customized in the following ways: 

Set the colormap to ‘coolwarm’ 
Set the minimum and maximum values of the colormap using `vmin` and `vmax` 
Set the size of the figure using `figsize` 
Set the extent of the heatmap using `extent` 
Set the linewidth of the heatmap using `linewidth` 
Add a colorbar to the figure using the `colorbar` 
Set the title, xlabel, and ylabel using `set_title`, `set_xlabel`, and `set_ylabel`, respectively 
Add annotations to the heatmap using `text` 
Mask certain cells in the heatmap by setting their values to `np.nan` 
Show the frame around the heatmap using `set_frame_on(True)`

Creating heatmaps using “Seaborn” 

We can create heatmaps using Seaborn by following the aforementioned steps: 

First, we import the necessary libraries: seaborn, matplotlib, and numpy.
Next, we generate a random 10×10 matrix of numbers using NumPy’s rand function and store it in the variable data.
We create a heatmap by using Seaborn’s heatmap function. It takes the data as input and specifies the color map using the cmap parameter. Additionally, we set the annot parameter to True to display the values in each cell of the heatmap.
To enhance the plot, we add a title, x-label, and y-label using Matplotlib’s title, xlabel, and ylabel functions.
Finally, we display the plot using the show function from Matplotlib.

Overall, the code generates a random heatmap using Seaborn with a color map, annotations, and labels using Matplotlib. 

Customizations available in Seaborn for heatmaps:

Following is a list of the customizations available for Heatmaps in Seaborn: 

Change the color map  
Add annotations to the heatmap cells 
Adjust the size of the heatmap  
Display the actual numerical values of the data in each cell of the heatmap 
Add a color bar to the side of the heatmap 
Change the font size of the heatmap  
Adjust the spacing between cells  
Customize the x-axis and y-axis labels 
Rotate the x-axis and y-axis tick labels

Now, let’s see all the customizations being implemented in a single example code snippet:

In this example, the heatmap is customized in the following ways: 

Set the color palette to “Blues”. 
Add annotations with a font size of 10. 
Set the x and y labels and adjust font size. 
Set the title of the heatmap. 
Adjust the figure size. 
Show the heatmap plot.

Limitations of heatmaps:

Heatmaps are a useful visualization tool for exploring and analyzing data, but they do have some limitations that you should be aware of: 

Limited to two-dimensional data: They are designed to visualize two-dimensional data, which means that they are not suitable for visualizing higher-dimensional data. 
Limited to continuous data: They are best suited for continuous data, such as numerical values, as they rely on a color scale to convey the information. Categorical or binary data may not be as effectively visualized using heatmaps. 
May be affected by color blindness: Some people are color blind, which means that they may have difficulty distinguishing between certain colors. This can make it difficult for them to interpret the information in a heatmap.

Can be sensitive to scaling: The color mapping in a heatmap is sensitive to the scale of the data being visualized. Therefore, it is important to carefully choose the color scale and to consider normalizing or standardizing the data to ensure that the heatmap accurately represents the underlying data. 
Can be misleading: They can be visually appealing and highlight patterns in the data, but they can also be misleading if not carefully designed. For example, choosing a poor color scale or omitting important data points can distort the visual representation of the data.

It is important to consider these limitations when deciding whether or not to use a heatmap for visualizing your data. 

Conclusion

Heatmaps are powerful tools for visualizing data patterns and trends. They find applications in various fields, enabling easy interpretation and analysis of large datasets. Matplotlib and Seaborn offer flexible options to create and customize heatmaps. However, it’s essential to understand their limitations, such as two-dimensional data representation and sensitivity to color perception. By considering these factors, heatmaps can be a valuable asset in gaining insights and communicating information effectively.

Written by Safia Faiz

June 12, 2023

Data Visualization

Data Science Dojo Staff

Mastering Histograms: A Beginner’s Comprehensive Guide

Histograms are a fundamental tool in data visualization, offering a simple yet powerful way to understand the distribution of data. Whether you’re new to data analysis or looking to sharpen your skills, histograms are a crucial tool for summarizing and visualizing data points.

They allow you to easily spot trends, patterns, and outliers in your dataset. In this comprehensive guide, we’ll explore what histograms are, why they are important, and how to create and interpret them. By the end of this guide, you’ll be equipped with the knowledge to use histograms effectively in your own data analysis projects.

Defining Histograms

A histogram is a type of graphical representation of data that shows the distribution of numerical values. It consists of a set of vertical bars, where each bar represents a range of values, and the height of the bar indicates the frequency or count of data points falling within that range.  

Histograms are commonly used in statistics and data analysis to visualize the shape of a data set and to identify patterns, such as the presence of outliers or skewness. They are also useful for comparing the distribution of different data sets or for identifying trends over time. 

The picture above shows how 1000 random data points from a normal distribution with a mean of 0 and standard deviation of 1 are plotted in a histogram with 30 bins and black edges.  

 Advantages of Histograms

Histograms are more than just simple bar charts—they are powerful tools that help analysts make sense of complex data. From spotting trends to identifying outliers, histograms offer several advantages that make them essential in data analysis and visualization. Let’s explore some key benefits of using histograms.

Visual Representation

Histograms offer a clear visual representation of data distribution, allowing us to quickly observe the frequency of data points across different ranges or bins. This visual approach makes it easier to spot trends, patterns, and even anomalies that might not be immediately evident in raw data. Whether you’re looking for skewness, symmetry, or multimodal distributions, histograms provide a straightforward way to understand the overall structure of your data.

Easy Interpretation

One of the main strengths of histograms is their simplicity. Even non-experts can easily interpret them, as the bar chart format intuitively shows how frequently data points fall within specific ranges. The height of each bar represents the frequency or proportion of data points in each bin, making it accessible for anyone to understand the distribution without needing advanced statistical knowledge.

Outlier Identification

Histograms are especially useful for identifying outliers or extreme values. These are typically represented by individual bars that stand apart from the rest, often appearing as isolated spikes on one end of the histogram. Identifying outliers can be crucial for understanding data anomalies or errors and can inform decisions regarding data cleaning or further investigation.

Comparison of Data Sets

Another powerful feature of histograms is their ability to compare the distributions of multiple data sets. By overlaying or side-by-side plotting histograms of different datasets, you can quickly identify similarities and differences in their distributions. This helps to compare trends across different groups, such as customer segments, time periods, or product categories, enabling more insightful decision-making.

Data Summarization

Histograms are an excellent tool for summarizing large datasets. Instead of getting lost in the raw numbers, histograms condense the information into digestible features like the overall shape, center (e.g., mean or median), and spread (e.g., range or standard deviation) of the distribution. This gives a quick snapshot of the data’s most important characteristics, helping analysts and decision-makers grasp the key points without needing to process extensive raw data.

 Creating a Histogram Using Matplotlib Library

We can create histograms using Matplotlib by following a series of steps. Following the import statements of the libraries, the code generates a set of 1000 random data points from a normal distribution with a mean of 0 and standard deviation of 1, using the `numpy.random.normal()` function.  

The plt.hist() function in Python is a powerful tool for creating histograms. By providing the data, number of bins, bar color, and edge color as input, this function generates a histogram plot.
To enhance the visualization, the xlabel(), ylabel(), and title() functions are utilized to add labels to the x and y axes, as well as a title to the plot.
Finally, the show() function is employed to display the histogram on the screen, allowing for detailed analysis and interpretation.

Overall, this code generates a histogram plot of a set of random data points from a normal distribution, with 30 bins, blue bars, black edges, labeled axes, and a title. The histogram shows the frequency distribution of the data, with a bell-shaped curve indicating the normal distribution.    

Customizations Available in Matplotlib for Histograms   

In Matplotlib, there are several customizations available for histograms. These include:

Adjusting the number of bins. 
Changing the color of the bars. 
Changing the opacity of the bars. 
Changing the edge color of the bars. 
Adding a grid to the plot. 
Adding labels and a title to the plot. 
Adding a cumulative density function (CDF) line. 
Changing the range of the x-axis. 
Adding a rug plot.

Now, let’s see all the customizations being implemented in a single example code snippet: 

In this example, the histogram is customized in the following ways: 

The number of bins is set to `20` using the `bins` parameter. 
The transparency of the bars is set to `0.5` using the `alpha` parameter. 
The edge color of the bars is set to `black` using the `edgecolor` parameter. 
The color of the bars is set to `green` using the `color` parameter. 
The range of the x-axis is set to `(-3, 3)` using the `range` parameter. 
The y-axis is normalized to show density using the `density` parameter. 
Labels and a title are added to the plot using the `xlabel()`, `ylabel()`, and `title()` functions. 
A grid is added to the plot using the `grid` function. 
A cumulative density function (CDF) line is added to the plot using the `cumulative` parameter and `histtype=’step’`. 
A rug plot showing individual data points is added to the plot using the `plot` function.

Creating a Histogram Using ‘Seaborn’ Library

We can create histograms using Seaborn by following the steps: 

First and foremost, importing the libraries: `NumPy`, `Seaborn`, `Matplotlib`, and `Pandas`. After importing the libraries, a toy dataset is created using `pd.DataFrame()` of 1000 samples that are drawn from a normal distribution with mean 0 and standard deviation 1 using NumPy’s `random.normal()` function. 
We use Seaborn’s `histplot()` function to plot a histogram of the ‘data’ column of the DataFrame with `20` bins and a `blue` color. 
The plot is customized by adding labels, and a title, and changing the style to a white grid using the `set_style()` function. 
Finally, we display the plot using the `show()` function from matplotlib.

Overall, this code snippet demonstrates how to use Seaborn to plot a histogram of a dataset and customize the appearance of the plot quickly and easily. 

 Customizations Available in Seaborn for Histograms

Following is a list of the customizations available for Histograms in Seaborn: 

Change the number of bins. 
Change the color of the bars. 
Change the color of the edges of the bars. 
Overlay a density plot on the histogram. 
Change the bandwidth of the density plot. 
Change the type of histogram to cumulative. 
Change the orientation of the histogram to horizontal. 
Change the scale of the y-axis to logarithmic.

Now, let’s see all these customizations being implemented here as well, in a single example code snippet: 

 In this example, we have done the following customizations: 

Set the number of bins to `20`. 
Set the color of the bars to `green`. 
Set the `edgecolor` of the bars to `black`. 
Added a density plot overlaid on top of the histogram using the `kde` parameter set to `True`. 
Set the bandwidth of the density plot to `0.5` using the `kde_kws` parameter. 
Set the histogram to be cumulative using the `cumulative` parameter. 
Set the y-axis scale to logarithmic using the `log_scale` parameter. 
Set the title of the plot to ‘Customized Histogram’. 
Set the x-axis label to ‘Values’. 
Set the y-axis label to ‘Frequency’.

Limitations of Histograms

 Histograms are widely used for visualizing the distribution of data, but they also have limitations that should be considered when interpreting them. These limitations are jotted down below: 

They can be sensitive to the choice of bin size or the number of bins, which can affect the interpretation of the distribution. Choosing too few bins can result in a loss of information while choosing too many bins can create artificial patterns and noise. 
They can be influenced by outliers, which can skew the distribution or make it difficult to see patterns in the data. 
They are typically univariate and cannot capture relationships between multiple variables or dimensions of data. 
Histograms assume that the data is continuous and does not work well with categorical data or data with large gaps between values. 
They can be affected by the choice of starting and ending points, which can affect the interpretation of the distribution. 
They do not provide information on the shape of the distribution beyond the binning intervals.

 It’s important to consider these limitations when using histograms and to use them in conjunction with other visualization techniques to gain a more complete understanding of the data. 

 Wrapping Up

In conclusion, histograms are powerful tools for visualizing the distribution of data. They provide valuable insights into the shape, patterns, and outliers present in a dataset. With their simplicity and effectiveness, histograms offer a convenient way to summarize and interpret large amounts of data.

By customizing various aspects such as the number of bins, colors, and labels, you can tailor the histogram to your specific needs and effectively communicate your findings. So, embrace the power of histograms and unlock a deeper understanding of your data.

Written by Safia Faiz

May 23, 2023

Data Visualization

Data Science Dojo Staff

Line Plots for Beginners: Charting Success

Line plots, also known as line graphs, are a fundamental and widely used type of chart that visually represents data points connected by straight lines. They are particularly effective for illustrating trends, patterns, and changes in data over time or across different categories. By connecting individual data points, line plots provide a clear and intuitive way to observe relationships, fluctuations, and overall trends within a dataset.

One of the key advantages of line plots is their simplicity and ease of interpretation. Even for those without a background in data analysis, these charts offer an accessible way to grasp complex information at a glance.

Additionally, line plots are highly versatile, making them suitable for a wide range of applications, including business performance tracking, scientific research, financial analysis, and more. They can effectively visualize continuous data, highlight seasonal variations, compare multiple datasets, and identify long-term trends, making them indispensable tools in both professional and educational settings.

Advantages of Line Plots

Line plots can be useful for visualizing many different types of data, including:

Time series data visualization: They are useful for visualizing time series data, which refers to data that is collected over time. By plotting data points on a line, trends and patterns over time can be easily identified and communicated.
Continuous data representation: They can be used to represent continuous data, which is data that can take on any value within a range. By plotting the values along a continuous scale, the line plot can show the progression of the data and highlight any trends.
Discrete data representation: They can also be used to represent discrete data, which is data that can only take on certain values. By plotting the values as individual points along the x-axis, the line plot can show how the values are distributed and any outliers.
Easy to understand: They are simple and easy to read, making them an effective way to communicate trends in data to a wide audience. The basic format of a line plot, with data points connected by a line, is intuitive and requires little explanation.
Versatility: They can be used to visualize a wide variety of data types, including both quantitative and qualitative data. They can also be customized to suit different needs, such as by changing the scale, adding labels or annotations, and adjusting the color scheme.
Identifying patterns and trends: They can be useful for identifying patterns and trends in data, such as upward or downward trends, cyclical patterns, or seasonal trends. By visually representing the data in a line plot, it becomes easier to spot trends and make predictions about future outcomes.

Creating line plots:

When it comes to creating line plots in Python, you have two primary libraries to choose from: `Matplotlib` and `Seaborn`.

Using “Matplotlib”:

`Matplotlib` is a highly customizable library that can produce a wide range of plots, including line plots. With Matplotlib, you can specify the appearance of your line plots using a variety of options such as line style, color, marker, and label.

1. “Single” Line Plot:

A single-line plot is used to display the relationship between two variables, where one variable is plotted on the x-axis and the other on the y-axis. This type of plot is best used for displaying trends over time, as it allows you to see how one variable changes in response to the other over a continuous period.

In this example, two lists named x and y are defined to hold the data points to be plotted. The plt.plot() function is used to plot the points on a line graph, and plt.show() function is used to display the plot.

This creates a simple line plot with the x-axis displaying the values [1, 2, 3, 4, 5] and the y-axis displaying the values [2, 4, 6, 8, 10].

Also explore: Data Visualization Tools

2. “Multiple” Lines on One Plot:

A plot with multiple lines is useful for comparing trends between different groups or categories. Multiple lines can be plotted on the same graph using different colors. This type of plot is particularly useful for analyzing data with multiple variables or for comparing data across different groups.

In this example, we have two lists y1 and y2 containing data points for two different lines. We use the plt.plot() function twice to plot both lines on the same graph. We add a legend using the plt.legend() function to distinguish between the two lines.

The legend is created by providing a list of labels for each line, and the loc parameter is used to position the legend on the graph. Additionally, we add x-axis and y-axis labels and a title to the graph using the plt.xlabel(), plt.ylabel(), and plt.title() functions.

3. “Customized” Line Plot:

`Matplotlib` is a popular data visualization library in Python that allows you to create both single-line plots and plots with multiple lines. With `Matplotlib`, you can customize your plots with various colors, line styles, and markers to make them more visually appealing and informative.

In this code snippet, x and y lists are defined as before, and then a line plot is created using the plt.plot() function with customized settings.

The line color is set to green using the color parameter, and the line style is set to dashed using the linestyle parameter. The linewidth parameter is set to 2 to make the line thicker.

Markers are added to each data point using the marker parameter, which is set to 'o' to create circular markers. The face color of the markers is set to blue using the markerfacecolor parameter, and the size of the markers is set to 8 using the markersize parameter.

Finally, x-axis and y-axis labels are added to the plot using the plt.xlabel() and plt.ylabel() functions, and a title is added using the plt.title() function.

4. Adding a Regression Line:

It is possible to plot a regression line using the `Matplotlib` library in Python. Although `Seaborn` offers convenient functions for regression plot, `Matplotlib` has the capability to create various types of visualizations, including regression plots.

This code begins by importing the necessary libraries, numpy and matplotlib.pyplot.
Next, it generates a set of 100 random data points and stores them in the variables x and y.
A scatter plot is created using the scatter function from matplotlib, which takes x and y as inputs.
To fit a linear regression line to the data points, the polyfit function from numpy is used to calculate the coefficients of the line.
The plot function from matplotlib is then used to plot the regression line using the coefficients m and b along with x and m*x+b.
To improve the readability of the plot, the title, xlabel, and ylabel functions are used to set the title and axis labels.
Finally, the show function is called to display the plot on the screen.

Using “Seaborn”:

`Seaborn` is a library that specializes in statistical visualization. Seaborn provides several types of line plots, including those with regression lines, confidence intervals, and error bars.

1. “Single” Line Plot:

Visualizing data with a single line plot and multiple lines on one plot using `Seaborn` are two ways of representing data in a graphical format. A single-line plot is useful when the data being presented involves only one variable, such as time series data. It allows for the visualization of trends and patterns over time, making it an effective tool for analyzing data.

The code provided loads the tips dataset from Seaborn library and generates a basic line plot. The total_bill variable is plotted on the x-axis and the tip variable is plotted on the y-axis.

2. “Multiple” Lines on One Plot:

When there are multiple variables involved, a line plot with multiple lines using `Seaborn` can be more effective. This method allows for the comparison of different variables on the same graph, making it easier to identify patterns and relationships between them.

The code shown loads the exercise dataset from Seaborn and generates a line plot using time on the x-axis and pulse on the y-axis. The hue parameter is used to group the data by the kind variable, which creates multiple lines on the plot, with each line representing a different exercise activity.

3. “Customized” Line Plot:

`Seaborn` also provides various customization options, including color schemes and markers, which can be used to make the graph more visually appealing and informative.

The code loads the fmri dataset from Seaborn and creates a line plot with timepoint on the x-axis and signal on the y-axis. The hue parameter is used to group the data by the region variable, while the style parameter is used to group the data by the event variable.

Moreover, the markers parameter is set to True, which causes the plot to display markers at each data point, while dashes parameter is set to False, causing the plot to display solid lines. These parameter settings are useful for visualizing the data clearly and making it easier to interpret.

4. Adding a Regression Line:

`Seaborn` provides a wide range of tools to create stunning and informative plots. One of its key features is the ability to add a regression line to a plot, which can help to identify the relationship between two variables and make predictions based on that relationship.

The code above loads the anscombe dataset from Seaborn, which contains four different datasets. It then creates a set of line plots with x on the x-axis and y on the y-axis, one for each dataset.

The col parameter is used to create a separate plot for each dataset, which means that each dataset will have its own subplot in the figure. The hue parameter is used to color the lines by the dataset, so that each dataset’s line will be a different color.

The lmplot() function is used to add a regression line to each plot. This line represents the linear relationship between x and y in the dataset.

The other parameters, such as col_wrap, ci, palette, and scatter_kws, are used to customize the appearance of the plot. For example, col_wrap specifies how many subplots should be shown per row, ci controls the confidence interval for the regression line, palette specifies the color palette to use, and scatter_kws specifies additional keyword arguments for the scatter plot.

Limitations of Line Plots:

Line plots have some limitations that need to be considered when using them for data visualization. These include:

Limited data types: Line plots are not suitable for all types of data. For example, they may not work well with data that has multiple categories or data with nonlinear relationships.
Can be misleading: If the scale of the y-axis is not carefully chosen, line plots can be misleading. It is important to choose appropriate scales to avoid misinterpretation of the data.

You might also like: Business Analytics 101

Lack of context: Line plots only show the relationship between two variables, and do not provide context about other factors that may be influencing the data.
Limited visual impact: Line plots may not be as visually impactful as other types of data visualizations, such as bar charts or scatter plots.
Difficulty comparing multiple datasets: When using multiple line plots to compare different datasets, it can be difficult to visually compare the lines if they are not plotted on the same scale or with the same y-axis limits

Wrapping Up

In conclusion, line plots are a useful tool in data analysis and communication. They are easy to understand, versatile, and can visualize different types of data. Python provides two primary libraries, Matplotlib and Seaborn, for creating line plots. Both libraries offer different features and customization options. By providing examples of creating line plots using both libraries, we hope this article has been helpful in illustrating how to create line plots effectively.

April 28, 2023

Data Visualization

Ali Haider Shalwani

12 Powerful GitHub Repositories for Data Science and Engineering

GitHub is a goldmine for developers, data scientists, and engineers looking to sharpen their skills and explore new technologies. With thousands of open-source repositories available, it can be overwhelming to find the most valuable ones.

In this blog, we highlight some of the best trending GitHub repositories in data science, analytics, and engineering. Whether you’re looking for machine learning frameworks, data visualization tools, or coding resources, these repositories can help you learn faster, work smarter, and stay ahead in the tech world. Let’s dive in!

What is GitHub?

Before exploring the top repositories, we should first understand what GitHub is and why it’s so important for developers and data scientists.

GitHub is an online platform that allows people to store, share, and collaborate on code. It works as a version control system, meaning you can track changes, revert to previous versions, and work on projects with teams seamlessly. Built on Git, an open-source version control tool, GitHub makes it easier to manage coding projects—whether you’re working alone or with a team.

One of the best things about GitHub is its massive collection of open-source repositories. Developers from around the world share their code, tools, and frameworks, making it a go-to platform for learning, innovation, and collaboration. Whether you’re looking for AI models, data science projects, or web development frameworks, GitHub has something for everyone.

Also explore: Kaggle competitions

Best GitHub Repositories to Stay Ahead of the Tech Curve

Now that we understand what GitHub is and why it’s a goldmine for developers, let’s dive into the repositories that can truly make a difference. The right repositories can save time, improve coding efficiency, and introduce you to cutting-edge technologies. Whether you’re looking for AI frameworks, automation tools, or coding best practices, these repositories will help you stay ahead of the tech curve and keep your skills sharp.

1. Scikit-learn: A Python library for machine learning built on top of NumPy, SciPy, and matplotlib. It provides a range of algorithms for classification, regression, clustering, and more.  

Link to the repository: https://github.com/scikit-learn/scikit-learn 

2.TensorFlow: An open-source machine learning library developed by Google Brain Team. TensorFlow is used for numerical computation using data flow graphs.  

Link to the repository: https://github.com/tensorflow/tensorflow 

3.Keras: A deep learning library for Python that provides a user-friendly interface for building neural networks. It can run on top of TensorFlow, Theano, or CNTK.  

Link to the repository: https://github.com/keras-team/keras 

4.Pandas: A Python library for data manipulation and analysis. It provides a range of data structures for efficient data handling and analysis.  

Link to the repository: https://github.com/pandas-dev/pandas 

5.PyTorch: An open-source machine learning library developed by Facebook’s AI research group. PyTorch provides tensor computation and deep neural networks on a GPU.  

Link to the repository: https://github.com/pytorch/pytorch 

6.Apache Spark: An open-source distributed computing system used for big data processing. It can be used with a range of programming languages such as Python, R, and Java.  

Link to the repository: https://github.com/apache/spark 

7.FastAPI: A modern web framework for building APIs with Python. It is designed for high performance, asynchronous programming, and easy integration with other libraries.  

Link to the repository: https://github.com/tiangolo/fastapi 

8.Dask: A flexible parallel computing library for analytic computing in Python. It provides dynamic task scheduling and efficient memory management.  

Link to the repository: https://github.com/dask/dask 

9.Matplotlib: A Python plotting library that provides a range of 2D plotting features. It can be used for creating interactive visualizations, animations, and more.  

Link to the repository: https://github.com/matplotlib/matplotlib 

10.Seaborn: A Python data visualization library based on matplotlib. It provides a range of statistical graphics and visualization tools.  

Link to the repository: https://github.com/mwaskom/seaborn 

11.NumPy: A Python library for numerical computing that provides a range of array and matrix operations. It is used extensively in scientific computing and data analysis.  

Link to the repository: https://github.com/numpy/numpy 

12.Tidyverse: A collection of R packages for data manipulation, visualization, and analysis. It includes popular packages such as ggplot2, dplyr, and tidyr.  

Link to the repository: https://github.com/tidyverse/tidyverse 

How to Contribute to GitHub Repositories

Now that you know the value of GitHub and some of the best repositories to explore, the next step is learning how to contribute. Open-source projects thrive on collaboration, and contributing to them is a great way to improve your coding skills, gain real-world experience, and connect with the developer community. Here’s a step-by-step guide to getting started:

1. Find a Repository to Contribute To

Look for repositories that align with your interests and expertise. You can start by browsing GitHub’s Explore section or checking issues labeled “good first issue” or “help wanted” in open-source projects.

2. Fork the Repository

Forking creates a copy of the original repository in your own GitHub account. This allows you to make changes without affecting the original project. To do this, simply click the Fork button on the repository page, and a copy will appear in your GitHub profile.

3. Clone the Repository

Once you have forked the repository, you need to download it to your local computer so you can work on it. This process is called cloning. It allows you to edit files and test changes before submitting them back to the original project.

4. Create a New Branch

Before making any changes, it’s best practice to create a new branch. This keeps your updates separate from the main code, making it easier to manage and review. Naming your branch based on the feature or fix you’re working on helps maintain organization.

5. Make Your Changes

Now, you can edit the code, fix bugs, or add new features. Be sure to follow any contribution guidelines provided in the repository, write clear code, and test your changes thoroughly.

You might also like: Kaggle Data Scientists: Insights & Tips

6. Commit Your Changes

Once you’re satisfied with your updates, you need to save them. In GitHub, this process is called committing. A commit is like a snapshot of your work, and it should include a short, meaningful message explaining what changes you made.

7. Push Your Changes to GitHub

After committing your updates, you need to send them back to your forked repository on GitHub. This ensures your changes are saved online and can be accessed when submitting a contribution.

8. Create a Pull Request (PR)

A pull request is how you ask the maintainers of the original repository to review and merge your changes. When creating a pull request, provide a clear title and description of what you’ve updated and why it’s beneficial to the project.

9. Collaborate and Make Changes if Needed

The project maintainers will review your pull request. They might approve it right away or request modifications. Be open to feedback and make any necessary adjustments before your contribution is merged.

10. Celebrate Your Contribution!

Once your pull request is merged, congratulations—you’ve successfully contributed to an open-source project! Keep exploring and contributing to more repositories to continue learning and growing as a developer.

Final Thoughts

GitHub is more than just a code-sharing platform—it’s a hub for innovation, learning, and collaboration. The repositories we’ve highlighted can help you stay ahead in the ever-evolving tech world, whether you’re exploring AI, data science, or software development. By engaging with these open-source projects, you can sharpen your skills, contribute to the community, and keep up with the latest industry trends. So, start exploring, experimenting, and leveling up your expertise with these powerful GitHub repositories!

April 27, 2023

Data Analytics

Guest Blog

Supercharge your Python plots with zero extra code!

Graphs play a very important role in the data science workflow. Learn how to create dynamic professional-looking plots with Plotly.py.

We use plots to understand the distribution and nature of variables in the data and use visualizations to describe our findings in reports or presentations to both colleagues and clients. The importance of plotting in a data scientist’s work cannot be overstated.

Learn more about visualizing your data at Data Science Dojo’s Introduction to Python for Data Science!

Plotting with Matplotlib

If you have worked on any kind of data analysis problem in Python you will probably have encountered matplotlib, the default (sort of) plotting library. I personally have a love-hate relationship with it — the simplest plots require quite a bit of extra code but the library does offer flexibility once you get used to its quirks. The library is also used by pandas for its built-in plotting feature. So even if you haven’t heard of matplotlib, if you’ve used df.plot(), then you’ve unknowingly used matplotlib.

Plotting with Seaborn

Another popular library is seaborn, which is essentially a high-level wrapper around matplotlib and provides functions for some custom visualizations, these require quite a bit of code to create in the standard matplotlib. Another nice feature seaborn provides is sensible defaults for most options like axis labels, color schemes, and sizes of shapes.

Introducing Plotly

Plotly might sound like the new kid on the block, but in reality, it’s nothing like that. Plotly originally provided functionality in the form of a JavaScript library built on top of D3.js and later branched out into frontends for other languages like R, MATLAB and, of course, Python. plotly.py is the Python interface to the library.

As for usability, in my experience Plotly falls in between matplotlib and seaborn. It provides a lot of the same high-level plots as seaborn but also has extra options right there for you to tweak, such as matplotlib. It also has generally much better defaults than matplotlib.

Plotly’s interactivity

The most fascinating feature of Plotly is the interactivity. Plotly is fundamentally different from both matplotlib and seaborn because plots are rendered as static images by both of them while Plotly uses the full power of JavaScript to provide interactive controls like zooming in and panning out of the visual panel. This functionality can also be extended to create powerful dashboards and responsive visualizations that could convey so much more information than a static picture ever could.

First, let’s see how the three libraries differ in their output and complexity of code. I’ll use common statistical plots as examples.

To have a relatively even playing field, I’ll use the built-in seaborn theme that matplotlib comes with so that we don’t have to deduct points because of the plot’s looks.

fig, ax = plt.subplots(figsize=(8,6))

for species, species_df in iris.groupby('species'):
    ax.scatter(species_df['sepal_length'], species_df['sepal_width'], label=species);

ax.set(xlabel='Sepal Length', ylabel='Sepal Width', title='A Wild Scatterplot appears');
ax.legend();

fig, ax = plt.subplots(figsize=(8,6))

sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', hue='species', ax=ax);

ax.set(xlabel='Sepal Length', ylabel='Sepal Width', title='A Wild Scatterplot appears');

fig = go.FigureWidget()

for species, species_df in iris.groupby('species'):
    fig.add_scatter(x=species_df['sepal_length'], y=species_df['sepal_width'],
                    mode='markers', name=species);

fig.layout.hovermode = 'closest'
fig.layout.xaxis.title = 'Sepal Length'
fig.layout.yaxis.title = 'Sepal Width'
fig.layout.title = 'A Wild Scatterplot appears'
fig

Looking at the plots, the matplotlib and seaborn plots are basically identical, the only difference is in the amount of code. The seaborn library has a nice interface to generate a colored scatter plot based on the hue argument, but in matplotlib we are basically creating three scatter plots on the same axis. The different colors are automatically assigned in both (default color cycle but can also be specified for customization). Other relatively minor differences are in the labels and legend, where seaborn creates these automatically. This, in my experience, is less useful than it seems because very rarely do datasets have nicely formatted column names. Usually, they contain abbreviations or symbols so you still have to assign ‘proper’ labels.

But we really want to see what Plotly has done, don’t we? This time I’ll start with the code. It’s eerily similar to matplotlib, apart from not sharing the exact syntax of course, and the hovermode option. Hovering? Does that mean…? Yes, yes it does. Moving the cursor over a point reveals a tooltip showing the coordinates of the point and the class label.

The tooltip can also be customized to show other information about a particular point. To the top right of the panel, there are controls to zoom, select, and pan across the plot. The legend is also interactive, it acts sort of like checkboxes. You can click on a class to hide/show all the points of that class.

Since the amount or complexity of code isn’t that drastically different from the other two options and we get all these interactivity options, I’d argue this is basically free benefits.

fig, ax = plt.subplots(figsize=(8,6))

grouped_df = iris.groupby('species').mean()
ax.bar(grouped_df.index.values, 
       grouped_df['sepal_length'].values);

ax.set(xlabel='Species', ylabel='Average Sepal Length', title='A Wild Barchart appears');

fig, ax = plt.subplots(figsize=(8,6))

sns.barplot(data=iris, x='species', y='sepal_length', estimator=np.mean, ax=ax);

ax.set(xlabel='Species', ylabel='Average Sepal Length', title='A Wild Barchart appears');

fig = go.FigureWidget()

grouped_df = iris.groupby('species').mean()
fig.add_bar(x=grouped_df.index, y=grouped_df['sepal_length']);

fig.layout.xaxis.title = 'Species'
fig.layout.yaxis.title = 'Average Sepal Length'
fig.layout.title = 'A Wild Barchart appears'
fig

The bar chart story is similar to the scatter plots. In this case, again, seaborn provides the option within the function call to specify the metric to be shown on the y-axis using the x variable as the grouping variable. For the other two, we have to do this ourselves using pandas. Plotly still provides interactivity out of the box.

Now that we’ve seen that Plotly can hold its own against our usual plotting options, let’s see what other benefits it can bring to the table. I will showcase some trace types in Plotly that are useful in a data science workflow, and how interactivity can make them more informative.

Heatmaps

fig = go.FigureWidget()

cor_mat = car_crashes.corr()
fig.add_heatmap(z=cor_mat, 
                x=cor_mat.columns,
                y=cor_mat.columns,
                showscale=True)

fig.layout.width = 500
fig.layout.height = 500
fig.layout.yaxis.automargin = True
fig.layout.title = 'A Wild Heatmap appears'
fig

Heatmaps are commonly used to plot correlation or confusion matrices. As expected, we can hover over the squares to get more information about the variables. I’ll paint a picture for you. Suppose you have trained a linear regression model to predict something from this dataset. You can then show the appropriate coefficients in the hover tooltips to get a better idea of which correlations in the data the model has captured.

Parallel coordinates plot

fig = go.FigureWidget()

parcords = fig.add_parcoords(dimensions=[{'label':n.title(),
                                          'values':iris[n],
                                          'range':[0,8]} for n in iris.columns[:-2]])

fig.data[0].dimensions[0].constraintrange = [4,8]
parcords.line.color = iris['species_id']
parcords.line.colorscale = make_plotly(cl.scales['3']['qual']['Set2'], repeat=True)

parcords.line.colorbar.title = ''
parcords.line.colorbar.tickvals = np.unique(iris['species_id']).tolist()
parcords.line.colorbar.ticktext = np.unique(iris['species']).tolist()
fig.layout.title = 'A Wild Parallel Coordinates Plot appears'
fig

I suspect some of you might not yet be familiar with this visualization, as I wasn’t a few months ago. This is a parallel coordinates plot of four variables. Each variable is shown on a separate vertical axis. Each line corresponds to a row in the dataset and the color obviously shows which class that row belongs to. A thing that should jump out at you is that the class separation in each variable axis is clearly visible. For instance, the Petal_Length variable can be used to classify all the Setosa flowers very well.

Since the plot is interactive, the axes can be reordered by dragging to explore the interconnectedness between the classes and how it affects the class separations. Another interesting interaction is the constrained range widget (the bright pink object on the Sepal_Length axis).

It can be dragged up or down to decolor the plot. Imagine having these on all axes and finding a sweet spot where only one class is visible. As a side note, the decolored plot has a transparency effect on the lines so the density of values can be seen.

A version of this type of visualization also exists for categorical variables in Plotly. It is called Parallel Categories.

Choropleth plot

fig = go.FigureWidget()

choro = fig.add_choropleth(locations=gdp['CODE'],
                           z=gdp['GDP (BILLIONS)'],
                           text = gdp['COUNTRY'])

choro.marker.line.width = 0.1
choro.colorbar.tickprefix = '$'
choro.colorbar.title = 'GDP<br>Billions US$'
fig.layout.geo.showframe = False
fig.layout.geo.showcoastlines = False
fig.layout.title = 'A Wild Choropleth appears<br>Source:\
                    <a href="https://www.cia.gov/library/publications/the-world-factbook/fields/2195.html">\
                    CIA World Factbook</a>'
fig

A choropleth is a very commonly used geographical plot. The benefit of the interactivity should be clear in this one. We can only show a single variable using the color but the tooltip can be used for extra information. Zooming in is also very useful in this case, allowing us to look at the smaller countries. The plot title contains HTML which is being rendered properly. This can be used to create fancier labels.

Interactive scatter plot

fig = go.FigureWidget()

scatter_trace = fig.add_scattergl(x=diamonds['carat'], y=diamonds['price'],
                                  mode='markers', marker={'opacity':0.2});

fig.layout.hovermode = 'closest'
fig.layout.xaxis.title = 'Carat'
fig.layout.yaxis.title = 'Price'
fig.layout.title = 'A Wild Scatterplot appears'
fig

I’m using the scattergl trace type here. This is a version of the scatter plot that uses WebGL in the background so that the interactions don’t get laggy even with larger datasets.

There is quite a bit of over-plotting here even with the aggressive transparency, so let’s zoom into the densest part to take a closer look. Zooming in reveals that the carat variable is quantized and there are clean vertical lines.

def selection_handler(trace, points, selector):
    data_mean = np.mean(points.ys)
    fig.data[0].figure.layout.title.text = f'A Wild Scatterplot appears - mean price: ${data_mean:.1f}'
fig.data[0].on_selection(selection_handler)

fig

Selecting a bunch of points in this scatter plot will change the title of the plot to show the mean price of the selected points. This could prove to be very useful in a plot where there are groups and you want to visually see some statistics of a cluster.

This behavior is easily implemented using callback functions attached to predefined event handlers for each trace.

More interactivity

Let’s do something fancier now.

fig1 = go.FigureWidget()
fig1.add_scattergl(x=exports['beef'], y=exports['total exports'],
                   text=exports['state'],
                   mode='markers');
fig1.layout.hovermode = 'closest'
fig1.layout.xaxis.title = 'Beef Exports in Million US$'
fig1.layout.yaxis.title = 'Total Exports in Million US$'
fig1.layout.title = 'A Wild Scatterplot appears'

fig2 = go.FigureWidget()
fig2.add_choropleth(locations=exports['code'],
                    z=exports['total exports'].astype('float64'),
                    text=exports['state'],
                    locationmode='USA-states')
fig2.data[0].marker.line.width = 0.1
fig2.data[0].marker.line.color = 'white'
fig2.data[0].marker.line.width = 2
fig2.data[0].colorbar.title = 'Exports Millions USD'
fig2.layout.geo.showframe = False
fig2.layout.geo.scope = 'usa'
fig2.layout.geo.showcoastlines = False
fig2.layout.title = 'A Wild Choropleth appears'

def do_selection(trace, points, selector):
    if trace is fig2.data[0]:
        fig1.data[0].selectedpoints = points.point_inds
    else:
        fig2.data[0].selectedpoints = points.point_inds
fig1.data[0].on_selection(do_selection)
fig2.data[0].on_selection(do_selection)

HBox([fig1, fig2])

We have already seen how to make scatter and choropleth plots so let’s put them to use and plot the same data-frame. Then, using the event handlers we also saw before, we can link both plots together and interactively explore which states produce which kinds of goods.

This kind of interactive exploration of different slices of the dataset is far more intuitive and natural than transforming the data in pandas and then plotting it again.

fig = go.FigureWidget()
fig.add_histogram(x=iris['sepal_length'],
                  histnorm='probability density');
fig.layout.xaxis.title = 'Sepal Length'
fig.layout.yaxis.title = 'Probability Density'
fig.layout.title = 'A Wild Histogram appears'

def change_binsize(s):
    fig.data[0].xbins.size = s
slider = interactive(change_binsize, s=(0.1,1,0.1))
label = Label('Bin Size: ')

VBox([HBox([label, slider]),
      fig])

Using the ipywidgets module’s interactive controls different aspects of the plot can be changed to gain a better understanding of the data. Here the bin size of the histogram is being controlled.

fig = go.FigureWidget()

scatter_trace = fig.add_scattergl(x=diamonds['carat'], y=diamonds['price'],
                                  mode='markers', marker={'opacity':0.2});

fig.layout.hovermode = 'closest'
fig.layout.xaxis.title = 'Carat'
fig.layout.yaxis.title = 'Price'
fig.layout.title = 'A Wild Scatterplot appears'

def change_opacity(x):
    fig.data[0].marker.opacity = x
slider = interactive(change_opacity, x=(0.1,1,0.1))
label = Label('Marker Opacity: ')

VBox([HBox([label, slider]),
      fig])

The opacity of the markers in this scatter plot is controlled by the slider. These examples only control the visual or layout aspects of the plot. We can also change the actual data which is being shown using dropdowns. I’ll leave you to explore that on your own.

What have we learned about Python plots

Let’s take a step back and sum up what we have learned. We saw that Plotly can reveal more information about our data using interactive controls, which we get for free and with no extra code. We saw a few interesting, slightly more complex visualizations available to us. We then combined the plots with custom widgets to create custom interactive workflows.

All this is just scratching the surface of what Plotly is capable of. There are many more trace types, an animations framework, and integration with Dash to create professional dashboards and probably a few other things that I don’t even know of.

August 18, 2022

Programming

LLM - Online Courses

Reviews

Consulting

Community

matplotlib

Data Science Dojo Staff

15 Python Packages You Must Know to Maximize Your Coding Productivity

What are Python Packages?

The Structure of a Python Package

Top 15 Python Packages You Must Explore

Core Libraries for Data Analysis

1. NumPy

2. Pandas

3. Dask

Visualization Tools

4. Matplotlib

5. Seaborn

6. Plotly

Machine Learning and Deep Learning

7. Scikit-learn

8. TensorFlow

9. PyTorch

Natural Language Processing (NLP)

10. NLTK

11. SpaCy

Web Scraping

12. BeautifulSoup

Bonus Additions to the List!

13. SQLAlchemy

14. OpenCV

15. urllib

What is the Standard vs Third-Party Packages Debate?

What are Standard Packages?

What are Third-Party Packages?

Key Points of the Debate

Wrapping up

Data Science Dojo Staff

Demystifying heatmaps: A comprehensive beginner’s guide

What are heatmaps?

Advantages of heatmaps

Creating heatmaps using “Matplotlib”

Customizations available in Matplotlib for heatmaps

Creating heatmaps using “Seaborn”

Customizations available in Seaborn for heatmaps:

Limitations of heatmaps:

Conclusion

Data Science Dojo Staff

Mastering Histograms: A Beginner’s Comprehensive Guide

Defining Histograms

Advantages of Histograms

Visual Representation

Easy Interpretation

Outlier Identification

Comparison of Data Sets

Data Summarization

Creating a Histogram Using Matplotlib Library

Customizations Available in Matplotlib for Histograms

Creating a Histogram Using ‘Seaborn’ Library

Customizations Available in Seaborn for Histograms

Limitations of Histograms

Wrapping Up

Data Science Dojo Staff

Line Plots for Beginners: Charting Success

Advantages of Line Plots

Creating line plots:

Using “Matplotlib”:

1. “Single” Line Plot:

2. “Multiple” Lines on One Plot:

3. “Customized” Line Plot:

4. Adding a Regression Line:

Using “Seaborn”:

1. “Single” Line Plot:

2. “Multiple” Lines on One Plot:

3. “Customized” Line Plot:

4. Adding a Regression Line:

Limitations of Line Plots:

Wrapping Up

Ali Haider Shalwani

12 Powerful GitHub Repositories for Data Science and Engineering

What is GitHub?

Creating heatmaps using “Matplotlib”  

Customizations available in Matplotlib for heatmaps  

Creating heatmaps using “Seaborn” 

 Advantages of Histograms

 Creating a Histogram Using Matplotlib Library

Customizations Available in Matplotlib for Histograms   

 Customizations Available in Seaborn for Histograms

 Wrapping Up