histograms

Syed Muhammad Mubashir Rizvi

Understanding Data Visualizations: A Beginner’s Guide

Unlock the full potential of your data with the power of data visualization! Go through this blog and discover why visualizations are crucial in Data Science and explore the most effective and game-changing types of visualizations that will revolutionize the way you interpret and extract insights from your data. Get ready to take your data analysis skills to the next level!

What are Data Visualizations?

Data visualizations involve using different charts, graphs, and other visual elements to represent data and information graphically and the purpose of it is to make complex and hard to understand and complex datasets easily understandable, accessible, and interpretable.

This powerful tool enables businesses to explore, analyze and identify trends, patterns and relationships from the raw data that are usually hidden by just looking at the data itself or its statistics.

By mastering the ability of data visualization, businesses and organizations can make effective and important decisions and actions based on the data and the insights gained. These decisions are additionally referred to as ‘Data-Driven Decisions’. By presenting data in a visual format, analysts can effectively communicate their findings to their team and to their clients, which is a challenging task as clients sometimes can’t interpret raw data and need a medium that they can interpret easily.

Importance of Data Visualization

Here is a list of some benefits data visualization offers that make us understand its importance and its usefulness:

1. Simplifying complex data: It enables complex data to be presented in a simplified and understandable manner. By using visual representations such as graphs and charts, data can be made more accessible to individuals who are not familiar with the underlying data.

2. Enhancing insights: It can help to identify patterns and trends that might not be immediately apparent from raw data. By presenting data visually, it is easier to identify correlations and relationships between variables, enabling analysts to draw insights and make more informed decisions.

3. Enhanced communication: It makes it easier to communicate complex data to a wider audience, including non-technical stakeholders in a way that is easy to understand and engage with. Visualizations can be used to tell a story, convey complex information, and facilitate collaboration among stakeholders, team members, and decision makers.

4. Increasing efficiency: It can save time and increase efficiency by enabling analysts to quickly identify patterns and relationships in raw data. This can help to streamline the analysis process and enable analysts to focus their efforts on areas that are most likely to yield insights.

5. Identifying anomalies and errors: It can help to identify errors or anomalies in the data. By presenting data visually, it is easier to spot outliers or unusual patterns that might indicate errors in data collection or processing. This can help analysts to clean and refine the data, ensuring that the insights derived from the data are accurate and reliable.

You might also like: 33 Ways to Stunning Data Visualization

6. Faster and more effective decision-making: It can help you make more informed and data-driven decisions by presenting information in a way that is easy to digest and interpret. Visualizations can help you identify key trends, outliers, and insights that can inform your decision-making, leading to faster and more effective outcomes.

7. Improved data exploration and analysis: It enables you to explore and analyze your data in a more intuitive and interactive way. By visualizing data in different formats and at different levels of detail, you can gain new insights and identify areas for further exploration and analysis.

Choosing the Right Type of Visualization

This is the only challenge faced when working with data visualizations, and to master this skill completely, you must have a clear idea about choosing the right type of visual for creating amazing, clear, attractive, and pleasing visuals. Keeping the following points in mind will help you in this:

Identify Purpose

The first and most crucial step in choosing the right visualization is clearly identifying why you’re creating it. Are you trying to compare values, show relationships, highlight distributions, or illustrate compositions? Each of these goals aligns best with a specific type of chart or graph.

For example:

If your goal is to compare categories, a bar chart may work best.
To show how something changes over time, a line chart might be more appropriate.
If you’re highlighting the proportion of parts to a whole, consider using a pie chart or stacked bar chart.
When you’re interested in patterns or relationships between variables, scatter plots or bubble charts can provide more insight.

Being clear about your objective from the start will prevent miscommunication and make your visual more impactful and efficient.

Understanding Audience

Knowing your audience is just as important as understanding your data. Your audience’s background, familiarity with the topic, and their expectations all influence how your visual should be designed.

For instance:

A technical team may appreciate more complex visualizations like heatmaps, box plots, or interactive dashboards.
A non-technical audience, such as clients or stakeholders, might prefer simplified visuals like bar charts or infographics with minimal jargon.
Consider whether your audience will view the visualization on a small screen, in a presentation, or as part of a report—this impacts layout, detail, and interactivity.

Tailoring your visuals to your audience ensures that the message is not only understood but also resonates with them. Remember, the best visuals bridge the gap between data and decision-making by speaking the audience’s language.

Selecting the Appropriate Visual

Once you have clearly defined your purpose and understand your audience, the next step is to choose the most suitable visualization to convey your insights. The right type of visual ensures that your data story is both meaningful and easy to grasp. Here are some common categories of data visualizations and when to use them:

Comparison Charts:
These charts are ideal when you want to compare values across different groups or categories. Use them when you’re analyzing trends, changes over time, or differences between entities. Common examples include bar charts, column charts, and line charts.
Distribution Charts:
If your goal is to understand the spread or range of your data, distribution charts are the way to go. They help you spot patterns, outliers, and the overall shape of your data. Histograms, box plots, and scatter plots fall under this category.
Relationship Charts:
These visuals are used to reveal the correlation or connection between two or more variables. They are especially helpful in uncovering trends or associations. Scatter plots, bubble charts, and heat maps are commonly used for this purpose.
Composition Charts:
When your data represents parts of a whole, composition charts help break it down. They show how different segments contribute to the total. Pie charts, stacked bar charts, and area charts are typical examples.

Choosing the right chart type is not just a technical step—it’s a storytelling decision. When done right, your visual not only looks good but also delivers insights in a clear and engaging way.

Ethics of Data Visualization

In many cases, data visualization may also be used to misinterpret information intentionally or unintentionally. An example includes manipulating data by using specific scales or omitting specific data points to support a particular narrative and not showing the actual view of the data. Some considerations regarding the ethics of data visualization include:

Accuracy of data: Data should be accurate and should not be presented in a way to misinterpret information.
Appropriateness of visualization type: The type of visual selected should be appropriate for the data being presented and the message being conveyed.
Clarity of message: The message conveyed through visualization should be clear and easy to understand.
Avoiding bias and discrimination: Each data visualization should be clear of bias and discrimination.

Avoiding Misleading Representations

You want to represent your data in the most efficient way possible which can be easily interpreted and free of ambiguities, now that’s not always the case, there are times when your data can mislead your visualization and convey the wrong message. In those cases, you can take help from the following points to avoid misleadingness:

Use consistent scales and axes in your charts and graphs.
Avoid using truncated axes and skewed data ranges which cause data to appear less significant.
Label your data points and axes properly for clarity.
Avoid cherry-picking the data to support a particular narrative.
Provide clear and concise context for the data you are presenting.

Types of Data Visualizations

There are numerous visualizations available, each with its own use and importance, and the choice of a visual depends on your need i.e., what kind of data you want to analyze, and what type of insight are you looking for. Nonetheless, here are some most common visuals used in data science:

Bar Charts: Bar charts are normally used to compare categorical data, such as the frequency or proportion of different categories. They are used to visualize data that can be organized or split into different discrete groups or categories.
Line Graphs: Line graphs are a type of visualization that uses lines to represent data values. They are typically used to represent continuous data.
Scatter Plots: Scatter plot is a type of data visualization that displays the relationship between two quantitative (numerical) variables. They are used to explore and analyze the correlation or association between two continuous variables.
Histograms: A histogram graph represents the distribution of a continuous numerical variable by dividing it into intervals and counting the number of observations. They are used to visualize the shape and spread of data.

Heatmaps: Heatmaps are commonly used to show the relationships between two variables, such as the correlation between different features in a dataset.
Box and Whisker Plots: They are also known as boxplots and are used to display the distribution of a dataset. A box plot consists of a box that spans the first quartile (Q1) to the third quartile (Q3) of the data, with a line inside the box representing the median.
Count Plots: A count plot is a type of bar chart that displays the number of occurrences of a categorical variable. The x-axis represents the categories, and the y-axis represents the count or frequency of each category.
Point Plots: A point plot is a type of line graph that displays the mean (or median) of a continuous variable for each level of a categorical variable. They are useful for comparing the values of a continuous variable across different levels.
Choropleth Maps: Choropleth map is a type of geographical visualization that uses color to represent data values for different geographic regions, such as countries, states, or counties.
Tree Maps: This visualization is used to display hierarchical data as nested rectangles, with each rectangle representing a node in the hierarchy. Treemaps are useful for visualizing complex hierarchical data in a way that highlights the relative sizes and values of different nodes.

Conclusion

So, this blog was all about introducing you to this powerful tool in the world of data science. Now you have a clear idea about what data visualization is, and what is its importance for analysts, businesses, and stakeholders.

You also learned about how you can choose the right type of visual, the ethics of data visualization and got familiar with 10 new different data visualizations and how they look like. The next step for you is to learn about how you can create these visuals using Python libraries such as matplotlib, seaborn and plotly.

May 29, 2023

Data Visualization

Data Science Dojo Staff

Mastering Histograms: A Beginner’s Comprehensive Guide

Histograms are a fundamental tool in data visualization, offering a simple yet powerful way to understand the distribution of data. Whether you’re new to data analysis or looking to sharpen your skills, histograms are a crucial tool for summarizing and visualizing data points.

They allow you to easily spot trends, patterns, and outliers in your dataset. In this comprehensive guide, we’ll explore what histograms are, why they are important, and how to create and interpret them. By the end of this guide, you’ll be equipped with the knowledge to use histograms effectively in your own data analysis projects.

Defining Histograms

A histogram is a type of graphical representation of data that shows the distribution of numerical values. It consists of a set of vertical bars, where each bar represents a range of values, and the height of the bar indicates the frequency or count of data points falling within that range.  

Histograms are commonly used in statistics and data analysis to visualize the shape of a data set and to identify patterns, such as the presence of outliers or skewness. They are also useful for comparing the distribution of different data sets or for identifying trends over time. 

The picture above shows how 1000 random data points from a normal distribution with a mean of 0 and standard deviation of 1 are plotted in a histogram with 30 bins and black edges.  

 Advantages of Histograms

Histograms are more than just simple bar charts—they are powerful tools that help analysts make sense of complex data. From spotting trends to identifying outliers, histograms offer several advantages that make them essential in data analysis and visualization. Let’s explore some key benefits of using histograms.

Visual Representation

Histograms offer a clear visual representation of data distribution, allowing us to quickly observe the frequency of data points across different ranges or bins. This visual approach makes it easier to spot trends, patterns, and even anomalies that might not be immediately evident in raw data. Whether you’re looking for skewness, symmetry, or multimodal distributions, histograms provide a straightforward way to understand the overall structure of your data.

Easy Interpretation

One of the main strengths of histograms is their simplicity. Even non-experts can easily interpret them, as the bar chart format intuitively shows how frequently data points fall within specific ranges. The height of each bar represents the frequency or proportion of data points in each bin, making it accessible for anyone to understand the distribution without needing advanced statistical knowledge.

Outlier Identification

Histograms are especially useful for identifying outliers or extreme values. These are typically represented by individual bars that stand apart from the rest, often appearing as isolated spikes on one end of the histogram. Identifying outliers can be crucial for understanding data anomalies or errors and can inform decisions regarding data cleaning or further investigation.

Comparison of Data Sets

Another powerful feature of histograms is their ability to compare the distributions of multiple data sets. By overlaying or side-by-side plotting histograms of different datasets, you can quickly identify similarities and differences in their distributions. This helps to compare trends across different groups, such as customer segments, time periods, or product categories, enabling more insightful decision-making.

Data Summarization

Histograms are an excellent tool for summarizing large datasets. Instead of getting lost in the raw numbers, histograms condense the information into digestible features like the overall shape, center (e.g., mean or median), and spread (e.g., range or standard deviation) of the distribution. This gives a quick snapshot of the data’s most important characteristics, helping analysts and decision-makers grasp the key points without needing to process extensive raw data.

 Creating a Histogram Using Matplotlib Library

We can create histograms using Matplotlib by following a series of steps. Following the import statements of the libraries, the code generates a set of 1000 random data points from a normal distribution with a mean of 0 and standard deviation of 1, using the `numpy.random.normal()` function.  

The plt.hist() function in Python is a powerful tool for creating histograms. By providing the data, number of bins, bar color, and edge color as input, this function generates a histogram plot.
To enhance the visualization, the xlabel(), ylabel(), and title() functions are utilized to add labels to the x and y axes, as well as a title to the plot.
Finally, the show() function is employed to display the histogram on the screen, allowing for detailed analysis and interpretation.

<br />

Overall, this code generates a histogram plot of a set of random data points from a normal distribution, with 30 bins, blue bars, black edges, labeled axes, and a title. The histogram shows the frequency distribution of the data, with a bell-shaped curve indicating the normal distribution.    

Customizations Available in Matplotlib for Histograms   

In Matplotlib, there are several customizations available for histograms. These include:

Adjusting the number of bins. 
Changing the color of the bars. 
Changing the opacity of the bars. 
Changing the edge color of the bars. 
Adding a grid to the plot. 
Adding labels and a title to the plot. 
Adding a cumulative density function (CDF) line. 
Changing the range of the x-axis. 
Adding a rug plot.

Now, let’s see all the customizations being implemented in a single example code snippet: 

<br />

In this example, the histogram is customized in the following ways: 

The number of bins is set to `20` using the `bins` parameter. 
The transparency of the bars is set to `0.5` using the `alpha` parameter. 
The edge color of the bars is set to `black` using the `edgecolor` parameter. 
The color of the bars is set to `green` using the `color` parameter. 
The range of the x-axis is set to `(-3, 3)` using the `range` parameter. 
The y-axis is normalized to show density using the `density` parameter. 
Labels and a title are added to the plot using the `xlabel()`, `ylabel()`, and `title()` functions. 
A grid is added to the plot using the `grid` function. 
A cumulative density function (CDF) line is added to the plot using the `cumulative` parameter and `histtype=’step’`. 
A rug plot showing individual data points is added to the plot using the `plot` function.

Creating a Histogram Using ‘Seaborn’ Library

We can create histograms using Seaborn by following the steps: 

First and foremost, importing the libraries: `NumPy`, `Seaborn`, `Matplotlib`, and `Pandas`. After importing the libraries, a toy dataset is created using `pd.DataFrame()` of 1000 samples that are drawn from a normal distribution with mean 0 and standard deviation 1 using NumPy’s `random.normal()` function. 
We use Seaborn’s `histplot()` function to plot a histogram of the ‘data’ column of the DataFrame with `20` bins and a `blue` color. 
The plot is customized by adding labels, and a title, and changing the style to a white grid using the `set_style()` function. 
Finally, we display the plot using the `show()` function from matplotlib.

<br />

Overall, this code snippet demonstrates how to use Seaborn to plot a histogram of a dataset and customize the appearance of the plot quickly and easily. 

 Customizations Available in Seaborn for Histograms

Following is a list of the customizations available for Histograms in Seaborn: 

Change the number of bins. 
Change the color of the bars. 
Change the color of the edges of the bars. 
Overlay a density plot on the histogram. 
Change the bandwidth of the density plot. 
Change the type of histogram to cumulative. 
Change the orientation of the histogram to horizontal. 
Change the scale of the y-axis to logarithmic.

Now, let’s see all these customizations being implemented here as well, in a single example code snippet: 

<br />

 In this example, we have done the following customizations: 

Set the number of bins to `20`. 
Set the color of the bars to `green`. 
Set the `edgecolor` of the bars to `black`. 
Added a density plot overlaid on top of the histogram using the `kde` parameter set to `True`. 
Set the bandwidth of the density plot to `0.5` using the `kde_kws` parameter. 
Set the histogram to be cumulative using the `cumulative` parameter. 
Set the y-axis scale to logarithmic using the `log_scale` parameter. 
Set the title of the plot to ‘Customized Histogram’. 
Set the x-axis label to ‘Values’. 
Set the y-axis label to ‘Frequency’.

Limitations of Histograms

 Histograms are widely used for visualizing the distribution of data, but they also have limitations that should be considered when interpreting them. These limitations are jotted down below: 

They can be sensitive to the choice of bin size or the number of bins, which can affect the interpretation of the distribution. Choosing too few bins can result in a loss of information while choosing too many bins can create artificial patterns and noise. 
They can be influenced by outliers, which can skew the distribution or make it difficult to see patterns in the data. 
They are typically univariate and cannot capture relationships between multiple variables or dimensions of data. 
Histograms assume that the data is continuous and does not work well with categorical data or data with large gaps between values. 
They can be affected by the choice of starting and ending points, which can affect the interpretation of the distribution. 
They do not provide information on the shape of the distribution beyond the binning intervals.

 It’s important to consider these limitations when using histograms and to use them in conjunction with other visualization techniques to gain a more complete understanding of the data. 

 Wrapping Up

In conclusion, histograms are powerful tools for visualizing the distribution of data. They provide valuable insights into the shape, patterns, and outliers present in a dataset. With their simplicity and effectiveness, histograms offer a convenient way to summarize and interpret large amounts of data.

By customizing various aspects such as the number of bins, colors, and labels, you can tailor the histogram to your specific needs and effectively communicate your findings. So, embrace the power of histograms and unlock a deeper understanding of your data.

Written by Safia Faiz

May 23, 2023

Data Visualization

Guest Blog

Logistic regression in R: A classification technique to predict credit card default

Learn how logistic regression fits a dataset to make predictions in R, as well as when and why to use it.

Logistic regression is one of the statistical techniques in machine learning used to form prediction models. It is one of the most popular classification algorithms mostly used for binary classification problems (problems with two class values, however, some variants may deal with multiple classes as well). It’s used for various research and industrial problems.

Therefore, it is essential to have a good grasp of logistic regression algorithms while learning data science. This tutorial is a sneak peek from many of Data Science Dojo’s hands-on exercises from their data science Bootcamp program, you will learn how logistic regression fits a dataset to make predictions, as well as when and why to use it.

In short, Logistic Regression is used when the dependent variable(target) is categorical. For example:

To predict whether an email is spam (1) or not spam (0)
Whether the tumor is malignant (1) or not (0)

Intro to Logistic Regression

It is named ‘Logistic Regression’ because its underlying technology is quite the same as Linear Regression. There are structural differences in how linear and logistic regression operate. Therefore, linear regression isn’t suitable to be used for classification problems. This link answers in detail why linear regression isn’t the right approach for classification.

Its name is derived from one of the core functions behind its implementation called the logistic function or the sigmoid function. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

The hypothesis function of logistic regression can be seen below where the function g(z) is also shown.

The hypothesis for logistic regression now becomes:

Here θ (theta) is a vector of parameters that our model will calculate to fit our classifier.

After calculations from the above equations, the cost function is now as follows:

Here m is several training examples. Like Linear Regression, we will use gradient descent to minimize our cost function and calculate the vector θ (theta).

This tutorial will follow the format below to provide you with hands-on practice with Logistic Regression:

Importing Libraries
Importing Datasets
Exploratory Data Analysis
Feature Engineering
Pre-processing
Model Development
Prediction
Evaluation

The scenario

In this tutorial, we will be working with the Default of Credit Card Clients Data Set. This data set has 30000 rows and 24 columns. The data set could be used to estimate the probability of default payment by credit card clients using the data provided. These attributes are related to various details about a customer, his past payment information, and bill statements. It is hosted in Data Science Dojo’s repository.

Think of yourself as a lead data scientist employed at a large bank. You have been assigned to predict whether a particular customer will default on their payment next month or not. The result is an extremely valuable piece of information for the bank to make decisions regarding offering credit to its customers and could massively affect the bank’s revenue. Therefore, your task is very critical. You will learn to use logistic regression to solve this problem.

The dataset is a tricky one as it has a mix of categorical and continuous variables. Moreover, you will also get a chance to practice these concepts through short assignments given at the end of a few sub-modules. Feel free to change the parameters in the given methods once you have been through the entire notebook.

Download Exercise Files

1) Importing libraries

We’ll begin by importing the dependencies that we require. The following dependencies are popularly used for data-wrangling operations and visualizations. We would encourage you to have a look at their documentation.

library(knitr)
library(tidyverse)
library(ggplot2)
library(mice)
library(lattice)
library(reshape2)
#install.packages("DataExplorer") if the following package is not available
library(DataExplorer)

2) Importing Datasets

The dataset is available at Data Science Dojo’s repository in the following link. We’ll use the head method to view the first few rows.

## Need to fetch the excel file
path <- "https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/
Default%20of%20Credit%20Card%20Clients/default%20of%20credit%20card%20clients.csv"
data <- read.csv(file = path, header = TRUE)
head(data)

Since the header names are in the first row of the dataset, we’ll use the code below to first assign the headers to be the one from the first row and then delete the first row from the dataset. This way we will get our desired form.

colnames(data) <- as.character(unlist(data[1,]))
data = data[-1, ]
head(data)

To avoid any complications ahead, we’ll rename our target variable “default payment next month” to a name without spaces using the code below.

colnames(data)[colnames(data)=="default payment next month"] <- "default_payment"
head(data)

3) Exploratory data analysis

Data Exploration is one of the most significant portions of the machine-learning process. Clean data can ensure a notable increase in the accuracy of our model. No matter how powerful our model is, it cannot function well unless the data we provide has been thoroughly processed.

This step will briefly take you through this step and assist you in visualizing your data, finding the relation between variables, dealing with missing values and outliers, and assisting in getting some fundamental understanding of each variable we’ll use.

Moreover, this step will also enable us to figure out the most important attributes to feed our model and discard those that have no relevance.

We will start by using the dim function to print out the dimensionality of our data frame.

dim(data)

30000 25

The str method will allow us to know the data type of each variable. We’ll transform it to a numeric data type since it’ll be easier to use for our functions ahead.

str(data)

'data.frame':	30000 obs. of  25 variables:
 $ ID             : Factor w/ 30001 levels "1","10","100",..: 1 11112 22223 23335 24446 25557 26668 27779 28890 2 ...
 $ LIMIT_BAL      : Factor w/ 82 levels "10000","100000",..: 14 5 81 48 48 48 49 2 7 14 ...
 $ SEX            : Factor w/ 3 levels "1","2","SEX": 2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION      : Factor w/ 8 levels "0","1","2","3",..: 3 3 3 3 3 2 2 3 4 4 ...
 $ MARRIAGE       : Factor w/ 5 levels "0","1","2","3",..: 2 3 3 2 2 3 3 3 2 3 ...
 $ AGE            : Factor w/ 57 levels "21","22","23",..: 4 6 14 17 37 17 9 3 8 15 ...
 $ PAY_0          : Factor w/ 12 levels "-1","-2","0",..: 5 1 3 3 1 3 3 3 3 2 ...
 $ PAY_2          : Factor w/ 12 levels "-1","-2","0",..: 5 5 3 3 3 3 3 1 3 2 ...
 $ PAY_3          : Factor w/ 12 levels "-1","-2","0",..: 1 3 3 3 1 3 3 1 5 2 ...
 $ PAY_4          : Factor w/ 12 levels "-1","-2","0",..: 1 3 3 3 3 3 3 3 3 2 ...
 $ PAY_5          : Factor w/ 11 levels "-1","-2","0",..: 2 3 3 3 3 3 3 3 3 1 ...
 $ PAY_6          : Factor w/ 11 levels "-1","-2","0",..: 2 4 3 3 3 3 3 1 3 1 ...
 $ BILL_AMT1      : Factor w/ 22724 levels "-1","-10","-100",..: 13345 10030 10924 15026 21268 18423 12835 1993 1518 307 ...
 $ BILL_AMT2      : Factor w/ 22347 levels "-1","-10","-100",..: 11404 5552 3482 15171 16961 17010 13627 12949 3530 348 ...
 $ BILL_AMT3      : Factor w/ 22027 levels "-1","-10","-100",..: 18440 9759 3105 15397 12421 16866 14184 17258 2072 365 ...
 $ BILL_AMT4      : Factor w/ 21549 levels "-1","-10","-100",..: 378 11833 3620 10318 7717 6809 16081 8147 2129 378 ...
 $ BILL_AMT5      : Factor w/ 21011 levels "-1","-10","-100",..: 385 11971 3950 10407 6477 6841 14580 76 1796 2638 ...
 $ BILL_AMT6      : Factor w/ 20605 levels "-1","-10","-100",..: 415 11339 4234 10458 6345 7002 14057 15748 12215 3230 ...
 $ PAY_AMT1       : Factor w/ 7944 levels "0","1","10","100",..: 1 1 1495 2416 2416 3160 5871 4578 4128 1 ...
 $ PAY_AMT2       : Factor w/ 7900 levels "0","1","10","100",..: 6671 5 1477 2536 4508 2142 4778 6189 1 1 ...
 $ PAY_AMT3       : Factor w/ 7519 levels "0","1","10","100",..: 1 5 5 646 6 6163 4292 1 4731 1 ...
 $ PAY_AMT4       : Factor w/ 6938 levels "0","1","10","100",..: 1 5 5 337 6620 5 2077 5286 5 813 ...
 $ PAY_AMT5       : Factor w/ 6898 levels "0","1","10","100",..: 1 1 5 263 5777 5 950 1502 5 408 ...
 $ PAY_AMT6       : Factor w/ 6940 levels "0","1","10","100",..: 1 2003 4751 5 5796 6293 963 1267 5 1 ...
 $ default_payment: Factor w/ 3 levels "0","1","default payment next month": 2 2 1 1 1 1 1 1 1 1 ...

data[, 1:25] <- sapply(data[, 1:25], as.character)

We have involved an intermediate step by converting our data to character first. We need to use as.character before as.numeric. This is because factors are stored internally as integers with a table to give the factor level labels. Just using as.numeric will only give the internal integer codes.

data[, 1:25] <- sapply(data[, 1:25], as.numeric)
str(data)

'data.frame':	30000 obs. of  25 variables:
 $ ID             : num  1 2 3 4 5 6 7 8 9 10 ...
 $ LIMIT_BAL      : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
 $ SEX            : num  2 2 2 2 1 1 1 2 2 1 ...
 $ EDUCATION      : num  2 2 2 2 2 1 1 2 3 3 ...
 $ MARRIAGE       : num  1 2 2 1 1 2 2 2 1 2 ...
 $ AGE            : num  24 26 34 37 57 37 29 23 28 35 ...
 $ PAY_0          : num  2 -1 0 0 -1 0 0 0 0 -2 ...
 $ PAY_2          : num  2 2 0 0 0 0 0 -1 0 -2 ...
 $ PAY_3          : num  -1 0 0 0 -1 0 0 -1 2 -2 ...
 $ PAY_4          : num  -1 0 0 0 0 0 0 0 0 -2 ...
 $ PAY_5          : num  -2 0 0 0 0 0 0 0 0 -1 ...
 $ PAY_6          : num  -2 2 0 0 0 0 0 -1 0 -1 ...
 $ BILL_AMT1      : num  3913 2682 29239 46990 8617 ...
 $ BILL_AMT2      : num  3102 1725 14027 48233 5670 ...
 $ BILL_AMT3      : num  689 2682 13559 49291 35835 ...
 $ BILL_AMT4      : num  0 3272 14331 28314 20940 ...
 $ BILL_AMT5      : num  0 3455 14948 28959 19146 ..
 $ BILL_AMT6      : num  0 3261 15549 29547 19131 ...
 $ PAY_AMT1       : num  0 0 1518 2000 2000 ...
 $ PAY_AMT2       : num  689 1000 1500 2019 36681 ...
 $ PAY_AMT3       : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
 $ PAY_AMT4       : num  0 1000 1000 1100 9000 ...
 $ PAY_AMT5       : num  0 0 1000 1069 689 ...
 $ PAY_AMT6       : num  0 2000 5000 1000 679 ...
 $ default_payment: num  1 1 0 0 0 0 0 0 0 0 ...

When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together. For a continuous (numeric) variable like “age”, it returns the 5-number summary showing 5 descriptive statistics as these are numeric values.

summary(data)

       ID          LIMIT_BAL            SEX          EDUCATION    
 Min.   :    1   Min.   :  10000   Min.   :1.000   Min.   :0.000  
 1st Qu.: 7501   1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000  
 Median :15000   Median : 140000   Median :2.000   Median :2.000  
 Mean   :15000   Mean   : 167484   Mean   :1.604   Mean   :1.853  
 3rd Qu.:22500   3rd Qu.: 240000   3rd Qu.:2.000   3rd Qu.:2.000  
 Max.   :30000   Max.   :1000000   Max.   :2.000   Max.   :6.000  
    MARRIAGE          AGE            PAY_0             PAY_2        
 Min.   :0.000   Min.   :21.00   Min.   :-2.0000   Min.   :-2.0000  
 1st Qu.:1.000   1st Qu.:28.00   1st Qu.:-1.0000   1st Qu.:-1.0000  
 Median :2.000   Median :34.00   Median : 0.0000   Median : 0.0000  
 Mean   :1.552   Mean   :35.49   Mean   :-0.0167   Mean   :-0.1338  
 3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   :3.000   Max.   :79.00   Max.   : 8.0000   Max.   : 8.0000  
     PAY_3             PAY_4             PAY_5             PAY_6        
 Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000  
 1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   :-0.1662   Mean   :-0.2207   Mean   :-0.2662   Mean   :-0.2911  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000  
   BILL_AMT1         BILL_AMT2        BILL_AMT3         BILL_AMT4      
 Min.   :-165580   Min.   :-69777   Min.   :-157264   Min.   :-170000  
 1st Qu.:   3559   1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327  
 Median :  22382   Median : 21200   Median :  20089   Median :  19052  
 Mean   :  51223   Mean   : 49179   Mean   :  47013   Mean   :  43263  
 3rd Qu.:  67091   3rd Qu.: 64006   3rd Qu.:  60165   3rd Qu.:  54506  
 Max.   : 964511   Max.   :983931   Max.   :1664089   Max.   : 891586  
   BILL_AMT5        BILL_AMT6          PAY_AMT1         PAY_AMT2      
 Min.   :-81334   Min.   :-339603   Min.   :     0   Min.   :      0  
 1st Qu.:  1763   1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833  
 Median : 18105   Median :  17071   Median :  2100   Median :   2009  
 Mean   : 40311   Mean   :  38872   Mean   :  5664   Mean   :   5921  
 3rd Qu.: 50191   3rd Qu.:  49198   3rd Qu.:  5006   3rd Qu.:   5000  
 Max.   :927171   Max.   : 961664   Max.   :873552   Max.   :1684259  
    PAY_AMT3         PAY_AMT4         PAY_AMT5           PAY_AMT6       
 Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0.0  
 1st Qu.:   390   1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8  
 Median :  1800   Median :  1500   Median :  1500.0   Median :  1500.0  
 Mean   :  5226   Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5  
 3rd Qu.:  4505   3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0  
 Max.   :896040   Max.   :621000   Max.   :426529.0   Max.   :528666.0  
 default_payment 
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.2212  
 3rd Qu.:0.0000  
 Max.   :1.0000

Using the introduced method, we can get to know the basic information about the dataframe, including the number of missing values in each variable.

introduce(data)

As we can observe, there are no missing values in the dataframe.

The information in summary above gives a sense of the continuous and categorical features in our dataset. However, evaluating these details against the data description shows that categorical values such as EDUCATION and MARRIAGE have categories beyond those given in the data dictionary. We’ll find out these extra categories using the value_counts method.

count(data, vars = EDUCATION)

vars	n
0	14
1	10585
2	14030
3	4917
4	123
5	280
6	51

The data dictionary defines the following categories for EDUCATION: “Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)”. However, we can also observe 0 along with numbers greater than 4, i.e. 5 and 6. Since we don’t have any further details about it, we can assume 0 to be someone with no educational experience and 0 along with 5 & 6 can be placed in others along with 4.

count(data, vars = MARRIAGE)

vars	n
0	54
1	13659
2	15964
3	323

The data dictionary defines the following categories for MARRIAGE: “Marital status (1 = married; 2 = single; 3 = others)”. Since category 0 hasn’t been defined anywhere in the data dictionary, we can include it in the ‘others’ category marked as 3.

#replace 0's with NAN, replace others too
data$EDUCATION[data$EDUCATION == 0] <- 4
data$EDUCATION[data$EDUCATION == 5] <- 4
data$EDUCATION[data$EDUCATION == 6] <- 4
data$MARRIAGE[data$MARRIAGE == 0] <- 3

count(data, vars = MARRIAGE)
count(data, vars = EDUCATION)

vars	n
1	13659
2	15964
3	377

vars	n
1	10585
2	14030
3	4917
4	468

We’ll now move on to a multi-variate analysis of our variables and draw a correlation heat map from the DataExplorer library. The heatmap will enable us to find out the correlation between each variable. We are more interested in finding out the correlation between our predictor attributes with the target attribute default payment next month. The color scheme depicts the strength of the correlation between the 2 variables.

This will be a simple way to quickly find out how much of an impact a variable has on our final outcome. There are other ways as well to figure this out.

plot_correlation(na.omit(data), maxcat = 5L)

We can observe the weak correlation of AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, and BILL_AMT6 with our target variable.

Now let’s have a univariate analysis of our variables. We’ll start with the categorical variables and have a quick check on the frequency of distribution of categories. The code below will allow us to observe the required graphs. We’ll first draw the distribution for all PAY variables.

plot_histogram(data)

We can make a few observations from the above histogram. The distribution above shows that nearly all PAY attributes are rightly skewed.

4) Feature engineering

This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial.

Analyzing our data above, we’ve been able to note the extremely weak correlation of some variables with the final target variable. The following are the ones that have significantly low correlation values: AGE, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6.

#deleting columns

data_new <- select(data, -one_of('ID','AGE', 'BILL_AMT2',
       'BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6'))
head(data_new)

5) Pre-processing

Standardization is a transformation that centers the data by removing the mean value of each feature and then scaling it by dividing (non-constant) features by their standard deviation. After standardizing data the mean will be zero and the standard deviation one.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminate analysis. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

In the code below, we’ll use the scale method to transform our dataset using it.

data_new[, 1:17] <- scale(data_new[, 1:17])

head(data_new)

The next task we’ll do is to split the data for training and testing as we’ll use our test data to evaluate our model. We will now split our dataset into train and test. We’ll change it to 0.3. Therefore, 30% of the dataset is reserved for testing while the remaining is for training. By default, the dataset will also be shuffled before splitting.

#create a list of random number ranging from 1 to number of rows from actual data 
#and 70% of the data into training data  

data2 = sort(sample(nrow(data_new), nrow(data_new)*.7))

#creating training data set by selecting the output row values
train <- data_new[data2,]

#creating test data set by not selecting the output row values
test <- data_new[-data2,]

Let us print the dimensions of all these variables using the dim method. You can notice the 70-30% split.

dim(train)
dim(test)

21000 18

9000 18

6) Model development

We will now move on to the most important step of developing our logistic regression model. We have already fetched our machine learning model in the beginning. Now with a few lines of code, we’ll first create a logistic regression model which has been imported from sci-kit learn’s linear model package to our variable named model.

Following this, we’ll train our model using the fit method with X_train and y_train which contain 70% of our dataset. This will be a binary classification model.

## fit a logistic regression model with the training dataset
log.model <- glm(default_payment ~., data = train, family = binomial(link = "logit"))

summary(log.model)

Call:
glm(formula = default_payment ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1171  -0.6998  -0.5473  -0.2946   3.4915  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.465097   0.019825 -73.900  < 2e-16 ***
LIMIT_BAL   -0.083475   0.023905  -3.492 0.000480 ***
SEX         -0.082986   0.017717  -4.684 2.81e-06 ***
EDUCATION   -0.059851   0.019178  -3.121 0.001803 ** 
MARRIAGE    -0.107322   0.018350  -5.849 4.95e-09 ***
PAY_0        0.661918   0.023605  28.041  < 2e-16 ***
PAY_2        0.069704   0.028842   2.417 0.015660 *  
PAY_3        0.090691   0.031982   2.836 0.004573 ** 
PAY_4        0.074336   0.034612   2.148 0.031738 *  
PAY_5        0.018469   0.036430   0.507 0.612178    
PAY_6        0.006314   0.030235   0.209 0.834584    
BILL_AMT1   -0.123582   0.023558  -5.246 1.56e-07 ***
PAY_AMT1    -0.136745   0.037549  -3.642 0.000271 ***
PAY_AMT2    -0.246634   0.056432  -4.370 1.24e-05 ***
PAY_AMT3    -0.014662   0.028012  -0.523 0.600677    
PAY_AMT4    -0.087782   0.031484  -2.788 0.005300 ** 
PAY_AMT5    -0.084533   0.030917  -2.734 0.006254 ** 
PAY_AMT6    -0.027355   0.025707  -1.064 0.287277    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22176  on 20999  degrees of freedom
Residual deviance: 19535  on 20982  degrees of freedom
AIC: 19571

Number of Fisher Scoring iterations: 6

7) Prediction

Below we’ll use the prediction method to find out the predictions made by our Logistic Regression method. We will first store the predicted results in our y_pred variable and print the first 10 rows of our test data set. Following this we will print the predicted values of the corresponding rows and the original labels that were stored in y_test for comparison.

test[1:10,]

## to predict using logistic regression model, probablilities obtained
log.predictions <- predict(log.model, test, type="response")

## Look at probability output
head(log.predictions, 10)

2: 0.539623162720197
7: 0.232835137994762
10: 0.25988780274953
11: 0.0556716133560243
15: 0.422481223473459
22: 0.165384552048511
25: 0.0494775267027534
26: 0.238225423596718
31: 0.248366972046479
37: 0.111907725985513

Below we are going to assign our labels with the decision rule that if the prediction is greater than 0.5, assign it 1 else 0.

log.prediction.rd <- ifelse(log.predictions > 0.5, 1, 0)
head(log.prediction.rd, 10)

2: 1
7: 0
10: 0
11: 0
15: 0
22: 0
25: 0
26: 0
31: 0
37: 0

Evaluation

We’ll now discuss a few evaluation metrics to measure the performance of our machine-learning model here. This part has significant relevance since it will allow us to understand the most important characteristics that led to our model development.

We will output the confusion matrix. It is a handy presentation of the accuracy of a model with two or more classes.

The table presents predictions on the x-axis and accuracy outcomes on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm.

According to an article the entries in the confusion matrix have the following meaning in the context of our study:

[[a b][c d]]

a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.

table(log.prediction.rd, test[,18])

                 
log.prediction.rd    0    1
                0 6832 1517
                1  170  481

We’ll write a simple function to print the accuracy below

accuracy <- table(log.prediction.rd, test[,18])
sum(diag(accuracy))/sum(accuracy)

0.812555555555556

Conclusion

This tutorial has given you a brief and concise overview of the Logistic Regression algorithm and all the steps involved in achieving better results from our model. This notebook has also highlighted a few methods related to Exploratory Data Analysis, Pre-processing, and Evaluation, however, there are several other methods that we would encourage you to explore on our blog or video tutorials.

If you want to take a deeper dive into several data science techniques. Join our 5-day hands-on Data Science Bootcamp preferred by working professionals, we cover the following topics:

Fundamentals of Data Mining
Machine Learning Fundamentals
Introduction to R
Introduction to Azure Machine Learning Studio
Data Exploration, Visualization, and Feature Engineering
Decision Tree Learning
Ensemble Methods: Bagging, Boosting, and Random Forest
Regression: Cost Functions, Gradient Descent, Regularization
Unsupervised Learning
Recommendation Systems
Metrics and Methods for Evaluating Predictive Models
Introduction to Online Experimentation and A/B Testing
Fundamentals of Big Data Engineering
Hadoop and Hive
Message Queues and Real-time Analytics
NoSQL Databases and HBase
Hack Project: Creating a Real-time IoT Pipeline
Naive Bayes
Logistic Regression
Times Series Forecasting

This post was originally sponsored on What’s The Big Data.

August 18, 2022

Statistics

LLM - Online Courses

Reviews

Consulting

Community

histograms

Syed Muhammad Mubashir Rizvi

Understanding Data Visualizations: A Beginner’s Guide

What are Data Visualizations?

Importance of Data Visualization

Choosing the Right Type of Visualization

Identify Purpose

Understanding Audience

Selecting the Appropriate Visual

Ethics of Data Visualization

Avoiding Misleading Representations

Types of Data Visualizations

Conclusion

Data Science Dojo Staff

Mastering Histograms: A Beginner’s Comprehensive Guide

Defining Histograms

Advantages of Histograms

Visual Representation

Easy Interpretation

Outlier Identification

Comparison of Data Sets

Data Summarization

Creating a Histogram Using Matplotlib Library

Customizations Available in Matplotlib for Histograms

Creating a Histogram Using ‘Seaborn’ Library

Customizations Available in Seaborn for Histograms

Limitations of Histograms

Wrapping Up

Guest Blog

Logistic regression in R: A classification technique to predict credit card default

Intro to Logistic Regression

The scenario

1) Importing libraries

2) Importing Datasets

3) Exploratory data analysis

4) Feature engineering

5) Pre-processing

6) Model development

7) Prediction

Evaluation

Conclusion

Related Topics

Training Programs

Enterprise

Community

About

 Advantages of Histograms

 Creating a Histogram Using Matplotlib Library

Customizations Available in Matplotlib for Histograms   

 Customizations Available in Seaborn for Histograms

 Wrapping Up