fbpx
Learn to build large language model applications: vector databases, langchain, fine tuning and prompt engineering. Learn more

Mastering histograms: A beginner’s comprehensive guide

Data Science Dojo
Safia Faiz

May 23

Researchers, statisticians, and data analysts rely on histograms to gain insights into data distributions, identify patterns, and detect outliers. Data scientists and machine learning practitioners use histograms as part of exploratory data analysis and feature engineering. Overall, anyone working with numerical data and seeking to gain a deeper understanding of data distributions can benefit from information on histograms.

Defining histograms

A histogram is a type of graphical representation of data that shows the distribution of numerical values. It consists of a set of vertical bars, where each bar represents a range of values, and the height of the bar indicates the frequency or count of data points falling within that range.   

Histograms
Histograms

Histograms are commonly used in statistics and data analysis to visualize the shape of a data set and to identify patterns, such as the presence of outliers or skewness. They are also useful for comparing the distribution of different data sets or for identifying trends over time. 

The picture above shows how 1000 random data points from a normal distribution with a mean of 0 and standard deviation of 1 are plotted in a histogram with 30 bins and black edges.  

Advantages of histograms

  • Visual Representation: Histograms provide a visual representation of the distribution of data, enabling us to observe patterns, trends, and anomalies that may not be apparent in raw data.
  • Easy Interpretation: Histograms are easy to interpret, even for non-experts, as they utilize a simple bar chart format that displays the frequency or proportion of data points in each bin.
  • Outlier Identification: Histograms are useful for identifying outliers or extreme values, as they appear as individual bars that significantly deviate from the rest of the bars.
  • Comparison of Data Sets: Histograms facilitate the comparison of distribution between different data sets, enabling us to identify similarities or differences in their patterns.
  • Data Summarization: Histograms are effective for summarizing large amounts of data by condensing the information into a few key features, such as the shape, center, and spread of the distribution.

Creating a histogram using Matplotlib library

We can create histograms using Matplotlib by following a series of steps. Following the import statements of the libraries, the code generates a set of 1000 random data points from a normal distribution with a mean of 0 and standard deviation of 1, using the `numpy.random.normal()` function. 

  1. The plt.hist() function in Python is a powerful tool for creating histograms. By providing the data, number of bins, bar color, and edge color as input, this function generates a histogram plot.
  2. To enhance the visualization, the xlabel(), ylabel(), and title() functions are utilized to add labels to the x and y axes, as well as a title to the plot.
  3. Finally, the show() function is employed to display the histogram on the screen, allowing for detailed analysis and interpretation.

Overall, this code generates a histogram plot of a set of random data points from a normal distribution, with 30 bins, blue bars, black edges, labeled axes, and a title. The histogram shows the frequency distribution of the data, with a bell-shaped curve indicating the normal distribution.  

Customizations available in Matplotlib for histograms  

In Matplotlib, there are several customizations available for histograms. These include:

  1. Adjusting the number of bins.
  2. Changing the color of the bars.
  3. Changing the opacity of the bars.
  4. Changing the edge color of the bars.
  5. Adding a grid to the plot.
  6. Adding labels and a title to the plot.
  7. Adding a cumulative density function (CDF) line.
  8. Changing the range of the x-axis.
  9. Adding a rug plot.

Now, let’s see all the customizations being implemented in a single example code snippet: 

In this example, the histogram is customized in the following ways: 

  • The number of bins is set to `20` using the `bins` parameter.
  • The transparency of the bars is set to `0.5` using the `alpha` parameter.
  • The edge color of the bars is set to `black` using the `edgecolor` parameter.
  • The color of the bars is set to `green` using the `color` parameter.
  • The range of the x-axis is set to `(-3, 3)` using the `range` parameter.
  • The y-axis is normalized to show density using the `density` parameter.
  • Labels and a title are added to the plot using the `xlabel()`, `ylabel()`, and `title()` functions.
  • A grid is added to the plot using the `grid` function.
  • A cumulative density function (CDF) line is added to the plot using the `cumulative` parameter and `histtype=’step’`.
  • A rug plot showing individual data points is added to the plot using the `plot` function.

Creating a histogram using ‘Seaborn’ library: 

We can create histograms using Seaborn by following the steps: 

  • First and foremost, importing the libraries: `NumPy`, `Seaborn`, `Matplotlib`, and `Pandas`. After importing the libraries, a toy dataset is created using `pd.DataFrame()` of 1000 samples that are drawn from a normal distribution with mean 0 and standard deviation 1 using NumPy’s `random.normal()` function. 
  • We use Seaborn’s `histplot()` function to plot a histogram of the ‘data’ column of the DataFrame with `20` bins and a `blue` color. 
  • The plot is customized by adding labels, and a title, and changing the style to a white grid using the `set_style()` function. 
  • Finally, we display the plot using the `show()` function from matplotlib. 

  

Overall, this code snippet demonstrates how to use Seaborn to plot a histogram of a dataset and customize the appearance of the plot quickly and easily. 

Customizations available in Seaborn for histograms

Following is a list of the customizations available for Histograms in Seaborn: 

  1. Change the number of bins.
  2. Change the color of the bars.
  3. Change the color of the edges of the bars.
  4. Overlay a density plot on the histogram.
  5. Change the bandwidth of the density plot.
  6. Change the type of histogram to cumulative.
  7. Change the orientation of the histogram to horizontal.
  8. Change the scale of the y-axis to logarithmic.

Now, let’s see all these customizations being implemented here as well, in a single example code snippet: 

In this example, we have done the following customizations:

  1. Set the number of bins to `20`.
  2. Set the color of the bars to `green`.
  3. Set the `edgecolor` of the bars to `black`.
  4. Added a density plot overlaid on top of the histogram using the `kde` parameter set to `True`.
  5. Set the bandwidth of the density plot to `0.5` using the `kde_kws` parameter.
  6. Set the histogram to be cumulative using the `cumulative` parameter.
  7. Set the y-axis scale to logarithmic using the `log_scale` parameter.
  8. Set the title of the plot to ‘Customized Histogram’.
  9. Set the x-axis label to ‘Values’.
  10. Set the y-axis label to ‘Frequency’.

Limitations of Histograms: 

Histograms are widely used for visualizing the distribution of data, but they also have limitations that should be considered when interpreting them. These limitations are jotted down below: 

  1. They can be sensitive to the choice of bin size or the number of bins, which can affect the interpretation of the distribution. Choosing too few bins can result in a loss of information while choosing too many bins can create artificial patterns and noise.
  2. They can be influenced by outliers, which can skew the distribution or make it difficult to see patterns in the data.
  3. They are typically univariate and cannot capture relationships between multiple variables or dimensions of data.
  4. Histograms assume that the data is continuous and does not work well with categorical data or data with large gaps between values.
  5. They can be affected by the choice of starting and ending points, which can affect the interpretation of the distribution.
  6. They do not provide information on the shape of the distribution beyond the binning intervals.

 It’s important to consider these limitations when using histograms and to use them in conjunction with other visualization techniques to gain a more complete understanding of the data. 

 Wrapping up

In conclusion, histograms are powerful tools for visualizing the distribution of data. They provide valuable insights into the shape, patterns, and outliers present in a dataset. With their simplicity and effectiveness, histograms offer a convenient way to summarize and interpret large amounts of data.

By customizing various aspects such as the number of bins, colors, and labels, you can tailor the histogram to your specific needs and effectively communicate your findings. So, embrace the power of histograms and unlock a deeper understanding of your data.

DSD Sign
Written by Safia Faiz

I am a Analytics Content Engineer Intern at Data Science Dojo. I'm really passionate about spreading and collecting knowledge on different data science concepts

Interested in writing for us? Apply here: Submit your guest post with us
Newsletters | Data Science Dojo
Up for a Weekly Dose of Data Science?

Subscribe to our weekly newsletter & stay up-to-date with current data science news, blogs, and resources.

Data Science Dojo | data science for everyone

Discover more from Data Science Dojo

Subscribe to get the latest updates on AI, Data Science, LLMs, and Machine Learning.