Display the Distribution of Your Data Using the Histogram

Anantha Kollengode
08/09/2010

The histogram, one of the seven quality tools, is very effective in displaying the distribution of data. It’s particularly valuable when there are a large number of observations, as the tool quickly conveys the shape of the data distribution. The histogram presents tabulated frequencies as adjacent rectangles drawn over consecutive, non-overlapping intervals, with the height of each rectangle showing how many observations fall in that interval.

The histogram is depicted as a special bar chart where the individual data are grouped together into intervals or classes. The chart gives a quick overview of how frequently the data in each interval occur in the data set. The higher the bar of a certain interval, the greater the number of data points it has. Consequently, the histogram allows us to easily identify the location of the data and the variation in the data.

While the tool is very simple to use, a discussion is warranted on the width of the intervals (also referred to as bin widths). It is important to note that the width of the intervals, along with the starting point for the first interval, has a profound impact on the shape of the histogram. There are some excellent discussions on this topic (I’ll highlight them in a future article) and several rules of thumb to calculate the number of intervals or the bin width. These include Sturges’ formula (based on the base-2 logarithm of the number of data points); Scott’s rule (based on the standard deviation); the Rice rule (twice the cube root of the number of data points); and the Freedman-Diaconis choice (based on the interquartile range), among others.
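As a rough illustration of the two width-based rules named above, the sketch below computes a bin width using the commonly cited forms of Scott’s rule (3.49 × standard deviation × n^(-1/3)) and the Freedman-Diaconis choice (2 × interquartile range × n^(-1/3)). The function names and the sample values are hypothetical, and quartile conventions differ between tools, so the Freedman-Diaconis result may vary slightly from what other software reports.

```python
import statistics


def scott_bin_width(data):
    """Scott's rule: 3.49 * sample standard deviation * n^(-1/3)."""
    n = len(data)
    return 3.49 * statistics.stdev(data) * n ** (-1 / 3)


def freedman_diaconis_bin_width(data):
    """Freedman-Diaconis choice: 2 * interquartile range * n^(-1/3)."""
    n = len(data)
    # statistics.quantiles uses the "exclusive" method by default;
    # other tools may compute quartiles slightly differently.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * n ** (-1 / 3)


# Hypothetical sample, used only to exercise the functions.
sample = [12, 15, 18, 20, 21, 23, 25, 26, 28, 30, 31, 33, 36, 40, 55]
print(f"Scott width: {scott_bin_width(sample):.1f}")
print(f"Freedman-Diaconis width: {freedman_diaconis_bin_width(sample):.1f}")
```

Because the Freedman-Diaconis choice relies on the interquartile range rather than the standard deviation, it tends to be less affected by a few extreme values.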

For example, if we have to determine the number of intervals needed to represent 1,000 data points and we use Sturges’ formula (1 + log2 1,000, which is about 10.97, rounded up), 11 intervals will be produced. In comparison, the Rice rule (twice the cube root of 1,000) produces 20 intervals, and the square root formula (the square root of 1,000, rounded up) produces 32. It is a good idea to experiment with the different bin-width choices and select bin widths based on how well they communicate the shape of the data distribution.
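The arithmetic for 1,000 data points can be checked with a short sketch. It uses the standard definitions of the count-based rules (Sturges: 1 + log2 n; Rice: twice the cube root of n; square root: the square root of n), each rounded up to a whole number of bins. The function names are made up for this illustration.

```python
import math


def sturges_bins(n):
    """Sturges' formula: 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))


def rice_bins(n):
    """Rice rule: twice the cube root of n, rounded up."""
    return math.ceil(2 * n ** (1 / 3))


def square_root_bins(n):
    """Square root choice: square root of n, rounded up."""
    return math.ceil(math.sqrt(n))


n = 1000
for name, rule in [("Sturges", sturges_bins),
                   ("Rice", rice_bins),
                   ("Square root", square_root_bins)]:
    print(f"{name}: {rule(n)} bins")
```

For n = 1,000 this yields 11, 20 and 32 bins respectively, which shows how far apart the rules of thumb can be on the same data set.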

Steps for Creating a Histogram

To illustrate how to create a histogram, we will use patient wait times in an emergency room as the data to be depicted visually. Let’s assume that we collected the wait times in the ER over a period of time.

The first step is to create a frequency table (shown below), which works best with small data sets. Begin by constructing a table with three columns. In the first column, list the observed values (in this case, the patient wait times) in ascending order. In the second column, make a tally mark each time a value (wait time) occurs. In the third column, add up the tally marks to get the frequency.

Wait time (min)	Tally	Frequency
5	|	1
6	|	1
11	|	1
15	|	1
20	|	1
21	||	2
26	|	1
28	|	1
30	|||	3
35	|	1
42	|	1
45	|	1
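The same tally can be produced programmatically. The sketch below is one possible approach using Python’s standard library; the list of wait times is simply the table above expanded back into its 15 individual observations.

```python
from collections import Counter

# Wait times in minutes, expanded from the frequency table above.
wait_times = [5, 6, 11, 15, 20, 21, 21, 26, 28, 30, 30, 30, 35, 42, 45]

# Counter tallies how many times each distinct value occurs.
frequency = Counter(wait_times)

print("Wait time (min)\tTally\tFrequency")
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value}\t{'|' * count}\t{count}")
```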

For larger data sets, it will be difficult to set up a frequency table like the one above. It’s common to use software, such as Excel, to quickly organize the data. I must note that Excel does not automatically calculate the bin widths; they will need to be specified in order to construct a histogram using the program. (If you need help with this feature in Excel, please e-mail me.)
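For readers working outside Excel, the idea of specifying the bins by hand can be sketched in a few lines of Python. The starting point (0), bin width (10 minutes) and number of bins (5) are assumptions chosen for this illustration, and the function name is hypothetical.

```python
def bin_counts(data, start, width, n_bins):
    """Count values falling in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        # Values outside the specified bins are ignored in this sketch.
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# The 15 wait times (minutes) from the frequency table.
wait_times = [5, 6, 11, 15, 20, 21, 21, 26, 28, 30, 30, 30, 35, 42, 45]

start, width, n_bins = 0, 10, 5  # assumed bin specification
counts = bin_counts(wait_times, start, width, n_bins)

# Print a simple text histogram, one row per bin.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low:>2}-{low + width - 1:<2} min | {'#' * count}")
```

Changing `start` or `width` and rerunning is a quick way to see how much the bin specification alters the picture.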

Assessing the Shape of a Histogram

If a histogram is bell shaped, with the highest point in the middle and symmetrical slopes on either side, the data are likely to follow an approximately normal distribution (see figure 1). However, distributions can also be skewed, or flatter or more peaked than the normal bell curve.

A skewed histogram will have a long tail in one particular direction. A long right tail is referred to as positively skewed, and a long left tail as negatively skewed.

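Kurtosis is a measure of how heavy the tails of a distribution are relative to a normal distribution. Positive excess kurtosis indicates heavier tails and a sharper peak, while negative excess kurtosis indicates lighter tails and a flatter curve. A flatter, more spread-out histogram is a potential indicator that the process variation is wider than the customer specifications.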

Figure 1
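Skewness and kurtosis can also be estimated numerically rather than judged by eye. The sketch below uses the simple moment-based (population) definitions, which is one of several conventions; statistical packages often apply small-sample corrections, so their values may differ. Excess kurtosis subtracts 3 so that a normal distribution scores about zero. The function names are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)^k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: positive for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3 (about zero for normal data)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


# The 15 wait times (minutes) from the frequency table.
wait_times = [5, 6, 11, 15, 20, 21, 21, 26, 28, 30, 30, 30, 35, 42, 45]
print(f"Skewness: {skewness(wait_times):.2f}")
print(f"Excess kurtosis: {excess_kurtosis(wait_times):.2f}")
```

With only 15 observations these estimates are quite noisy, so they are best read alongside the histogram rather than in place of it.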

Histograms, Customer Specifications and Process Improvement

A histogram can prove useful in determining whether the output of a process, product or service falls outside of customer specifications. This is done by overlaying the specifications (the target and the upper and lower specification limits) on the histogram itself and seeing whether the data points fall within them.

Using our ER wait time example, the histogram in figure 2 shows the team’s goal of reducing wait times to 30 minutes or less. By visually depicting this specification on the histogram, the team can see that it has a way to go before achieving this objective, as the majority of the data points fall outside of a 30-minute wait time.

Figure 2
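The visual check against a specification can be backed up with a simple count. The sketch below computes the share of observations at or below an upper limit. The data behind figure 2 are not given in the text, so the wait times here are hypothetical, as is the function name.

```python
def share_within_limit(values, upper_limit):
    """Return the fraction of values at or below upper_limit."""
    within = sum(1 for v in values if v <= upper_limit)
    return within / len(values)


# Hypothetical wait times in minutes, for illustration only.
observed = [12, 25, 28, 33, 35, 38, 41, 44, 47, 52]

target = 30  # the team's goal from the example: 30 minutes or less
share = share_within_limit(observed, target)
print(f"{share:.0%} of waits meet the {target}-minute goal")
```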

Note of Caution

While the histogram is useful in providing information in a simple manner, we do need to be careful about a few things. For one, we need to be cautious with the number of columns or bins used in a histogram. The reason is that the number of bins impacts the shape of the histogram and thus could lead to wrong assumptions about the shape of the distribution.

To illustrate this point, let’s use the wait time example once again. In the histogram in figure 3, the data are now depicted using five bins instead of 11, which makes the distribution appear skewed. In comparison, the 11-bin histogram in figure 2 appears bimodal, a clear discrepancy from figure 3.

I recommend experimenting with the number of column intervals so that the resulting histogram is representative of the actual data.

Figure 3
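The effect of the bin count is easy to reproduce. The sketch below bins the same values into five and then 11 equal-width intervals spanning the minimum to the maximum. It uses the 15 wait times from the frequency table rather than the data behind figures 2 and 3, which are not given in the text, so the shapes will not match the figures exactly. The function name is hypothetical, and the code assumes the values are not all identical.

```python
def equal_width_counts(data, n_bins):
    """Count values in n_bins equal-width intervals from min(data) to max(data)."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins  # assumes high > low
    counts = [0] * n_bins
    for x in data:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# The 15 wait times (minutes) from the frequency table.
wait_times = [5, 6, 11, 15, 20, 21, 21, 26, 28, 30, 30, 30, 35, 42, 45]

for n_bins in (5, 11):
    counts = equal_width_counts(wait_times, n_bins)
    print(f"{n_bins} bins: {counts}")
```

Comparing the two printed lists side by side shows how gaps and secondary peaks that are visible with narrow bins can disappear when the bins are widened.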

Histograms can also mask the effect of variation, such as seasonality and other factors. For example, if the ER wait times are collected for a month and tabulated using the frequency table alone, we could miss any variations due to the day of the week, staffing levels, time of day, lack of access to preferred providers on weekends, etc. Therefore, a histogram should be used in a proper context and after observing data over a longer period of time to better understand the process.
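One way to look for this kind of hidden variation is to split the data by a factor such as day of the week before summarizing it. The sketch below groups wait times by day and prints the count and mean for each group. The records are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (day, wait time in minutes) records, for illustration only.
records = [
    ("Mon", 22), ("Mon", 28),
    ("Tue", 18), ("Tue", 25),
    ("Sat", 41), ("Sat", 47),
    ("Sun", 44), ("Sun", 52),
]

# Group the wait times by day of the week.
by_day = defaultdict(list)
for day, minutes in records:
    by_day[day].append(minutes)

for day, values in by_day.items():
    print(f"{day}: n={len(values)}, mean wait={mean(values):.1f} min")
```

Drawing a separate histogram for each group, or plotting the data in time order, would reveal differences that a single pooled histogram cannot show.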