Bowman’s Website

August 29, 2010

Statistics Notes — Measures of Central Tendency

Filed under: Statistics — Tags: — bowman @ 11:01 pm

In mathematics, central tendency of a data set refers to an average.
Measures of central tendency are measures of the location of the middle or the center of a distribution. The definition of “middle” or “center” is purposely left somewhat vague so that the term “central tendency” can refer to several forms of measures.
The term central tendency refers to a typical value of the data, and is measured using the mean, median, or mode. Each of these measures is calculated differently, and the one that is best to use depends upon the situation.

.

The mean is the most commonly used measure of central tendency. When we talk about an “average”, we usually are referring to the mean. The mean is simply the sum of the values divided by the number of values in the set.

The mean is valid only for interval data or ratio data. Since it uses the values of all of the data points in the population or sample, the mean is influenced by outliers that may be at the extremes of the data set.

An outlier is a data entry that is far removed from the other entries in the data set.  It is an unusually large or an unusually small value compared to the others.  Outliers are infrequent observations.
An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics — in particular, the mean.
If an outlier is a genuine result, it is important because it might indicate an extreme of behavior of the process under study.  For this reason, all outliers must be examined carefully before embarking on any formal analysis.  Outliers should not routinely be removed without further justification.

.

The median is determined by sorting the data set from lowest to highest values and taking the data point in the middle of the sequence. There is an equal number of points above and below the median. For example, in the data set {1,2,3,4,5} the median is 3; there are two data points greater than this value and two data points less than this value. In this case, the median is equal to the mean. But consider the data set {1,2,3,4,10}. In this data set, the median still is three, but the mean is equal to 4. If there is an even number of data points in the set, then there is no single point at the middle and the median is calculated by taking the mean of the two middle points.
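The two data sets above can be checked with a short Python sketch that uses the standard library's `statistics` module. The third data set, {1,2,3,4}, is not from these notes; it is an assumed example added only to show the even-count case.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]    # data set from the notes
skewed = [1, 2, 3, 4, 10]      # data set from the notes, with one large value
even_count = [1, 2, 3, 4]      # assumed example: an even number of points

for data in (symmetric, skewed, even_count):
    # median() sorts the data itself and averages the two middle
    # points when the number of points is even
    print(data, "mean =", mean(data), "median =", median(data))
```

For the first set the mean and median are both 3. For the second the median stays at 3 while the mean rises to 4. For the assumed third set the median is the mean of the two middle points, 2 and 3, which is 2.5.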

The median can be determined for ordinal data as well as interval and ratio data. Unlike the mean, the median is largely unaffected by outliers at the extremes of the data set. For this reason, the median often is used when there are a few extreme values that could greatly influence the mean and distort what might be considered typical. This often is the case with home prices and with income data for a group of people, which tend to be very skewed. For such data, the median often is reported instead of the mean. For example, in a group of people, if the salary of one person is 10 times the mean, the mean salary of the group is pulled upward by that unusually large salary. In this case, the median may better represent the typical salary level of the group.

.

The mode is the most frequently occurring value in the data set. For example, in the data set {1,2,3,4,4}, the mode is equal to 4. A data set can have more than a single mode, in which case it is multimodal. In the data set {1,1,2,3,3} there are two modes: 1 and 3.
If no entry is repeated, the data set has no mode.  If two entries occur with the same greatest frequency, each entry is a mode and the data set is called bimodal.
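A minimal Python sketch of these rules follows. The helper name `modes` is made up for illustration, and the third data set is an assumed example. `statistics.multimode` (Python 3.8 or later) returns every value tied for the highest frequency, so the sketch adds one check for the convention used here that a data set with no repeated entry has no mode.

```python
from statistics import multimode

def modes(data):
    """Return a list of the modes of data; empty if no entry is repeated."""
    data = list(data)
    # Convention from the notes: if no entry is repeated, there is no mode.
    if len(set(data)) == len(data):
        return []
    # multimode returns all values that share the greatest frequency,
    # in the order they first appear in the data.
    return multimode(data)

print(modes([1, 2, 3, 4, 4]))   # one mode
print(modes([1, 1, 2, 3, 3]))   # two modes (bimodal)
print(modes([1, 2, 3, 4, 5]))   # assumed example: no repeated entry, no mode
```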

The mode can be very useful for dealing with nominal data. For example, if a sandwich shop sells 10 different types of sandwiches, the mode would represent the most popular sandwich. The mode also can be used with ordinal, interval, and ratio data. However, in interval and ratio scales, the data may be spread thinly with no data points having the same value. In such cases, the mode may not exist or may not be very meaningful.

.

The sample mean is the most commonly used measure of central tendency for inferential statistics.

While the mean, the median, and the mode each describe a typical entry of a data set, there are advantages and disadvantages of using each, especially when the data set contains outliers.

The following table summarizes the appropriate methods of determining the middle or typical value of a data set based on the measurement scale of the data.

Because the median considers only the order of values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center.
To choose between the mean and median, start by looking at the data.  If the histogram is symmetric and there are no outliers, use the mean.
However, if the histogram is skewed or has outliers, you are better off with the median.

.

Example: An instructor recorded the number of absences for his students in one semester. For a random sample, the data are:
2    4    2    0    40    2    4    3    6

If there are any clear outliers and you are reporting the mean, report it with the outliers present and with the outliers removed. The differences may be quite revealing.
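As a rough sketch of that advice applied to the absence data, the Python below computes the three measures with and without the value 40. Treating 40 as the outlier is an assumption based on how far it sits from the other entries.

```python
from statistics import mean, median, mode

absences = [2, 4, 2, 0, 40, 2, 4, 3, 6]
# Assumption: 40 is the outlier, since every other entry is 6 or less.
without_outlier = [x for x in absences if x != 40]

for label, data in (("with outlier", absences),
                    ("without outlier", without_outlier)):
    print(label, "mean =", mean(data),
          "median =", median(data), "mode =", mode(data))
```

With the outlier present the mean is 7, which is larger than every entry except the outlier itself. Without it the mean falls to 2.875. The median moves only from 3 to 2.5, and the mode stays at 2.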

.

.

The weighted mean is similar to an arithmetic mean, except that instead of each data point contributing equally to the final average, some data points contribute more than others.

If all the weights are equal, then the weighted mean is the same as the arithmetic mean.

Example: Given two school classes, one with 20 students, and one with 30 students, the grades in each class on a test were:

Morning class = 62, 67, 71, 74, 76, 77, 78, 79, 79, 80, 80, 81, 81, 82, 83, 84, 86, 89, 93, 98

Afternoon class = 81, 82, 83, 84, 85, 86, 87, 87, 88, 88, 89, 89, 89, 90, 90, 90, 90, 91, 91, 91, 92, 92, 93, 93, 94, 95, 96, 97, 98, 99

The average for just the morning class is 80 and the average for just the afternoon class is 90. The average of these two averages, 80 and 90, is 85, the mean of the two class means. However, this does not account for the difference in number of students in each class, and the value of 85 does not reflect the average student grade (independent of class). The average student grade can be obtained by either averaging all the numbers without regard to classes, or weighting the class means by the number of students in each class:

The sum of the 50 entries in the data set is 4300.  Thus the average is 4300 divided by 50, which is 86.

or…

Multiply each class mean by its weight (the number of students in that class), add the products, and divide by the sum of the weights: (20 × 80 + 30 × 90) / (20 + 30) = 4300 / 50 = 86.

The weighted mean makes it possible to find the average student grade in the case where only the class means and the number of students in each class are available.
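A minimal Python sketch of that calculation, using only the class means and class sizes from the example. The function name `weighted_mean` is chosen here for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

class_means = [80, 90]   # morning and afternoon class means
class_sizes = [20, 30]   # number of students in each class (the weights)

print(weighted_mean(class_means, class_sizes))  # 86.0
print(weighted_mean(class_means, [1, 1]))       # 85.0, equal weights
```

The second call shows that with equal weights the weighted mean reduces to the plain arithmetic mean of the two class means.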

.

The Mean of a Frequency Distribution
Sometimes we are given a chart showing frequencies of certain groups instead of the actual values. We can still come up with a good estimate of a typical value for the set of data, provided that we make some assumptions. We assume that the values in each class or group are spread evenly throughout the group. If this is the case, then the mean for each class should be approximately equal to the midpoint of that class. So for each class, we have a mean and a number of values. We can now use the frequency of each class as its weight. If we multiply each midpoint by its frequency, add up the products, and then divide by the total number of values in the frequency distribution, we have an estimate of the mean.
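The procedure can be sketched in Python as below. The frequency distribution is hypothetical, made up for illustration and not taken from these notes, and the function name `grouped_mean` is likewise an assumption.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower_limit, upper_limit, frequency) classes."""
    total = sum(freq for _, _, freq in classes)
    if total == 0:
        raise ValueError("total frequency must not be zero")
    # Each class is represented by its midpoint, weighted by its frequency.
    weighted_sum = sum(((low + high) / 2) * freq for low, high, freq in classes)
    return weighted_sum / total

# Hypothetical test-score classes: (lower limit, upper limit, frequency)
scores = [(60, 69, 4), (70, 79, 8), (80, 89, 6), (90, 99, 2)]

print(grouped_mean(scores))  # 77.5
```

Here the midpoints are 64.5, 74.5, 84.5 and 94.5, and the products with the frequencies sum to 1550. Dividing by the 20 values gives the estimate 77.5. This is only an estimate, because the actual values within each class are unknown.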

Example.
