Mean, Median, and Mode Revisited

In the previous post, I calculated some descriptive statistics on home price estimates in the South Midtown neighborhood of Reno, NV. Two of the statistics I looked at were the average estimated home price and the median estimated home price, based on the Zestimate values from Zillow. Something stood out in the estimated prices that I did not discuss previously, and I will discuss it here. If you downloaded the data and worked with it, or if you analyzed the statistics closely, you may have caught the issue.

Mean Revisited

The mean, as mentioned in the previous post, is a measure of central tendency. It describes a typical value of a data set and is not necessarily an individual value from it. However, if the mean is not applied properly, it can result in an inaccurate interpretation of the data. The home estimate data set from the previous post is an instance where the average could be misleading. If you recall, the range of the estimates was $575,855, with a minimum price of $238,924 and a maximum price of $814,779. We are interested in the effect the maximum price has on the average price of $346,618.36. In the histogram below (Figure 1), I have circled the maximum price; it is very far removed from the rest of the prices.

Figure 1 – Histogram of House Price Estimates 

There are statistical methods for identifying outliers, but we did not need them here. This price was a real zinger! If we exclude that price, the average price of the data is $341,791.96. That is a difference of only $4,826.40 between the two averages. I expected a greater difference, with the outlier skewing the average price higher and giving an inaccurate representation of the housing prices, but that is not the case here. Still, this is something that should be considered whenever anyone discusses averages.

To illustrate the point I wanted to make with the house estimates, let’s look at some hypothesized salaries to see how an extreme value can change our interpretation of the data. This demonstrates the effect that the salary of one very wealthy person could have on the average salary of bar patrons if they were all there drinking together. Consider the following salaries:

[Image: table of hypothesized salaries]

As you can see, there is an extreme outlier salary of $3,000,000. The average salary is $138,867, which is not very representative of the salaries in the list. If we remove the $3,000,000 value, the average becomes $40,207. This looks more realistic.
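If you would like to try this yourself, here is a minimal Python sketch of the same idea. The salaries in it are made-up values for illustration, not the figures from this post, and the variable names are my own. It only shows how far a single extreme value can pull the mean.

```python
from statistics import mean

# Hypothetical salaries for illustration only (not the data from this post).
salaries = [30_000, 30_000, 32_000, 35_000, 36_000, 40_000, 45_000, 52_000]
outlier = 3_000_000

# Mean of the ordinary salaries, then the mean after one extreme value joins.
mean_without = mean(salaries)
mean_with = mean(salaries + [outlier])

print(f"Mean without outlier: ${mean_without:,.2f}")
print(f"Mean with outlier:    ${mean_with:,.2f}")
```

With these made-up numbers, one extra salary moves the mean from the mid five figures to well into six figures, even though eight of the nine people earn $52,000 or less.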

Median Revisited

As mentioned previously, the median is the middle value; half of the data lie below it and half lie above it. It is a good representation of the typical value of a data set, and it is an actual value from the data set if there is an odd number of values. If there is an even number, it is the mean of the middle two values. The great thing about the median is that it is robust to outliers, like the $3,000,000 salary. The median for the salary data set containing the $3,000,000 salary is $36,000. If the outlier salary is removed, the median is $35,000. In this instance, the median provides a better description of the typical salary than the mean does.
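The odd and even cases, and the robustness to an outlier, can be sketched in Python as follows. Again, the salaries are hypothetical values I made up for illustration, not the ones from this post.

```python
from statistics import median

# Hypothetical salaries for illustration only (not the data from this post).
salaries = [30_000, 30_000, 32_000, 35_000, 36_000, 40_000, 45_000, 52_000]
outlier = 3_000_000

# Even count (8 values): the median is the mean of the two middle values.
median_without = median(salaries)

# Odd count (9 values): the median is the middle value itself.
median_with = median(salaries + [outlier])

print(f"Median without outlier: ${median_without:,.2f}")
print(f"Median with outlier:    ${median_with:,.2f}")
```

In this sketch the median moves by only a few hundred dollars when the $3,000,000 value is added, because the median depends on the order of the values and not on how large the largest one is.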

Mode Revisited

The housing estimate data set did not have a mode, since none of the housing prices were repeated. However, the hypothesized salaries do have a mode. If you recall, the mode is the most frequently occurring value in a data set. In our salary data, the mode is $30,000. The mode is not always useful, but it does help complete the picture of a data set.
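Here is one way the "no mode when nothing repeats" rule could be sketched in Python. The function name find_modes and both data lists are hypothetical and for illustration only. I count values by hand with Counter because Python's built-in statistics.mode (in version 3.8 and later) returns the first value when nothing repeats, which does not match the definition used in this post.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []  # no data, so no mode
    highest = max(counts.values())
    if highest == 1:
        return []  # every value is unique, so there is no mode
    return [value for value, count in counts.items() if count == highest]

# Hypothetical data for illustration only (not the data from this post).
salaries = [30_000, 30_000, 32_000, 35_000, 36_000, 40_000, 45_000, 52_000]
unique_prices = [250_000, 310_000, 345_000, 420_000]

print(find_modes(salaries))       # the repeated salary
print(find_modes(unique_prices))  # empty list: no price repeats
```

Returning a list also covers the case where two or more values tie for most frequent.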

So, when you are given summary statistics, it is always a good idea to make sure that, in addition to the mean (or average), the median, mode, range, minimum, and maximum are also provided. If most of the data are similar in value, then the mean can provide a good understanding of a typical value of the data set. If there are extreme outliers, then the median may provide a more robust picture. Leave a comment and let me know if you have come across misuse of the mean and how this knowledge can help you.

All images property of Chadwick Spencer