Descriptive Statistics

Data literacy and statistics are important parts of data analysis and knowledge development. There will be more discussion of that, but at some point a blog about data has to actually dive into some data. Some of you have also commented that you want to learn some statistics, so in this post we will get to work on a small data set of home prices. As we go through this exercise, please try to poke holes in what I am doing. There are some intentional holes that we will discuss in another post, but I would like you to find, and comment on, what can be done differently and better.

Data

The data we are going to work with are a compilation of house “zestimates” from www.zillow.com for 98 houses on six streets in a neighborhood in Reno, NV, in the 89509 ZIP code (Figure 1).

Figure 1 – The homes and estimates used to obtain housing data.

The streets are Oakhurst, Knox, Watt, Ordway, Ardmore, and Glenmanor. This is a very popular area known as South Midtown. If you are not familiar with “zestimates”, they are the home value estimates created by Zillow using its proprietary algorithm. The data were collected by viewing the area of interest on a Zillow map and recording the address, street, zestimate, number of bedrooms, number of baths, and square footage for each house. In total, data were collected for 98 houses. Data are available here. From here forward I will refer to the “zestimate” as the estimate.
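For readers who would rather follow along in Python than in a spreadsheet, here is a minimal sketch of loading the fields listed above. It assumes the data have been saved as a CSV file. The column names (`address`, `street`, `zestimate`, `bedrooms`, `baths`, `sqft`) and the function name `load_houses` are my own placeholders and may not match the headers in the shared file.

```python
import csv


def load_houses(path):
    """Read one dict per house from a CSV file.

    Assumes hypothetical column headers: address, street, zestimate,
    bedrooms, baths, sqft. Adjust them to match the real file.
    """
    houses = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            houses.append({
                "address": row["address"],
                "street": row["street"],
                # Strip "$" and "," so "$338,978" parses as 338978.0
                "zestimate": float(row["zestimate"].replace("$", "").replace(",", "")),
                "bedrooms": float(row["bedrooms"]),
                "baths": float(row["baths"]),
                "sqft": float(row["sqft"].replace(",", "")),
            })
    return houses
```

With a file saved under a hypothetical name such as `houses.csv`, the list of estimates would then be `[h["zestimate"] for h in load_houses("houses.csv")]`.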

Tools

I will use Microsoft Excel for the descriptive analysis since it is a widely used and widely available program. However, I do have my reservations about Excel. If you want to delve further into data analysis, I would recommend looking into R or into Python and its collection of libraries. Both are much richer for data manipulation and analysis. Python will be covered in future posts.

Analysis

Maximum, Minimum, and Range

First I will identify the minimum and maximum values of the estimate, which are $238,924 and $814,779, respectively. This gives a range (maximum estimate – minimum estimate) of $575,855. This is a pretty broad range of house prices for such a limited area. It is likely there are some significant differences between the homes in this data set.
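The same arithmetic can be sketched in a few lines of Python. The list below holds only the four estimates quoted in this post (the minimum, the maximum, and the two middle values), not the full 98-house data set, and the function name `estimate_range` is my own.

```python
# Only the four estimates quoted in this post, not the full 98-house data set.
sample_estimates = [238_924, 338_978, 339_401, 814_779]


def estimate_range(estimates):
    """Return (minimum, maximum, range) for a list of estimates."""
    low = min(estimates)
    high = max(estimates)
    return low, high, high - low


low, high, spread = estimate_range(sample_estimates)
print(f"min=${low:,}  max=${high:,}  range=${spread:,}")
# prints: min=$238,924  max=$814,779  range=$575,855
```

Because the sample includes the true minimum and maximum, the range matches the one computed from the full data.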

Frequency

Next, let’s look at the frequency, or the percentage, of houses that fall within given price ranges. I have created 10 equal-sized bins, or classes of house prices, starting at $230,000 and increasing by $60,000 until a maximum of $830,000 is reached. These bins encompass the prices of all 98 houses we are analyzing and show the distribution of the house prices. I have created a frequency table (Figure 2) showing the count and the percentage of houses within each price range, or bin, as well as a histogram (Figure 3) graphically showing the shape of the data.

Figure 2 – Frequency table for the home cost estimates.

Figure 3 – Histogram of house estimates.
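For anyone building the frequency table outside Excel, here is a rough sketch of the binning in Python. The bin start, width, and count come from the description above. Treating each bin as including its upper edge but not its lower edge is my reading of ranges such as $290,000.01 to $350,000.00, and the names `frequency_table` and `sample_estimates` are my own. The sample holds only the four estimates quoted in this post, so its percentages are not those of the real table.

```python
import math

BIN_START = 230_000   # lower edge of the first bin
BIN_WIDTH = 60_000    # width of each bin
BIN_COUNT = 10        # bins run up to 230,000 + 10 * 60,000 = 830,000


def frequency_table(estimates):
    """Count estimates per bin, treating each bin as (lower, upper]."""
    counts = [0] * BIN_COUNT
    for value in estimates:
        # ceil(...) - 1 puts a value sitting exactly on an upper edge
        # into the bin below it, e.g. 350,000 falls in 290,000-350,000.
        index = math.ceil((value - BIN_START) / BIN_WIDTH) - 1
        if not 0 <= index < BIN_COUNT:
            raise ValueError(f"{value} is outside the bin range")
        counts[index] += 1
    total = len(estimates)
    rows = []
    for i, count in enumerate(counts):
        lower = BIN_START + i * BIN_WIDTH
        rows.append((lower, lower + BIN_WIDTH, count, 100 * count / total))
    return rows


# Only the four estimates quoted in this post, not the full data set.
sample_estimates = [238_924, 338_978, 339_401, 814_779]
for lower, upper, count, percent in frequency_table(sample_estimates):
    print(f"${lower:>7,} to ${upper:>7,}: {count:>2} houses ({percent:.0f}%)")
```

Plotting the counts as a bar chart would give the histogram. I have left that out to keep the sketch free of third-party libraries.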

The histogram is right skewed: most of the estimates are bunched in the lower bins on the left, and a long tail stretches out to the right. That tail pulls the mean, or average, to the right of the peak instead of leaving it in the middle, as we would see with a symmetrical distribution. This right skew is driven largely by the outlier estimate of $814,779.

Mean, Median, and Mode

The mean and median are measures of central tendency. In other words, they provide an estimate of the typical value. The mean is described by the formula below (Figure 4), or more simply, by summing up the list of house estimates and dividing by the total number of houses.

 

mean = (x1 + x2 + … + xn) / n, where x1 through xn are the individual estimates and n is the number of houses
Figure 4 – Formula for mean

The mean housing estimate is $346,618.36. The median is the middle value of the data once they are sorted from lowest to highest. Since we have an even number of data points (98), we average the middle two. Our middle two estimates are $338,978 and $339,401. Therefore, the median is:

($338,978 + $339,401) / 2 = $339,189.50

The mode is the most frequently occurring value or estimate. There are no recurring values in our data so there is no mode.
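Here is a minimal Python sketch of the three measures using the standard library. It uses only the four estimates quoted in this post, not the full 98-house data set, so the mean it prints will not match the $346,618.36 reported above. The helper `modes` is my own. It returns an empty list when no value repeats, which is how "no mode" is treated in this post.

```python
from collections import Counter
from statistics import mean, median

# Only the four estimates quoted in this post, not the full 98-house data set,
# so the mean printed here will not match the $346,618.36 reported above.
sample_estimates = [238_924, 338_978, 339_401, 814_779]


def modes(values):
    """Return the values that occur most often, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value is unique, so there is no mode
    return [v for v, c in counts.items() if c == top]


print("mean:  ", mean(sample_estimates))
print("median:", median(sample_estimates))
print("mode:  ", modes(sample_estimates) or "none")
```

The median of this sample is 339189.5, the same as for the full data, because the sample keeps the two middle estimates with one value on either side of them.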

So, with this information we can deduce the following:

  •    The least expensive house estimate is $238,924.
  •    The most expensive house estimate is $814,779.
  •    70% of the houses are estimated to cost between $290,000.01 and $350,000.00.
  •    21% of the houses are estimated to cost between $350,000.01 and $410,000.00.
  •    The average estimated house price is $346,618.36.
  •    The median estimated house price is $339,189.50.

This was just a brief introduction to basic descriptive analysis, to show you what I do to get a feel for what is going on when I am given a data set. We will explore more factors in the next post but, in the meantime, play with the data I have made available to you. Send me a comment with what you find, or if you spot any omissions or biases in my analysis. Follow my blog or follow me on Twitter to get updates when I cover more statistical techniques or pontificate on data literacy.

Housing map courtesy of Zillow