Getting Started with Analysis – Excel

For some, one of the hurdles to beginning data analysis is which tool to use. I personally spent quite a bit of time going back and forth over whether or not to learn Python or R while ignoring a very familiar tool already at my fingertips – Excel. While it doesn’t have the cache or power of R or Python, or statistics packages like JMP,Minitab, or SAS, it is probably the most popular and widely available data analysis tool available and a great place to start learning data analysis. For those that don’t want to invest the time to learn a computer programming, Excel is an excellent tool for your basic data analysis needs.

excel_data_analExcel is the primary data analysis tool in most offices and academic environments and almost anyone familiar with a computer has had some experience with it. While it can take some time to master and learn all of its nuances, its familiarity, availability, and prominence greatly lower the bar to doing data analytics. Excel allows for easy created data visualizations and also lends itself to basic calculations and data cleaning functions. Basic statistical analyses are possible after loading the Data Analysis plug-in.

A great tutorial on how to get started with data analysis and to begin dabbling in data science can be found at Kaggle’s Getting Started with Excel: Kaggle’s Titanic Competition. This tutorial uses Excel to predict survivors from the Titanic using real passenger data. In this tutorial you will learn about splitting data into training sets and testing sets and will teach you how to use pivot tables; one of Excel’s most powerful features for summarizing data. The tutorial will also teach you how to use basic logical if statements to classify data as well as to make predictions as to who, and who didn’t, survive the Titanic disaster.

Despite its capabilities and popularity, Excel does have its limitations. It is memory intensive and cannot handle data sets with more than 1,048,576 rows by 16,384 columns. While this seems very large, you may eventually encounter data sets too large for Excel if you continue pursuing data analysis or data science. Other downsides of Excel include difficult analysis and model reproducibility, accuracy and precision errors, poor version control amongst users, copy – paste errors, and difficulty maintaining data types.

Even with these shortcomings, Excel is a great tool for exploratory data analyses and one – off models like the example in the Kaggle tutorial. If data file size, reproducibility, and long-term maintainability are concerns of yours, you will eventually want to learn Python or R, but don’t let the tool get in the way of learning and performing analysis and exploring data. Comment and let me know how you started your data analysis and data literacy journey. What tools do you use? Please leave your comments below and follow me on Twitter.