Improved Data Analysis with Python

Excel is arguably the most popular data analysis tool in use today; it is feature-rich, powerful, and relatively easy to use. However, it often gets pushed past its limits and can quickly become an error-prone mess of nested formulas, which makes it hard to find mistakes and to follow or replicate others’ work. Thankfully, the Python programming language makes it easy to avoid some of these pitfalls. I switched to Python and the Pandas library several years ago and haven’t looked back. Combined with Jupyter notebooks, it has been a great toolset for data munging, analysis, visualization, and reporting.

Python

Python is a powerful, open-source, object-oriented programming language known for its readability, gentle learning curve, and rich libraries for data analysis. It is a general-purpose language that is well suited to quickly automating repetitive tasks, and it is reliable enough for technical organizations ranging from Google to NASA. Best of all, it runs on all major operating systems.

Reasons to use Python

There are several reasons to include Python in your toolkit for times when Excel, or another spreadsheet program, is inadequate. Some of those reasons are:

Readability

Python was designed to be easily read by humans, which makes it easy to locate errors in the code and relatively quick to learn and write.

Repeatability & Automation

Code-based analyses promote repeatability. Because the same code runs the same way every time, any mistake is systematic and occurs consistently, which makes it easy to trace and correct. It is also easy to automate common analyses and repeat them on multiple data sets.
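As a rough illustration of that last point, here is a minimal sketch that wraps an analysis in a function and runs it on several data sets. The `summarize` function, the `sales` column, and the numbers are all made up for this example; real data would more likely come from files.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.Series:
    """Return the total and mean of the (hypothetical) 'sales' column."""
    return pd.Series({"total": df["sales"].sum(), "mean": df["sales"].mean()})


# Two small made-up data sets standing in for, say, monthly exports.
datasets = {
    "january": pd.DataFrame({"sales": [100, 150, 250]}),
    "february": pd.DataFrame({"sales": [200, 300]}),
}

# The same analysis runs identically on every data set.
# Transposing puts one data set per row and one statistic per column.
report = pd.DataFrame({name: summarize(df) for name, df in datasets.items()}).T
print(report)
```

If there is a mistake in `summarize`, it shows up in every row of the report, and fixing it in one place fixes it everywhere.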

Transparency

Computer code is essentially a running record of all operations and commands performed on a data set. This makes it easy to audit and to understand what has transpired in the analytical workflow.

Data Munging

Python is very popular for its data munging (cleaning) capabilities. Data is rarely clean and organized when it is first received, and cleaning it is often said to take up as much as 80% of the time spent on an analysis. Python, and Pandas in particular, provides methods that make this work easier and repeatable.
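To give a flavor of what this looks like, the sketch below cleans a tiny made-up table with Pandas. The column names and the specific cleaning steps are assumptions chosen for illustration; real data will need its own steps.

```python
import pandas as pd

# Made-up raw data with typical problems: stray whitespace, inconsistent
# capitalization, a duplicate row, a missing name, and numbers stored as text.
raw = pd.DataFrame({
    "name": [" Alice", "bob ", "bob ", None],
    "amount": ["10", "20", "20", "n/a"],
})

clean = (
    raw
    .dropna(subset=["name"])  # drop rows with no name
    .assign(
        # trim whitespace and normalize capitalization
        name=lambda d: d["name"].str.strip().str.title(),
        # convert text to numbers; anything unparseable becomes NaN
        amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
    )
    .drop_duplicates()        # remove exact duplicate rows
    .reset_index(drop=True)
)
print(clean)
```

Each step is written down in order, so the same cleaning can be rerun on next month's file without redoing it by hand.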

Libraries

Python also has a very rich ecosystem of libraries to facilitate data analysis. If you are transitioning from Excel to Python you will want to use the following:

Pandas

Pandas is a library used for data analysis. Its core data structures are the dataframe and the series. A dataframe is a spreadsheet-like structure, similar to an Excel worksheet. Unlike Excel, the dataframe keeps data aligned by its row and column labels, which helps preserve data integrity. A series is a single column of data.
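Here is a small sketch of both structures, and of what alignment means in practice. The fruit data and the `discount` series are invented for the example.

```python
import pandas as pd

# A dataframe: rows and columns with labels, much like a worksheet.
df = pd.DataFrame(
    {"units": [3, 5, 2], "price": [2.0, 1.5, 4.0]},
    index=["apples", "pears", "plums"],
)

# Selecting one column gives a series.
units = df["units"]

# A second series in a different order, with one label missing.
discount = pd.Series({"plums": 0.5, "apples": 0.25})

# Values are matched by label, not by position. "pears" has no discount,
# so plain subtraction leaves NaN there instead of silently misaligning.
aligned = df["price"] - discount

# Treat a missing discount as zero when that is the intended meaning.
df["net_price"] = df["price"].sub(discount, fill_value=0)
print(df)
```

In a spreadsheet, subtracting two ranges that are sorted differently would quietly pair the wrong rows; here the labels do the matching.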

Numpy

NumPy is a library designed for numerical and scientific computing. It provides vectorized operations, which apply a calculation to a whole array or matrix at once and are typically much more efficient than looping over the values one at a time.
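A minimal sketch of vectorization, using made-up prices and quantities, with the equivalent plain-Python loop shown for comparison:

```python
import numpy as np

prices = np.array([2.0, 1.5, 4.0])
units = np.array([3, 5, 2])

# Vectorized: multiply element by element with no explicit loop.
revenue = prices * units
total = revenue.sum()

# The same calculation written as a plain Python loop, for comparison.
loop_total = sum(p * u for p, u in zip(prices.tolist(), units.tolist()))

print(revenue, total, loop_total)
```

On three values the difference is invisible, but on arrays with millions of entries the vectorized form is usually far faster.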

Seaborn

Seaborn is a Python data visualization library built on Matplotlib. It is very handy for quickly creating attractive, publication-quality statistical graphics.
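As a quick sketch, the snippet below draws a scatter plot from a small invented dataframe. The column names and values are made up, and the `Agg` backend line is only there so the example runs without a display; in a Jupyter notebook you can leave it out and the plot will appear inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Made-up data for illustration only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 74, 83],
    "group": ["a", "b", "a", "b", "a", "b"],
})

sns.set_theme(style="whitegrid")  # apply one of Seaborn's built-in styles

# Column names go straight into the plot call; Seaborn labels the axes
# and builds the legend for the "group" colors.
ax = sns.scatterplot(data=df, x="hours", y="score", hue="group")
ax.set_title("Score vs. hours (made-up data)")
```

To save the figure for a report, `ax.figure.savefig(...)` from Matplotlib writes it to an image file.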

Final Thoughts

Unlike Excel, Python is a programming language, and it will take some time to learn. It takes practice to become proficient, and at first you will be slower with your calculations in Python than you were in Excel. You may face some resistance from others in your organization if they are Excel users, but the effort will pay off in the end with improved data quality and greater reliability in your analyses. Python also allows for easier collaboration and work inheritance should someone have to pick up where you left off. And if you can’t convince your team to leave Excel, you can always export the results of your Python analyses to Excel for their use. Excel remains a valuable tool for a quick look at data and for one-off calculations. More often than not, though, there are better tools for analysis, and Python is one of the most approachable of them, with benefits that last over the long term.

Have you used Python for data analysis? Comment below or contact me on Twitter to tell me about your experience.