3.4 Tidy Data
As researchers, we work with data from many different sources. Often these data are messy. One of the first steps of any analysis is to clean any available raw data so that you can make simple visualizations of the data, calculate simple summary statistics, look for missing or incorrect data, and eventually proceed with more involved analyses or modeling.
As part of the data cleaning process, we recommend getting all data into a “tidy” format (also known as “long” format, as opposed to “wide” format). According to the Tidy Data Guide by Garrett Grolemund and Hadley Wickham, tidy data is defined as:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
These three features of tidy data can be seen in the following figure, also from the Tidy Data Guide:
Once data are in this format, it makes subsequent visualization and analysis much easier. If you’ve ever worked with a file that has a separate column for each year (an example of “wide” format data), you know how hard that type of data format is to work with!
As always, we recommend keeping a backup copy of the raw data you obtained from the original source, and using a reproducible script for transforming these data into a tidy data format.
3.4.1 Recommended Resources
We highly recommend the chapter on Tidy Data from the book R for Data Science by Garrett Grolemund and Hadley Wickham. This guide is geared towards R users and provides helpful tips for transforming and working with data in R, but the concepts should be broadly applicable to other languages as well. For tips specific to Python, we recommend a blogpost by Jean-Nicholas Hould titled Tidy Data in Python.