Data Cleaning with OpenRefine

Credit: Data Carpentry

This lesson is based on Data Carpentry’s “Data Cleaning In OpenRefine” lessons from the ecology and social sciences curricula.

Most, if not all, real world data is messy. It includes errors, corruptions, inaccuracies and inconsistencies. Before data analysis can take place, data cleaning is needed to identify and correct errors, and make the data structure and formatting consistent. This process has the potential to radically change the data, so it must be completed with the same care and attention to reproducibility as the data analysis itself.

Consistent and transparent manipulation of data is not a task that is well suited to spreadsheets. Fortunately, we have OpenRefine (previously known as Google Refine): a powerful and free, open-source tool that was purpose-built to help clean messy data. Many people comment that using OpenRefine for data cleaning saves them literally months of work.

This lesson will teach you how to use OpenRefine to clean and format data, and will show you how it tracks the changes you make to automatically document the data cleaning process.

Getting Started

This lesson assumes no prior knowledge of the computational skills or tools apart from the basic understanding of tabular data stored in spreadsheets.

To get started, follow the directions in the “Setup” section to download the data we will use during the lesson.

Screenshots

Throughout this lesson we will make use and show content and screenshots from OpenRefine. As an actively developed software, OpenRefine is adding new features or changing visual elements. Screenshots in this lesson have been made with OpenRefine 3.3 and may have became out-of-date or not look identical as on your computer.

Schedule

Setup Download files required for the lesson
00:00 1. Introduction How is OpenRefine useful?
00:15 2. Opening and Exploring Data How can we import our data into OpenRefine?
How can we summarise our data?
How can we find errors in our data?
How can we edit data to fix errors?
How can we convert column data from one data type to another?
00:50 3. Transforming Data How can we transform our data to correct errors?
01:30 4. Filtering and Sorting Data How can we select only a subset of our data to work with?
How can we sort our data?
02:00 5. Exporting Data Cleaning Steps How can we document the data-cleaning steps we’ve applied to our data?
How can we apply these steps to additional data sets?
02:15 6. Exporting and Saving Data How can we save and export our cleaned data from OpenRefine?
02:30 7. Further Resources on OpenRefine What other resources are available for working with OpenRefine?
02:35 Finish