Open Refine for Ecology

Working with OpenRefine

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • Getting started working with OpenRefine

Objectives
  • Create a new OpenRefine project from a CSV file.

  • Recall what facets are and how they are used to sort and summarize data.

  • Recall what clustering is and how it is applied to group and edit typos.

  • Manipulate data using previous steps with undo/redo.

  • Employ drop-downs to split values from one column into multiple columns.

  • Employ drop-downs to remove white spaces from cells.

Lesson

Creating a Project

Start the program. (Double-click on the openrefine.exe file (or google-refine.exe if using an older version). Java services will start on your machine, and OpenRefine will open in your Firefox browser).

Launch OpenRefine (see Getting Started with OpenRefine).

OpenRefine can import a variety of file types, including tab separated (*.tsv), comma separated (*.csv), Excel (*.xls, *.xlsx), JSON, XML, RDF as XML, Google Spreadsheets. See the OpenRefine Importers page for more information.

In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, We modified the Portal_rodents CSV file, adding several columns: scientificName, locality, county, state, country and generating several more columns in the lesson itself (JSON, decimalLatitude, decimalLongitude). Data in locality, county, country, JSON, decimalLatitude and decimalLongitude are contrived and are in no way related to the original dataset.

If you haven’t already, download the data from:
https://ndownloader.figshare.com/files/7823341

Once OpenRefine is launched in your browser, the left margin as options to Create Project, Open Project, or Import Project. Here we create a project, which just takes a couple steps:

  1. click Create Project to bring you to “Get data from” “This Computer”.
  2. Click Browse… and select file Portal_rodents_19772002_scinameUUIDs.csv. Click Open in the browse window to finish selection
  3. Click Next» under the browse button to upload data into OpenRefine.
  4. OpenRefine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click Update Preview (look in bottom panel). If this is the wrong file, click «Start Over (look in upper left region).
  5. If all looks well, click Create Project» (look in upper right corner).

Note that at step 1, you could upload data in a standard form from a web address by using “Get data from” “Web Addresses (URLs)”. However, this won’t work for all URLs.

Faceting

Exploring data by applying multiple filters

OpenRefine supports faceted browsing as a mechanism for

Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture on that column, and allows you to filter to some subset of rows for which their cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.

More on faceting

Exercise

Using faceting, find out how many years are represented in the census.

Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?

Which years have the most and least observations?

Cluster

In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings “New York” and “new york” are very likely to refer to the same concept and just have capitalization differences. Likewise, “Gödel” and “Godel” probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.

More on clustering

Split

If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the bits into their own columns.

Exercise

Try to change the name of the second new column to “species”. How can you correct the problem you encounter?

Undo / Redo

It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo and Redo operations to make this easy.

Trim Leading and Trailing Whitespace

Strings with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.

Key Points