Data Cleaning with OpenRefine: Glossary

Key Points

Introduction
  • OpenRefine is a powerful, free, open-source tool for data cleaning.

  • OpenRefine automatically tracks the steps you take while working with your data and leaves your original data intact.

Opening and Exploring Data
  • Faceting can identify errors or outliers in data.
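To make the idea behind a text facet concrete, the sketch below counts how often each distinct value appears in a column, which is roughly the summary a text facet shows. It is a plain Python illustration, not OpenRefine's implementation; the function name `text_facet` and the sample values are hypothetical.

```python
from collections import Counter


def text_facet(values):
    """Count each distinct value in a column, most frequent first."""
    return Counter(values).most_common()


# Hypothetical column containing a misspelling and a case variant.
villages = ["Chirodzo", "Chirodzo", "Ruaca", "Chirdozo", "God", "God", "ruaca", "God"]

for value, count in text_facet(villages):
    # Values with very low counts are good candidates for typos or outliers.
    print(f"{count:>3}  {value}")
```

Values that occur only once next to a frequent near-identical value (here "Chirdozo" and "ruaca") are the kind of entry a facet makes easy to spot.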

Transforming Data
  • Clustering can identify values that are likely variants of the same entry and help us fix such errors in bulk.

  • GREL (General Refine Expression Language) is a powerful tool for transforming data.
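As a rough illustration of how clustering can group variant spellings, the sketch below uses a simplified key-collision approach: each value is reduced to a normalized key, and values that share a key form a cluster. It is loosely modeled on the idea of a fingerprint key and is not OpenRefine's exact algorithm; the function names and sample values are assumptions made for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified key: trim, lowercase, drop accents and punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Remove combining marks left over after decomposing accented characters.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical column with inconsistent spellings of the same name.
names = [
    "Dipodomys merriami",
    "dipodomys merriami ",
    "Merriami, Dipodomys",
    "Chaetodipus baileyi",
]

for group in cluster(names):
    print(group)
```

Each printed group is a set of values that could be merged into one chosen spelling in a single step, which is what makes clustering useful for bulk fixes.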

Filtering and Sorting Data
  • OpenRefine provides various ways to sort and filter data without affecting the raw data.

Exporting Data Cleaning Steps
  • OpenRefine tracks all changes (apart from sorting and edits to individual cells), and this record can be used as a script for future analyses or to reproduce an analysis.

  • Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.
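When publishing the cleaning steps alongside a dataset, a human-readable summary can be helpful. The sketch below assumes the steps were saved as a JSON list in which each step has an `op` field and a `description` field. The sample is hand-written and simplified for illustration, so the fields in a real export may differ; `summarize` is a hypothetical helper.

```python
import json

# Hand-written, simplified sample of an exported list of steps (an assumption,
# not copied from a real project).
SAMPLE_HISTORY = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column village using expression value.trim()",
    "columnName": "village",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column village to village_name",
    "oldColumnName": "village",
    "newColumnName": "village_name"
  }
]
"""


def summarize(history_json):
    """Return (operation, description) pairs in the order the steps were applied."""
    steps = json.loads(history_json)
    return [(step["op"], step.get("description", "")) for step in steps]


for number, (op, description) in enumerate(summarize(SAMPLE_HISTORY), start=1):
    print(f"{number}. [{op}] {description}")
```

A numbered list like this can sit next to the JSON file in a digital appendix so that readers can follow the cleaning steps without opening OpenRefine.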

Exporting and Saving Data
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Further Resources on OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.
