Data Organization in Spreadsheets

Exporting data

lesson home

Overview

Teaching: 10 min
Exercises: 0 min

Questions

Exporting data from spreadsheets

Objectives

Store spreadsheet data in universal file formats.

Export data from a spreadsheet to a .csv file.

Authors:Christie Bahlai, Aleksandra Pawlik
Contributors: Jennifer Bryan, Alexander Duryee, Jeffrey Hollister, Daisie Huang, Owen Jones, and Ben Marwick

Storing the data you’re going to work with for your analyses in Excel default file format (*.xls or *.xlsx - depending on the Excel version) is a bad idea. Why?

Because it is a proprietary format, and it is possible that in the future, technology won’t exist (or will become sufficiently rare) to make it inconvenient, if not impossible, to open the file.

Think about zipdisks. How many old theses in your lab are “backed up” and stored on zipdisks? Ever wanted to pull out the raw data from one of those? Exactly.
Other spreadsheet software may not be able to open files saved in a proprietary Excel format.
Different versions of Excel may handle data differently, leading to inconsistencies.
Finally, more journals and grant agencies are requiring you to deposit your data in a data repository, and most of them don’t accept Excel format. It needs to be in one of the formats discussed below.

As an example of inconsistencies in data storage, do you remember how we talked about how Excel stores dates earlier? It turns out that there are multiple defaults for different versions of the software, and you can switch between them all. So, say you’re compiling Excel-stored data from multiple sources. There’s dates in each file- Excel interprets them as their own internally consistent serial numbers. When you combine the data, Excel will take the serial number from the place you’re importing it from, and interpret it using the rule set for the version of Excel you’re using. Essentially, you could be adding errors to your data, and it wouldn’t necessarily be flagged by any data cleaning methods if your ranges overlap.

Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values or TSV) or comma-delimited (comma separated values or CSV). CSV files are plain text files where the columns are separated by commas, hence ‘comma separated values’ or CSV. The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including plain text editors like TextEdit or NotePad. Data in a CSV file can also be easily imported into other formats and environments, such as SQLite and R. We’re not tied to a certain version of a certain expensive program when we work with CSV files, so it’s a good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text formats like CSV easily, although they may give you a warning during the file export.

To save a file you have opened in Excel in *.csv format:

From the top menu select ‘File’ and ‘Save as’.
In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*.csv).
Double check the file name and the location where you want to save it and hit ‘Save’.

An important note for backwards compatibility: you can open CSV files in Excel!

Saving an Excel file to CSV

A Note on Cross-platform Operability

By default, most coding and statistical environments expect UNIX-style line endings (\n) as representing line breaks. However, Windows uses an alternate line ending signifier (\r\n) by default for legacy compatibility with Teletype-based systems. As such, when exporting to CSV using Excel, your data will look like this:

data1,data2\r\n1,2\r\n4,5\r\n…

which, upon passing into most environments (which split on \n), will parse as:

data1
data2\r
1
2\r
…

thus causing terrible things to happen to your data. For example, 2\r is not a valid integer, and thus will throw an error (if you’re lucky) when you attempt to operate on it in R or Python. Note that this happens on Excel for OSX as well as Windows, due to legacy Windows compatibility.

There are a handful of solutions for enforcing uniform UNIX-style line endings on your exported CSV files:

When exporting from Excel, save as a “Windows comma separated (.csv)” file
If you store your data file under version control (which you should be doing!) using Git, edit the .git/config file in your repository to automatically translate \r\n line endings into \n. Add the follwing to the file (see the detailed tutorial):
```
 [filter "cr"]
 		clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
 		smudge = tr '\\n' '\\r'` 
```

and then create a file .gitattributes that contains the line: *.csv filter=cr

Use dos2unix (available on OSX, *nix, and Cygwin) on local files to standardize line endings.

A note on R and `xls`

There are R packages that can read xls files (as well as Google spreadsheets). It is even possible to access different worksheets in the xls documents.

But

some of these only work on Windows
this equates to replacing a (simple but manual) export to csv with additional complexity/dependencies in the data analysis R code
data formatting best practice still apply
Is there really a good reason why csv (or similar) is not adequate?

Caveats on commas

In some datasets, the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. This is because the commas which are a part of the data values will be interpreted as delimiters.

If you are working with data that contains commas, you likely will need to use another delimiter when working in a spreadsheet. In this case, consider using tabs as your delimiter and working with TSV files. TSV files can be exported from spreadsheet programs in the same way as CSV files.

Key Points

Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

Exporting data from spreadsheets to formats like .csv or .tsv puts it in a format that can be used consistently by most programs.

previous episode

Data Organization in Spreadsheets

Exporting data

lesson home

Overview

A Note on Cross-platform Operability

A note on R and `xls`

Caveats on commas

Key Points

previous episode

lesson home

previous episode

Data Organization in Spreadsheets

Exporting data

lesson home

Overview

A Note on Cross-platform Operability

A note on R and xls

Caveats on commas

Key Points

previous episode

lesson home

A note on R and `xls`