Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain. The lessons below use ecology data in R and serve as an introduction to R for participants with no programming experience.
A typical data science project follows a common workflow:
The overall aim is to understand and communicate findings from our data, but this is preceded by the fundamental tasks of importing, tidying, and often transforming the data. Transformation means, for example, selecting a subset of the data to work with, or calculating the mean of a set of observations.
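The two transformations mentioned above can be sketched in a few lines of base R. This is only an illustration: the table, its column names, and its values are made up here and are not taken from the lesson data.

```r
# Hypothetical observations; names and values are made up for illustration.
observations <- data.frame(
  species = c("A", "A", "B", "B", "B"),
  weight  = c(40, 44, 8, 7, 9)
)

# Select a subset of the data: only the rows for species "B".
species_b <- observations[observations$species == "B", ]

# Calculate the mean of a set of observations.
mean_weight_b <- mean(species_b$weight)
print(mean_weight_b)
```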
This entire process can be encompassed within a single programming language. Here we will be using R, but Python and other programming languages can achieve the same thing.
These lessons are designed to be taught in a day, and cover only a fraction of what is possible, but will help you take your first steps towards managing your own data projects.
The lessons start with an overview of R and the RStudio interface and introduce visualising data as the first tool for generating insights.
Data is rarely provided in the form we need, so the next set of tools we’ll learn about are those for transforming data.
Armed with these tools, we’ll tackle an ecology dataset from desert rodent surveys as a case study for importing CSV files, performing some basic transformations, and then visualising the data. By the end of this process we should be able to understand and communicate some general findings obtained from the dataset.
Concretely, using rodent survey data covering the period from 1977 to 1991, we will aim to understand the effect on the populations of small seed-eating rodents of excluding their larger competitors, kangaroo rats. In doing so we will perform tasks common to many data projects irrespective of subject area.
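The import, transform, and visualise steps described above can be sketched as follows. This is a rough outline under stated assumptions, not the lesson's own code: it uses only base R so that it runs without any extra packages, and the file contents, column names, and values are hypothetical stand-ins rather than the real survey data.

```r
# Stand-in for a survey file: hypothetical columns and values, written to a
# temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,site,count",
  "1977,north,12",
  "1977,south,15",
  "1978,north,10",
  "1978,south,21"
), csv_path)

# Import: read the CSV into a data frame.
surveys <- read.csv(csv_path)

# Transform: total count per site.
totals <- aggregate(count ~ site, data = surveys, FUN = sum)

# Visualise: bar chart of the totals.
barplot(totals$count, names.arg = as.character(totals$site),
        ylab = "Total count")
```

With a real file, the temporary CSV would be replaced by the path to the downloaded data.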
Note:
Much of the material in these lessons is derived from R for Data Science by Garrett Grolemund and Hadley Wickham. This book contains many more detailed examples and advanced concepts not covered here.
Ideally, all datasets should come with a codebook describing the contents, structure, and layout of a data collection. This is a text document containing information intended to be complete and self-explanatory for each variable in a data file. As this has not been generated for this data, we will use the description found in the Ecology Data Overview.
Data files for the lesson are available here: http://dx.doi.org/10.6084/m9.figshare.1314459
We will download the combined.csv file during the lesson.
Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow. These lessons assume no prior knowledge of the skills or tools, but working through this lesson requires working copies of the software described below. To most effectively use these materials, please make sure to download the data and install everything before working through this lesson.
R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio. After installing both programs, you will need to install the tidyverse package from within RStudio. Follow the instructions below for your operating system, and then follow the instructions to install tidyverse.
If you already have R and RStudio installed, check which version of R you are running: type sessionInfo() in the RStudio console, and its output will display the R version. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it. You may also want to consider removing your old version of R.
Windows: download R from the CRAN website and run the .exe file that was just downloaded.
macOS: on the CRAN website, select the .pkg file for the version of OS X that you have and the file will download.
Linux: for most distributions you could use your package manager (for Debian/Ubuntu, sudo apt-get install r-base, and for Fedora, sudo yum install R), but the versions provided by this approach are usually out of date. In any case, make sure you have at least R 3.3.1. Install the downloaded RStudio file (for example, with sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
After installing R and RStudio, you need to install the “tidyverse” packages. To do so, run the following in the RStudio console:
install.packages("tidyverse")
If you receive a message that the tidyverse package is not available, you will need to install the latest version of R from the CRAN website. See the notes above.
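To confirm the setup, a short check along these lines may help. It is a sketch rather than part of the official instructions: the variable names are illustrative, and the minimum version is the 3.3.1 figure mentioned in the installation notes.

```r
# Minimum R version named in the installation notes.
minimum_version <- "3.3.1"

# getRversion() returns the running R version as a comparable object.
r_is_recent_enough <- getRversion() >= minimum_version

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)

message("R version: ", as.character(getRversion()),
        if (r_is_recent_enough) " (OK)" else " (please update)")
message("tidyverse installed: ", tidyverse_installed)
```

If the last line reports FALSE, running install.packages("tidyverse") again after updating R is the step to retry.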
The list of contributors to this lesson is available here.
Data Carpentry, 2017.