Additional Exercises
Overview
Teaching: 0 min
Exercises: 20 minQuestions
How can I build a data-processing pipeline?
Objectives
Construct a pipeline that takes input files and generates output.
How to separate input and output data files.
Use
dateto determine today’s date in a particular format.Use
cutto extract particular fields from a comma-separate value (CSV) data file.Extend an existing pipeline with an additional data processing stage.
Working with dates
Using the date command we’re able to retrieve today’s date and time in any format we want. For example:
$ date +“%d-%m-%y”
The argument after the command indicates the format we’d like - %d will be replaced with the day of the month, %m the month, and %y the year. So this will output today’s date in the form day, month and year, separated by hyphens, e.g. 16-11-20. We can capture this date output in a Bash variable using $(), for example in the shell:
$ today_date=$(date +“%d-%m-%y”)
Copying files with new filenames
Write a script in the
swc-shell-novice/directory that goes through each.csvfile in thedatadirectory (that resides in theswc-shell-novice/directory) and creates a copy of that file with today’s date at the start of the filename, e.g.16-11-20-sc_climate_data.csv.Hints:
- With the input data files we’re going to use held in the data directory, create a new directory which will be used to hold your output files. When writing scripts or programs that generate output from input, it is a good idea to separate where you keep your input and output files. This allows you to more easily distinguish between the two, makes it easier to understand, and retains your original input which is invaluable if your script has a fault and corrupts your input data. It also means that if you add an additional processing step that will work on the output data and generate new output, you can logically extend this by adding a new script and a new directory to hold that new output, and so on.
- You can use
basenamefollowed by a path to extract only the filename from a path, e.g.basename test_directory/notes.txtproducesnotes.txt.Solution
If we assume the output directory is named
copied:today_date=$(date +"%d-%m-%y") for file in data/*.csv do base_file=$(basename $file) cp $file copied/$today_date-$base_file done
Extracting columns from CSV
A really handy command to know when working with comma-separated value (CSV) data is the cut command. You can use the cut command to extract a specific column from CSV data, e.g.:
$ echo “1,2,3,4” | cut -d”,” -f 3
Will output 3.
The -d argument specifies, within quotes, the delimiter that separates the columns on each line, whilst the -f argument indicates the column (or columns) you wish to extract.
Filtering our output
Let’s extend our pipeline to extract a specific column of data from each of our newly copied files.
Create a new script that takes each new file we created in the earlier exercise and creates a new file (again, in another separate output directory), which contains only the data held in the
Max_temp_jul_Fcolumn.Solution
The
Max_temp_jul_Fcolumn is the fourth column in each data file If we assume the input directory is namedcopiedand the output directory is namedfiltered:for file in copied/*.csv do base_file=$(basename $file) cat $file | cut -d"," -f 4 > filtered/$base_file done
Key Points
dateprints the current date in a specified format.Scripts can save the output of a command to a variable using
$(command)
basenameremoves directories from a path to a file, leaving only the name
cutlets you select specific columns from files, with-d','letting you select the column separator, and-fletting you select the columns you want.