The Bash Shell

Shell Scripts

Learning Objectives

  • Write a shell script that runs a command or series of commands for a fixed set of files.
  • Run a shell script from the command line.
  • Write a shell script that operates on a set of files defined by the user on the command line.
  • Create pipelines that include user-written shell scripts.

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.

Our first shell script

Let’s start by going back to novice/shell/data and putting some commands into a new file called middle.sh using an editor like nano:

$ cd ~/2020-10-29-socobio-crs/novice/shell/data
$ nano middle.sh

So why the .sh extension to the filename? Adding .sh is the convention to show that this is a Bash shell script.

Enter the following line into our new file, then save it and exit nano (Control-O to save, then Control-X to exit):

head -15 sc_climate_data_1000.csv | tail -5

This pipe selects lines 11-15 of the file sc_climate_data_1000.csv. It selects the first 15 lines of that file using head, then passes that to tail to show us only the last 5 lines - hence lines 11-15. Remember, we are not running it as a command just yet: we are putting the commands in a file.
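To convince ourselves that head followed by tail really picks out lines 11-15, we can try the same pipeline on a file whose contents make the line numbers obvious. The sketch below assumes a hypothetical file called numbers.txt (not part of the lesson data), built with seq so that line N contains the number N:

```shell
# Build a hypothetical 20-line sample file: line N contains the number N.
seq 1 20 > numbers.txt

# Keep the first 15 lines, then the last 5 of those: lines 11 to 15.
head -15 numbers.txt | tail -5
```

This should print the numbers 11 through 15, one per line.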

Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:

$ bash middle.sh
299196.8188,972890.0521,48.07,61.41,0.78
324196.8188,972890.0521,48.20,-9999.00,0.72
274196.8188,968890.0521,47.86,60.94,0.83
275196.8188,968890.0521,47.86,61.27,0.83
248196.8188,961890.0521,46.22,58.98,1.43

Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.

Enabling our script to run on any file

What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than just retyping the command. Instead, let’s edit middle.sh and replace sc_climate_data_1000.csv with a special variable called $1:

$ nano middle.sh
head -15 "$1" | tail -5

Inside a shell script, $1 means the first filename (or other argument) passed to the script on the command line. We can now run our script like this:
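As a rough illustration of how $1 is filled in, the sketch below writes the same one-line script under a hypothetical name, middle_demo.sh, and runs it on two hypothetical sample files (numbers.txt and hundreds.txt, neither of which is part of the lesson data). Only the argument changes between the two runs:

```shell
# Hypothetical sample files: 20 numbered lines each.
seq 1 20 > numbers.txt
seq 101 120 > hundreds.txt

# Write the one-line script under a hypothetical name, middle_demo.sh.
# Quoting 'EOF' stops the shell from expanding $1 while writing the file.
cat > middle_demo.sh << 'EOF'
head -15 "$1" | tail -5
EOF

# Inside the script, $1 becomes numbers.txt on the first run
# and hundreds.txt on the second.
bash middle_demo.sh numbers.txt
bash middle_demo.sh hundreds.txt
```

The first run should print 11 through 15 and the second 111 through 115, showing that the script itself did not need to be edited.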

$ bash middle.sh sc_climate_data_1000.csv
299196.8188,972890.0521,48.07,61.41,0.78
324196.8188,972890.0521,48.20,-9999.00,0.72
274196.8188,968890.0521,47.86,60.94,0.83
275196.8188,968890.0521,47.86,61.27,0.83
248196.8188,961890.0521,46.22,58.98,1.43

or on a different file like this (our full data set!):

$ bash middle.sh sc_climate_data.csv
299196.8188,972890.0521,48.07,61.41,0.78
324196.8188,972890.0521,48.20,-9999.00,0.72
274196.8188,968890.0521,47.86,60.94,0.83
275196.8188,968890.0521,47.86,61.27,0.83
248196.8188,961890.0521,46.22,58.98,1.43

Note that the output is the same, since our full data set begins with the same lines as sc_climate_data_1000.csv.

Adding more arguments to our script

However, if we want to adjust the range of lines to extract, we still need to edit middle.sh each time. Less than ideal! Let’s fix that by using the special variables $2 and $3. These represent the second and third arguments passed on the command line:

$ nano middle.sh
head "$2" "$1" | tail "$3"

So now we can pass the head and tail line range arguments to our script:

$ bash middle.sh sc_climate_data_1000.csv -20 -5
252196.8188,961890.0521,46.22,60.94,1.43
152196.8188,960890.0521,48.81,-9999.00,1.08
148196.8188,959890.0521,48.81,59.43,1.08
325196.8188,957890.0521,48.20,61.36,0.72
326196.8188,957890.0521,47.44,61.36,0.80

This does work, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:

$ cat middle.sh
# Select lines from the middle of a file.
# Usage: middle.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"

In Bash, a comment starts with a # character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people understand and use scripts.

A line or two of documentation like this makes it much easier for other people (including your future self) to re-use your work. The only caveat is that each time you modify the script, you should check that its comments are still accurate: an explanation that sends the reader in the wrong direction is worse than none at all.
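The commented, three-argument version can be tried end to end on a file where the line numbers are easy to check. This is only a sketch: middle_demo.sh and numbers.txt are hypothetical names chosen so that the real middle.sh and the climate data are left untouched.

```shell
# Hypothetical sample file: 30 numbered lines, so line N contains N.
seq 1 30 > numbers.txt

# Write the commented script under a hypothetical name, middle_demo.sh.
cat > middle_demo.sh << 'EOF'
# Select lines from the middle of a file.
# Usage: bash middle_demo.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"
EOF

# $1 = numbers.txt, $2 = -20, $3 = -5: the last 5 of the first 20 lines.
bash middle_demo.sh numbers.txt -20 -5
```

With -20 and -5 the first line shown is 20 - 5 + 1 = 16, so the run should print 16 through 20. The comment lines in the script have no effect on the output.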

Processing multiple files

What if we want to process many files in a single pipeline? For example, if we want to sort our .csv files by length, we would type:

$ wc -l *.csv | sort -n

Here, wc -l lists the number of lines in each file (recall that wc stands for ‘word count’; the -l flag makes it count lines instead), and sort -n sorts its input numerically. We could put this in a file, but then it would only ever sort a list of .csv files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use $1, $2, and so on because we don’t know how many files there are. Instead, we use the special variable $@, which means “all of the command-line parameters to the shell script.” We should also put $@ inside double quotes to handle parameters that contain spaces ("$@" is equivalent to "$1" "$2" …).

Here’s an example. Create a new file called sorted.sh:

$ nano sorted.sh

And in that file enter:

wc -l "$@" | sort -n

When we run it with some wildcarded file arguments:

$ bash sorted.sh *.csv ../test_directory/creatures/*.dat

We have the following output:

      11 sc_climate_data_10.csv
     155 ../test_directory/creatures/minotaur.dat
     163 ../test_directory/creatures/basilisk.dat
     163 ../test_directory/creatures/unicorn.dat
    1001 sc_climate_data_1000.csv
 1048580 sc_climate_data.csv
 1050073 total

Again, we should explain what we are trying to do here using a comment, for example:

# List given files sorted by number of lines
wc -l "$@" | sort -n
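To see why the double quotes around $@ matter, we can run the same script on a filename that contains a space. The following is a minimal sketch using hypothetical names (sorted_demo.sh, "field notes.txt" and short.txt) that are not part of the lesson data:

```shell
# Two hypothetical files; one name contains a space.
printf 'a\nb\nc\n' > "field notes.txt"
printf 'a\n' > short.txt

# Write the script under a hypothetical name, sorted_demo.sh.
cat > sorted_demo.sh << 'EOF'
# List given files sorted by number of lines
wc -l "$@" | sort -n
EOF

# "$@" passes each argument through intact, including the embedded space.
bash sorted_demo.sh "field notes.txt" short.txt
```

The output should list short.txt (1 line), then field notes.txt (3 lines), then the total (4). With an unquoted $@, the shell would split the first name into field and notes.txt, and wc would complain that neither file exists.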

Exercises

Variables in shell scripts

In the test_directory/molecules directory, you have a shell script called script.sh containing the following commands:

head $2 $1
tail -n $3 $1

Note that here we use the explicit -n flag to tell tail how many lines to extract, since we’re passing in multiple .pdb files. Without it, tail can give an error about incorrect options on some machines.

While you are in the molecules directory, you type the following command:

bash script.sh '*.pdb' -1 -1

Which of the following outputs would you expect to see?

  1. All of the lines between the first and the last lines of each file ending in .pdb in the molecules directory
  2. The first and the last line of each file ending in .pdb in the molecules directory
  3. The first and the last line of each file in the molecules directory
  4. An error because of the quotes around *.pdb

Script reading comprehension

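Joel’s data directory contains three files: fructose.dat, glucose.dat, and sucrose.dat. Explain what a script called example.sh would do when run as bash example.sh *.dat if it contained each of the following sets of lines in turn (treat Script 1, Script 2, and Script 3 as three separate versions of example.sh):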

# Script 1
echo *.*
# Script 2
for filename in $1 $2 $3
do
    cat $filename
done
# Script 3
echo $@.dat
