Building programs with Python
Processing data files
Learning Objectives
- Write a script to open a data file and print out its contents.
- Perform some operations on strings to extract desired data from them.
- Understand the basics of how Python handles objects.
- Understand good practices of how and when to write a code comment.
So far we’ve seen how to use and manipulate variables, and how to use loops in a script to process strings. But let’s take a look at a more interesting use case - performing some temperature conversions on our CSV data file.
We’ll start out by looking at how to read the data file and print its contents in a script, and then modify our script to perform some conversions and output that. Along the way, we’ll see how we can make our code more understandable to others (as well as ourselves, when we might come back to it at a later date).
Printing out the contents of a data file
We first need to be able to read in our data from the sc_climate_data_10.csv file and, using a loop, print out each line. Let’s write another script called climate_analysis.py, and enter the following (see climate_analysis-1.py):
climate_data = open('../data/sc_climate_data_10.csv', 'r')
for line in climate_data:
    print(line)
Using open, we first specify the file we wish to open, and then how we want to use that file. If we wanted to open a file to write to, we would use 'w', but in this case we specify 'r' for reading.
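To see the two modes side by side, here is a minimal sketch that writes a line to a file and then reads it back. The scratch file example.csv and the variable names are made up for this illustration; they are not part of our climate scripts.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, used only for this illustration
scratch_path = os.path.join(tempfile.mkdtemp(), 'example.csv')

# 'w' opens the file for writing (creating it, or emptying it if it already exists)
out_file = open(scratch_path, 'w')
out_file.write('47.77,58.53\n')
out_file.close()

# 'r' opens the same file for reading
in_file = open(scratch_path, 'r')
contents = in_file.read()
in_file.close()

print(contents)
```

Calling close() when we have finished with a file is a good habit, since it tells Python we no longer need it.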
In general, we know that a loop will iterate over a collection and set a loop variable to be each item in that collection. When Python deals with files, it does something quite helpful in a loop. By specifying climate_data as our collection, it reads in a single line at a time from our data file, assigning it to our loop variable, line.
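We can try out this line-at-a-time behaviour without a data file. The sketch below uses io.StringIO from the standard library as a stand-in for an open file; the three lines of text are invented for the example.

```python
import io

# io.StringIO behaves like an open text file but keeps its contents in memory,
# so this sketch does not depend on any file on disk
fake_file = io.StringIO('first line\nsecond line\nthird line\n')

lines_seen = []
for line in fake_file:
    # repr() makes the newline character at the end of each line visible as \n
    print(repr(line))
    lines_seen.append(line)
```

Each pass through the loop receives one complete line, including the newline character at its end.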
We can run our code with:
$ python climate_analysis.py
And we get the following output:
# POINT_X,POINT_Y,Min_temp_Jul_F,Max_temp_jul_F,Rainfall_jul_inch

461196.8188,1198890.052,47.77,58.53,0.76

436196.8188,1191890.052,47.93,58.60,0.83

445196.8188,1168890.052,47.93,58.30,0.74

450196.8188,1144890.052,48.97,56.91,0.66

329196.8188,1034890.052,49.26,59.86,0.78

359196.8188,1017890.052,49.39,58.95,0.70

338196.8188,1011890.052,49.28,58.73,0.74

321196.8188,981890.0521,48.20,61.41,0.72

296196.8188,974890.0521,48.07,61.27,0.78

299196.8188,972890.0521,48.07,61.41,0.78
Hmmm… but that’s not really perfect: print adds a newline of its own, so the newline that already exists at the end of each line in our data file shows up as an extra blank line. We can remove it by stripping it out using rstrip, a method that works on strings. We can use it like:
print(line.rstrip())
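So what’s happening here? Python treats a string such as line as an object, and rstrip() is a method that belongs to that object. Calling line.rstrip() returns a new string with any trailing whitespace, including the newline, removed; line itself is left unchanged.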
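As a small sketch of what rstrip() does, we can apply it to a single string and compare the result with the original. The string here is one row copied from our data file, with the newline written out explicitly.

```python
line = '461196.8188,1198890.052,47.77,58.53,0.76\n'

# rstrip() returns a new string with trailing whitespace removed
stripped = line.rstrip()

print(repr(line))      # the original still ends with \n
print(repr(stripped))  # the new string has no trailing newline
```

Because rstrip() returns a new string rather than changing line, we need to use its result, either directly as in print(line.rstrip()) or by storing it in a variable.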
So, let’s try that out (see climate_analysis-2.py):
climate_data = open('../data/sc_climate_data_10.csv', 'r')
for line in climate_data:
    print(line.rstrip())
And now we get:
# POINT_X,POINT_Y,Min_temp_Jul_F,Max_temp_jul_F,Rainfall_jul_inch
461196.8188,1198890.052,47.77,58.53,0.76
436196.8188,1191890.052,47.93,58.60,0.83
445196.8188,1168890.052,47.93,58.30,0.74
450196.8188,1144890.052,48.97,56.91,0.66
329196.8188,1034890.052,49.26,59.86,0.78
359196.8188,1017890.052,49.39,58.95,0.70
338196.8188,1011890.052,49.28,58.73,0.74
321196.8188,981890.0521,48.20,61.41,0.72
296196.8188,974890.0521,48.07,61.27,0.78
299196.8188,972890.0521,48.07,61.41,0.78
Much better!
Selecting and printing out only part of the data
But we’re not being very discriminating with our data; we’re just blindly printing out everything. Let’s assume we need to process only the column that represents the maximum temperature for July, which is the 4th one. How do we extract it from the line of data?
As luck (or more likely, good design) would have it, there’s a handy string method called split() which can separate all the columns into a list.
We’ve seen how we can trim trailing newlines from strings with rstrip() acting on a string object. Well, we use split() in exactly the same way:
data = line.split(',')
Although in this case, we’re capturing the returned list from split() into a variable called data. We can access elements in that list as before.
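Here is a minimal sketch of split() on its own, using one row copied from our data file (with the newline already removed) so we can see the list it returns.

```python
line = '461196.8188,1198890.052,47.77,58.53,0.76'

# split(',') breaks the string at each comma and returns a list of strings
data = line.split(',')

print(data)
print(len(data))  # the number of columns in the row
print(data[0])    # the first column
```

Note that every element of the list is still a string, even though the values look like numbers.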
So, let’s change our code accordingly (see climate_analysis-3.py):
climate_data = open('../data/sc_climate_data_10.csv', 'r')
for line in climate_data:
    data = line.split(',')
    # print 4th column (max temperature); list indexes start at 0, so this is data[3]
    print('Max temperature', data[3])
Now, it’s important to remember that the column we want, the maximum temperature, is the 4th column. But in Python, list indexes start at 0, so in fact we need to obtain the value from data[3] and not data[4]. So, we have made a note to that effect in a comment.
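If we are ever unsure which index a column has, one way to check is to let Python do the counting. The sketch below is an optional aside rather than part of the lesson's scripts; it uses the header line from our data file, and the variable names are chosen just for this illustration.

```python
header = '# POINT_X,POINT_Y,Min_temp_Jul_F,Max_temp_jul_F,Rainfall_jul_inch'

column_names = header.split(',')

# enumerate() pairs each item with its index, counting from 0
for index, name in enumerate(column_names):
    print(index, name)

# index() looks up the position of a column by name, so we need not count by hand
max_temp_index = column_names.index('Max_temp_jul_F')
print(max_temp_index)
```

With the index confirmed as 3, we can go ahead and run climate_analysis-3.py as written above.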
And we get:
Max temperature Max_temp_jul_F
Max temperature 58.53
Max temperature 58.60
Max temperature 58.30
Max temperature 56.91
Max temperature 59.86
Max temperature 58.95
Max temperature 58.73
Max temperature 61.41
Max temperature 61.27
Max temperature 61.41
This perhaps isn’t what we want - the column header is also part of the output!