Chapter 4 Our World in Data (parsing)

4.1 Practice parsing

Let’s practice with a small file. Make sure you are in a screen.

  1. Make a directory under your home directory called “parse”.

  2. Go into that directory.

  3. Copy the file “pandemics.csv” from this directory: “/home/data/nise” to the directory you just made.

  4. Take a look at the file.

  5. Sort by year an put it into a new file called “sort.pandemic.csv” Note that you will need to change the delimiter to a comma (-t), tell it to sort numerically (-n), and tell it which column to sort (-k) sort -t, -n -k3 pandemics.csv > sort.pandemics.csv

  6. Pull out all the instances that mention plague Note that grep is case sensitive by default so you might want to use the -i flag to make it case insensitive in case the is inconsistent

  7. Pull out all the instance of the flu

  8. Let’s count the number of each type of pathogen (column 2). You could do this with a series of grep commands but with a big file you might not know all the possible pathogens and it would get tedious. So let’s do this in a step-wise fashion and make sure each piece is as expected, then pipe it into the next step.
    a. First pull out column 2 using awk or cut
    b. Get the unique values (make sure you “sort” before you “uniq” or you will only deredundify adjacent identical values).
    c. Get the count (hint: add -c to the uniq step)

  9. Grab and count the organisms for the instances that mention plague

Hint: It is the same as #8 except you need to grab the plague lines first. When you put a command after the pipe it will read in the output of the previous command so only use the file name on the first command.

Click for All Answers

Note that there are often several ways to do things in linux and not all methods are shown.

1. mkdir ~/parse

2. cd ~/parse

3. cp /home/data/nise/pandemics.csv . [Don't forget the space and period at the end; the period says to name it the same thing.]

4. cat pandemics.csv [you could also use more, less, head, tail, or other commands]

5. sort -t, -nk 3 pandemics.csv > sort.pandemics.csv

6. grep -i plague sort.pandemics.csv

7. grep -i flu sort.pandemics.csv

8. awk -F, '{print $2}' sort.pandemics.csv | sort | uniq -c
     [Note that you can also use: "cut -d, -f2 pandemics.csv" for the first step]

9. grep -i plague sort.pandemics.csv | awk -F, '{print $2}' | sort | uniq -c

4.2 Our World in Data

Now let’s use the data from “Our World In Data”. For things that are new, we have added some in-line answers but before you reveal the answer, try it yourself first using the hints provided.

  1. Make sure you are still in the parse directory.

  2. Link to the file “owid-covid-data.csv”. It is in this directory: “/home/data/nise/”.

Click for Answer
        ln -s /home/data/nise/owid-covid-data.csv .


  1. How many rows are in the file? Use wc -l (wc = word count, -l = lines)
Click for Answer
wc -l owid-covid-data.csv


  1. How many columns? There is a special variable, NF, in awk, which prints the number of fields

Note: this will print the number of fields in each row. You can hit ctrl-c if you don’t want to go through the whole file

Click for Answer
awk -F, '{print NF}' owid-covid-data.csv


  1. Let’s look at hospital_beds_per_thousand (column 60). We also need to get the location (country) from column 3 and the date from column 4. We’ll pipe it into less so that we can scroll through it (to get out of less, hit “q”).
Click for Answer
cut -d, -f 3,4,60 owid-covid-data.csv | less


  1. There are lots of dates for each country. Let’s do the same search but limit it to early in the pandemic (2020-03-11), which is the day that the World Health Organization declared COVID-19 to be pandemic. Put it into a file called “beds.2020-03-11.csv” so we don’t have to keep generating it.

  2. What country had the least beds per thousand on 2020-01-03 and which has the most? Hint: Sort by the number of beds and pipe it into less.

Note: Some lines have missing data and will sort at the top of the file so you will have to scroll down.

  1. How many beds were there in the United States on 2020-01-03?

Hint: There is a “United States” and a “United States Virgin Islands”. To avoid getting the latter, have the match end with a comma. Also, require grep to start the match at the beginning of the line (^). It is good practice to be specific when using grep.

Click for Answer
grep "^United States," beds.2020-03-11.csv


  1. How many beds were there in your three countries on 2020-01-03?

If your countries don’t have data on that date, go back to the original file and see if you can find data for any date.

Hint: we’ll use the extended version of grep which will allow us to search all 3 at once. Use the -E flag and put the things you are searching for in double quotes, seperated by pipes.

Click for Answer
grep -E "^Sweden,|^Norway,|^Denmark," beds.2020-03-11.csv


  1. Create a file that has only the United States data for beds with the same three columns we have been using. Call it usbeds.csv. We’ll use this file for the questions below.

  2. Get the number of beds in the United States for each day in 2020. Note: Since these numbers don’t change, this is kind of silly but it will give you the skills to grab other variables across a date range.

  3. Get the number of beds in the United States for the first COVID wave (March through September 2020).

Hint: Use sed to remove the dashes then awk with $2>=20200301&&$2<=20200930.

Click for Answer
sed 's/-//g' usbeds.csv | awk '$2>=20200301&&$2<=20200930{print}'


  1. BONUS: Get the average number of beds in the United States for the first COVID wave (March through September 2020). This is a tough one. See if you can understand the code.
Click for Answer
sed 's/-//g' usbeds.csv |awk -F, '$2>=20200301&&$2<=20200930{SUM+=$3;CNT+=1}END{print SUM/CNT}'


Click for All Answers
11. pwd [If you aren't in that directory you can: "cd ~/parse"]

10. ln -s /home/data/nise/owid-covid-data.csv . 

12. wc -l owid-covid-data.csv

13. awk -F, '{print NF}' owid-covid-data.csv

14. cut -d, -f 3,4,60 owid-covid-data.csv | less

15. cut -d, -f 3,4,60 owid-covid-data.csv | grep 2020-03-11 > beds.2020-03-11.csv

16. sort -t, -n -k3 beds.2020-03-11.csv | less

    Least: Mali,2020-03-11,0.1
    
    Most: Monaco,2020-03-11,13.8

17. grep "^United States," beds.2020-03-11.csv

  United States,2020-03-11,2.77

18. grep -E "^Sweden,|^Norway,|^Denmark," beds.2020-03-11.csv

  Denmark,2020-03-11,2.5
  Norway,2020-03-11,3.6
  Sweden,2020-03-11,2.22

19. cut -d, -f 3,4,60 owid-covid-data.csv | grep "^United States," > usbeds.csv

20. grep 2020 usbeds.csv

21. sed 's/-//g' usbeds.csv | sed 's/-//g' |awk -F, '$2>=20200301&&$2<=20200930{print}'

22. sed 's/-//g' usbeds.csv | sed 's/-//g' |awk -F, '$2>=20200301&&$2<=20200930{SUM+=$3;CNT+=1}END{print SUM/CNT}'

4.3 Practice with your 3 countries

Now practice on your own. Choose another variable and see if there are any interesting differences between your 3 countries. Explore at least 3 other variables.

Hint: You can look at the variables by heading the first line of the file.

head -1 owid-covid-data.csv

Hint: If you prefer to have them each on their own line you can use sed to replace the commas with hard returns. Make sure to use the global “g” at the end so that every comma is replaced and not just the first one on each line.

head -1 owid-covid-data.csv | sed 's/,/\n/g'

Hint: awk will also print out the row number for you (which corresponds to the column number for that variable in the original file). The NR variable is a special awk variable that will print the row.

head -1 owid-covid-data.csv | sed 's/,/\n/g' |awk '{print NR "\t" $1}'

4.4 Country files

Make one file each for Denmark, Norway, and Sweden that we’ll use in future chapters. We’ll get the following columns: location (column 3) date (column 4) total_cases (column 5) total_deaths (column 8) total_cases_per_million (column 11) total_deaths_per_million (column 14) icu_patients (column 18) icu_patients_per_million (column 19) people_fully_vaccinated (column 37) people_fully_vaccinated_per_hundred (column 43)

An example is below.

grep "Norway" /home/data/nise/owid-covid-data.csv | cut -d, -f 3,4,5,8,11,14,18,19,37,43 > Norwaydata.csv

Make a similar file for each of your three countries.