Chapter 6 Our World in Data (plots)
Now let’s use the data from “Our World in Data” that you created yesterday. We’ll start with the Norway file.
Move into the parse directory.
cd ~/parse
Open up R.
R
Load ggplot2.
library(ggplot2)
Read in “Norwaydata.csv”, which you made in the last chapter. It has no header and it is comma separated.
nor=read.table("Norwaydata.csv", header=FALSE, sep=",")
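Because the file has no header row, read.table falls back to default column names (V1, V2, ...). A small sketch with two made-up rows, read from a string through the text argument instead of a file, shows what that looks like. The name demo and the values are illustrative only.

```r
# Two made-up rows in the same comma-separated, headerless layout
demo <- read.table(text = "Norway,2020-02-26,1,0
Norway,2020-02-27,4,0", header = FALSE, sep = ",")

names(demo)   # default names: "V1" "V2" "V3" "V4"
nrow(demo)    # 2 rows, so the first line was read as data, not as a header
```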
Add column names: Location, Date, Total_Cases, Total_Deaths, Total_Cases_Per_Million, Total_Deaths_Per_Million, ICU_Patients, ICU_Patients_Per_Million, Fully_Vaccinated, Fully_Vaccinated_Per_Hundred
colnames(nor) = c("Location", "Date", "Total_Cases", "Total_Deaths",
"Total_Cases_Per_Million", "Total_Deaths_Per_Million",
"ICU_Patients", "ICU_Patients_Per_Million",
"Fully_Vaccinated", "Fully_Vaccinated_Per_Hundred")
Fix the date field
nor$Date=as.Date(nor$Date, format="%Y-%m-%d")
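A short sketch of what the conversion buys you, using three made-up date strings (raw and d are illustrative names). Once the values are of class Date, R can order them and do arithmetic on them, which is what lets ggplot2 space the x axis by time.

```r
# Made-up date strings in the same YYYY-MM-DD layout as the file
raw <- c("2020-02-26", "2020-02-27", "2020-03-01")
d <- as.Date(raw, format = "%Y-%m-%d")

class(d)              # "Date"
as.numeric(diff(d))   # gaps in days between consecutive dates: 1 3
```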
6.1 Bar chart
Plot a bar chart of Date x Total_Cases. The “geom_col” layer will make a bar chart.
We’ll use the width parameter to make the width 100% (no space between bars).
png("datexcases.png")
ggplot(nor, aes(x=Date, y=Total_Cases)) + geom_col(width=1)
dev.off()
Let’s add a title.
png("datexcases_title.png")
ggplot(nor, aes(x=Date, y=Total_Cases)) +
geom_col(width=1) +
ggtitle("Norway COVID-19 Cases")
dev.off()
6.2 Line chart
Now let’s make a line chart of total deaths.
png("datexdeaths.png")
ggplot(nor, aes(x=Date, y=Total_Deaths)) +
geom_line() +
ggtitle("Norway COVID-19 Deaths")
dev.off()
Let’s put them both on the same chart. We’ll make the bar chart (cases) gray and the line (deaths) blue so it shows up better.
png("datexcases_deaths.png")
ggplot(nor, aes(x=Date, y=Total_Cases)) +
geom_col(width=1,color="darkgray") +
geom_line(aes(y=Total_Deaths),color="blue") +
scale_y_continuous(
# Features of the first axis
name = "Total Cases",
# Add a second axis and specify its features
sec.axis = sec_axis(trans=~., name="Total Deaths")
)
dev.off()
The number of deaths is really low compared to the number of cases (thankfully!), so the blue line sits right at the bottom. Let’s rescale it: we multiply Total_Deaths by 100 when drawing the line and divide the secondary axis by 100 (trans=~./100) so that its labels still show the real death counts.
We’ll also give it a white background by changing to the black and white theme, color the right axis label blue, and add a title.
png("datexcases_deaths_adjust.png")
ggplot(nor, aes(x=Date, y=Total_Cases)) +
geom_col(width=1,color="darkgray") +
geom_line(aes(y=Total_Deaths*100),color="blue") +
scale_y_continuous(
# Features of the first axis
name = "Total Cases",
# Add a second axis and specify its features
sec.axis = sec_axis(trans=~./100, name="Total Deaths")
) +
theme_bw() +
ggtitle("Norway COVID-19 Cases and Deaths") +
theme(axis.title.y.right = element_text(color = "blue"))
dev.off()
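The factor of 100 was picked by eye. As a sketch of one alternative, the factor can be derived from the data so that the largest death count lines up with the largest case count. The data frame demo_scale and its numbers are made up for illustration; with the real data you would use nor in its place.

```r
# Made-up totals standing in for nor$Total_Cases and nor$Total_Deaths
demo_scale <- data.frame(
  Total_Cases  = c(1000, 50000, 120000),
  Total_Deaths = c(10, 400, 900)
)

# Ratio of the two maxima: multiply deaths by this to draw the line
scale_factor <- max(demo_scale$Total_Cases, na.rm = TRUE) /
  max(demo_scale$Total_Deaths, na.rm = TRUE)

scaled_deaths <- demo_scale$Total_Deaths * scale_factor

# The secondary axis divides by the same factor, recovering the real counts
recovered <- scaled_deaths / scale_factor
```

In the plot, scale_factor would take the place of 100 in both geom_line and sec_axis.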
Try some on your own using some of the other variables, scales, and/or plot types.
Note: Some of the variables in OWID are not reported every day for all countries. If something is only reported once a week, for example, and you try to draw it with geom_line, you will see nothing. geom_line only connects adjacent data points, and because there is a row for every day, a weekly value has NA on both sides and nothing to connect to. One way around this is to remove the rows that are NA in that column before plotting it. For the fully vaccinated column, for example, draw the line from a copy of the data with the NA rows dropped, while the other layers keep all the rows for Total_Cases.
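A minimal sketch of that workaround, using a made-up data frame in place of nor, with Fully_Vaccinated filled in only once a week. The names nor_demo, vacc, and p_vacc and all the values are illustrative, not taken from the OWID files.

```r
library(ggplot2)

# Made-up stand-in for nor: 15 daily rows, vaccinations reported weekly
nor_demo <- data.frame(
  Date = seq(as.Date("2021-01-01"), by = "day", length.out = 15),
  Total_Cases = seq(1000, by = 50, length.out = 15),
  Fully_Vaccinated = NA_real_
)
nor_demo$Fully_Vaccinated[c(1, 8, 15)] <- c(200, 900, 1600)

# Keep only the rows where Fully_Vaccinated was actually reported
vacc <- nor_demo[!is.na(nor_demo$Fully_Vaccinated), ]

# Bars use every row; the line uses only the reported rows
p_vacc <- ggplot(nor_demo, aes(x = Date, y = Total_Cases)) +
  geom_col(width = 1, color = "darkgray") +
  geom_line(data = vacc, aes(y = Fully_Vaccinated), color = "blue")
```

To write it to a file, print p_vacc between png() and dev.off() as in the earlier plots.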
6.3 Plot all 3 countries + transparency
ggplot2 makes it easy to get all 3 countries on a single plot. It is also easy to make plots with a transparent background so you can stack them (similar in concept to layers in ggplot2), combine them with other objects, or let some of the background show through.
Here is an example of plotting the date (x axis) vs COVID-19 cases per million (y axis). Read in the data for Norway, Denmark, and Sweden. I have combined Norwaydata.csv, Denmarkdata.csv, and Swedendata.csv in the file: /home/data/nise/nds.csv
nds = read.table("/home/data/nise/nds.csv", header=FALSE, sep=",")
Give it some headers.
colnames(nds) = c("Country", "Date", "Total_Cases", "Total_Deaths",
"Cases_Per_Million", "Deaths_Per_Million",
"ICU_Patients", "ICU_Patients_Per_Million",
"Fully_Vaccinated", "Fully_Vaccinated_Per_Hundred")
Fix the date field
nds$Date=as.Date(nds$Date, format="%Y-%m-%d")
The x axis dates overwrite each other, so first let’s build a vector of labels in which all but every 100th date is blanked out. We’ll use it to replace the x axis labels (+ scale_x_discrete(labels = xlabels)).
# as.character() so that blank strings can be stored alongside the dates
xlabels <- as.character(sort(unique(nds$Date)))
for (x in 2:100) {
xlabels[seq(x, length(xlabels), 100)] <- ""
}
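The loop keeps the labels at positions 1, 101, 201, and so on, and blanks the rest. The same result can be sketched without a loop using the remainder operator. The vector xlabels_demo below is a made-up stand-in for the real labels (250 consecutive days stored as text), and keep is an illustrative name.

```r
# Made-up stand-in for the sorted unique dates, stored as text
xlabels_demo <- as.character(
  seq(as.Date("2020-01-01"), by = "day", length.out = 250)
)

# TRUE for positions 1, 101, 201, ...; FALSE everywhere else
keep <- (seq_along(xlabels_demo) - 1) %% 100 == 0
xlabels_demo[!keep] <- ""

which(xlabels_demo != "")   # positions that still carry a label: 1 101 201
```

Unlike the loop, this form also works when there are fewer than 100 labels.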
We’ll add some layers that make the plot transparent. We put the plot into a variable and then pass the variable to ggsave so we can tell it to keep the background transparent while exporting. We’ll save it as a PNG, which supports a transparent background.
We tell it to group and color the data by Country (group=Country, col=Country, fill=Country). We are also switching from geom_col to geom_area, which makes it easier to make the 3 countries each semi-transparent (using alpha) so we can plot them on top of each other (position="identity"). We will also turn the x labels 90 degrees and add several theme lines to make the background and other parts of the plot transparent.
p=ggplot(nds, aes(x=Date, y=Cases_Per_Million, group=Country, col=Country, fill=Country)) +
geom_area(aes(y=Cases_Per_Million),alpha=0.2,position="identity")+
ggtitle("Norway, Denmark, Sweden: COVID-19 Cases Per Million") +
theme_bw()+
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
panel.background = element_rect(fill='transparent'), #transparent panel bg
plot.background = element_rect(fill='transparent', color=NA), #transparent plot bg
panel.grid.major = element_blank(), #remove major gridlines
panel.grid.minor = element_blank(), #remove minor gridlines
legend.background = element_rect(fill='transparent'), #transparent legend bg
legend.box.background = element_rect(fill='transparent') #transparent legend panel
) + scale_x_discrete(labels = xlabels)
ggsave('cases_perM.png', p, bg='transparent')
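If Date is stored as class Date (as it is after the as.Date step), an alternative to blanking labels by hand is to let a date scale choose the breaks. This is a swapped-in technique, not the approach used in this chapter: a sketch with made-up data that uses scale_x_date in place of scale_x_discrete(labels = xlabels). The names demo_nds and p_demo, the country labels, and the values are all hypothetical.

```r
library(ggplot2)

# Made-up stand-in for nds: two countries, 300 daily rows each
dates <- seq(as.Date("2020-03-01"), by = "day", length.out = 300)
demo_nds <- data.frame(
  Country = rep(c("A", "B"), each = 300),
  Date = rep(dates, times = 2),
  Cases_Per_Million = c(cumsum(rep(10, 300)), cumsum(rep(15, 300)))
)

# One labelled tick every 100 days, formatted as YYYY-MM-DD
p_demo <- ggplot(demo_nds, aes(x = Date, y = Cases_Per_Million,
                               group = Country, col = Country, fill = Country)) +
  geom_area(alpha = 0.2, position = "identity") +
  scale_x_date(date_breaks = "100 days", date_labels = "%Y-%m-%d") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```

The transparency settings from the theme() call above can be added to p_demo unchanged.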
We’ll do the same thing for Deaths_Per_Million, except that we will plot lines this time. We’ll use the same xlabels vector that we made for the previous plot.
p=ggplot(nds, aes(x=Date, y=Deaths_Per_Million, group=Country, col=Country, fill=Country)) +
geom_line(size=2)+
ggtitle("Norway, Denmark, Sweden: Deaths Per Million") +
theme_bw()+
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
panel.background = element_rect(fill='transparent'), #transparent panel bg
plot.background = element_rect(fill='transparent', color=NA), #transparent plot bg
panel.grid.major = element_blank(), #remove major gridlines
panel.grid.minor = element_blank(), #remove minor gridlines
legend.background = element_rect(fill='transparent'), #transparent legend bg
legend.box.background = element_rect(fill='transparent') #transparent legend panel
) + scale_x_discrete(labels = xlabels) +
scale_y_continuous(
name="",
sec.axis = sec_axis(trans=~./1, name="Deaths_Per_Million")
)
ggsave('deaths_perM.png', p, bg='transparent')
And, finally, let’s combine them. It will make the plot a little busy, so think about whether it really is useful to have everything on the same plot.
p=ggplot(nds, aes(x=Date, y=Cases_Per_Million, group=Country, col=Country, fill=Country)) +
geom_area(aes(y=Cases_Per_Million),alpha=0.4,color=NA,position="identity", show.legend = FALSE)+
# deaths are multiplied by 100 to share the left axis; the secondary axis divides by 100
geom_line(aes(y=Deaths_Per_Million*100,color=Country),size=2) +
scale_y_continuous(
name = "Cases Per Million (Shading)",
sec.axis = sec_axis(trans=~./100, name="Deaths Per Million (Lines)")
) +
ggtitle("Norway, Denmark, Sweden: COVID-19 Cases and Deaths Per Million") +
theme_bw()+
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
panel.background = element_rect(fill='transparent'), #transparent panel bg
plot.background = element_rect(fill='transparent', color=NA), #transparent plot bg
panel.grid.major = element_blank(), #remove major gridlines
panel.grid.minor = element_blank(), #remove minor gridlines
legend.background = element_rect(fill='transparent'), #transparent legend bg
legend.box.background = element_rect(fill='transparent') #transparent legend panel
) + scale_x_discrete(labels = xlabels)
ggsave('cases_deaths_perM.png', p, bg='transparent', width=12, height=10, units="in", dpi=600)
BONUS: Interactive plots
Turning ggplot2 plots into interactive html plots is straightforward with the plotly library. Let’s do it for the Norway cases and deaths plot from section 6.2.
You should still be in R.
Make sure the following libraries are loaded.
library(ggplot2)
library(plotly)
library(htmlwidgets)
Put the plot into a variable. We’ll call it myplot.
myplot = ggplot(nor, aes(x=Date, y=Total_Cases)) +
geom_col(width=1,color="darkgray") +
geom_line(aes(y=Total_Deaths*100),color="blue") +
scale_y_continuous(
# Features of the first axis
name = "Total Cases",
# Add a second axis and specify its features
sec.axis = sec_axis(transform=~./100, name="Total Deaths")
) +
theme_bw() +
ggtitle("Norway COVID-19 Cases and Deaths") +
theme(axis.title.y.right = element_text(color = "blue"))
Make an interactive version of myplot.
myplotint = ggplotly(myplot)
Save it as an HTML file with an accompanying library folder (“lib”).
saveWidget(myplotint, "datexcases_deaths_adjust.html", selfcontained = FALSE, libdir = "lib")
Use scp to copy both the html file and the lib folder to your computer (use -r for recursive to get the lib folder plus everything in it). The lib folder has to be in the same folder as the html file.
Open up the html file.