Blogging With R and ggplot2 in Org
I want to include more graphics in my writing. I’m not very good at it, so I will use this place to practise. One type of graphic – that is very efficient at communicating things – is the data plot.
Plots Matter
I used to read The Economist regularly, and they are very good at plotting data. When I wanted to share information from their articles, it was often sufficient to just share a single key plot. That is how powerful plots are. Edward Tufte also hammers this concept home in his fantastic book about presenting data visually (Edward Tufte; The Visual Display of Quantitative Information; Graphics Press; 1986). A well-designed plot can mean the difference between something important going unnoticed and being the focus of an article.
At some point, I would like to learn more about making statistical calculations with computers. I also want nice plots. It appears the R language fulfills both those criteria, so I’ll start by using it to draw plots.
Setup: Org with R and ggplot2
The setup requires two steps. The first is installing R and ggplot2, which is easy enough in Debian:
$ sudo apt-get install r-base r-cran-ggplot2
If we want to test whether the installation worked, we should be able to start R on the terminal by running the R command, enter a string like "hello", and see it echoed back. (Do people not do this anymore? It was surprisingly hard to find out how to just start R. And yes, that’s an upper-case R, which is a relief for those of us who have an alias r='fc -e -' to repeat the last used command.)
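Once R starts, a quick way to confirm that ggplot2 is usable as well is to ask R whether the package can be loaded. This is a minimal sketch rather than part of the setup above; the variable name ggplot2_available is only illustrative.

```r
# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise, without attaching it to the search path.
ggplot2_available <- requireNamespace("ggplot2", quietly = TRUE)

if (ggplot2_available) {
  message("ggplot2 version: ", as.character(packageVersion("ggplot2")))
} else {
  message("ggplot2 could not be loaded; check the r-cran-ggplot2 package")
}
```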
We then want Org to be allowed to execute R code. We can specify that by setting the org-babel-load-languages variable to include the languages we want executed.
(org-babel-do-load-languages
 'org-babel-load-languages
 '((emacs-lisp . t)
   (R . t)))
We also want an Emacs mode for editing R code, since it makes the process of creating graphs much easier. ESS (Emacs Speaks Statistics) is a collection of extensions that gives support for, among other things, R.
(use-package ess :init (require 'ess-site))
And, finally, we may want to set the following variables. They reduce the amount of question-asking Org and ESS do when publishing. (This is a requirement for me, because I publish posts in a non-interactive git post-receive hook.)
(setq org-confirm-babel-evaluate nil)
(setq ess-directory "/tmp")
(setq ess-ask-for-ess-directory nil)
Vector Graphics From R to Org
Since baby steps are a good idea in general, that’s what we’ll start with. We want our Org document to include an R source block that can generate the plot we want. (Okay, in this guide I’ll have to embed Org code in an Org document; this is going to be tricky.) We add some special parameters for publishing, which are explained below.
#+HEADER: :file myplot.svg
#+HEADER: :R-dev-args bg="transparent"
#+BEGIN_SRC R :exports results :session :results graphics
  # R code to draw plot
#+END_SRC
The SVG export is the tricky bit, because information is scarce. To get proper SVG exports from R in Org, we need a #+BEGIN_SRC R block, of course. (It’s worth mentioning, for your debugging ability, that this stuff is handled by the Org Babel extension.) The parameters we need are as follows.
Obviously, :file myplot.svg specifies a filename for the graphics. If it ends in .svg, we get SVG graphics. We want SVG graphics because they scale to arbitrary pixel densities, as anything on the web must these days. (Hey, this is actually somewhat of a realisation. Vector graphics are no longer about scaling to arbitrary sizes, they are about scaling to arbitrary pixel densities. Neat.) This file needs to be exported along with the HTML.
We tell Org to replace the source code block with the results of its evaluation using the :exports results argument. In our case, that will be the embedded SVG file. Then :results graphics makes sure everyone is on the same page about R being used to generate graphics, not to print text.
For the longest time, I could not get a transparent SVG file. For some reason R just wanted to put a huge <rect> with a white fill colour at the base of the image, regardless of what my code said. Eventually, I figured out that I needed the :R-dev-args bg="transparent" argument to preserve image transparency.
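For debugging outside Org, it can help to reproduce the export by hand. The sketch below assumes that Org Babel wraps the block in a graphics device call roughly like this one, passing the :R-dev-args through to the device; that is my reading of the behaviour described above, not something taken from the Org source. The output path and the toy plot are placeholders.

```r
# Assumed equivalent of the Org header arguments: open an SVG device with a
# transparent background, draw, then close the device to write the file.
out_file <- file.path(tempdir(), "myplot.svg")

svg(out_file, bg = "transparent")
plot(c(1, 2, 3), c(2, 4, 8))   # placeholder for the real plotting code
invisible(dev.off())           # closing the device flushes the SVG to disk

# Read the file back so the markup can be inspected.
svg_text <- readLines(out_file, warn = FALSE)
```

Searching svg_text for a white-filled <rect> is then a quick way to check whether the background really is transparent.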
Multiple online manuals are very assertive about the :session argument being required, so I’ve included it. I think it also implies that variables will be shared across R code blocks in the document, which may or may not be what you want.
Editing R code embedded in Org
You can, of course, edit the code block straight in Org as any text. However, there is a better way. If you press C-c ' with the cursor positioned inside the source block, Org will open up a new Emacs window with only the R source code in it. This new window is synchronised with the Org file, meaning that if you edit and save in the R window, the Org file will also be updated and saved.
If you want to test run the code in the window, press C-c C-b. To exit the new window, press C-c ' again.
R and the Grammar of Graphics Paradigm
This is only a whirlwind tour. There is so much more to learn.
The ggplot2 API is based on something called the Grammar of Graphics, which is a standardised way to talk about plots. This grammar (and, consequently, ggplot2) has three key concepts we need to introduce right away.
The first concept is probably the most intuitive. It’s called data, and it is the information we want to plot, plain and simple. Each data point consists of two or more variables, and some of these variables will be represented in the plot. (So, for example, a data point can be (1) the amount of money I have (2) at a particular point in time. Or it can be (1) the amount of money (2) a person in a group has. Or maybe (1) the gas mileage of (2) a car model at (3) a particular point in time for (4) a particular air density. You can cram a lot of variables into a data point, and plotting them all sensibly can be a challenge.)
How the data appears in the plot is based on a mapping from data variables to either spatial dimensions (say, money on the Y axis and time on the X axis) or some other dimension that is intuitively understood visually (like thickness of lines, darkness of shading or colours of a heatmap; these all give a quick sense of how large the value they represent is). The mapping between data variables and plot dimensions is called an aesthetic.
Finally, the physical shape used to represent the data points (for example lines, points, violins and so on) is called a geometry. (Violins sound funnier than they are. There’s actually a thing called a violin plot.) The same data points can be plotted using many geometries, and all will be rendered on the same plot.
To begin with, we can use this R code, representing two weeks’ worth of sleep and mood tracking data.
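library(ggplot2)

# Two weeks of tracking: hours of sleep and a happiness rating from 1 to 5.
simple_data = data.frame(
  sleep = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

# Map sleep to the X axis and happiness to the Y axis, then draw points.
ggplot(simple_data, aes(x=sleep, y=happiness)) +
  geom_point()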
The top vector represents how many hours of sleep someone had the night before. The bottom vector is a happiness rating from one to five. We pack them into a data frame and then we start plotting.
We configure an aesthetic which maps the amount of sleep to the X axis and the happiness rating to the Y axis. We don’t specify any colours or other such dimensions, because our data points consist of only two variables. (Generally, do not plot data using more dimensions than the data has to begin with. Do not plot two-dimensional data with three dimensions, for example. The opposite is okay. We can reduce the complexity of data by plotting only some of its dimensions.) After the ggplot(simple_data, aes(x=sleep, y=happiness)) call, we have data, and we have an aesthetic that maps the data to plot coordinates. But we still don’t have any geometrical shape to represent the data entries. If we showed the plot at that point, it would essentially be an empty grid.
So we add a geometry to our data with the + operator: we tell ggplot2 that, “After you have drawn the empty plot with grid lines and stuff, please add a point geometry for each entry in the data.”
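One way to see this layering in action is to count the layers on the plot object before and after adding geometries. This is a small sketch under the assumption that the plot object exposes its geometries through a layers field, which holds in the ggplot2 versions I know of; the variable names are only illustrative.

```r
library(ggplot2)

simple_data <- data.frame(
  sleep = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

# Data and aesthetic only: this would render as an empty grid.
base_plot <- ggplot(simple_data, aes(x = sleep, y = happiness))

# Each + adds one geometry as a new layer on top of the previous ones.
with_points <- base_plot + geom_point()
with_points_and_line <- with_points + geom_line()

n_layers <- c(
  length(base_plot$layers),
  length(with_points$layers),
  length(with_points_and_line$layers)
)
print(n_layers)
```

The last plot draws the same data with two geometries on the same set of axes.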
With that, we get a scatter plot of happiness against sleep, which is probably the most basic plot you’ll create in ggplot2.
Styling the Plot
The previous plot uses the default ggplot2 style, which is decent but I want something slightly more modern and optimised for the web. A good start is to apply the classic theme that ships with ggplot2. This removes a lot of visual noise.
ggplot(simple_data, aes(x=sleep, y=happiness)) + geom_point() + theme_classic()
We may also want to remove the tickmarks on the axes, as well as the solid white background.
ggplot(simple_data, aes(x=sleep, y=happiness)) +
  geom_point() +
  theme_classic() +
  theme(axis.ticks=element_blank(),
        panel.background=element_blank(),
        plot.background=element_blank())
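Since these theme settings get repeated for every plot, one option is to store them in a variable and add that to each plot. This is only a sketch of that idea; web_theme and styled_plot are hypothetical names of my own.

```r
library(ggplot2)

# Hypothetical helper: the classic theme plus the tweaks described above.
web_theme <- theme_classic() +
  theme(axis.ticks = element_blank(),
        panel.background = element_blank(),
        plot.background = element_blank())

simple_data <- data.frame(
  sleep = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

# The stored theme is added like any other plot component.
styled_plot <- ggplot(simple_data, aes(x = sleep, y = happiness)) +
  geom_point() +
  web_theme
```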
Making the Plot Easier to Read
The distinction between this section and the previous one is a bit arbitrary, because good style makes the plot easy to read, and something that makes the plot easy to read is good style.
There are several issues with the X axis at the moment. Just off the top of my head: the label does not convey a lot of information, and the scale is cut off on the left-hand side, so it doesn’t extend all the way to the origin. All configuration of the X scale is done in the example below. The ticks are placed at the values indicated by breaks, which is set to the vector c(0,2,4,…,8,10).
While we’re at it, we also configure the Y axis to extend all the way down to zero.
ggplot(simple_data, aes(x=sleep, y=happiness)) +
  geom_point() +
  scale_x_continuous("Sleep (hours)", limits=c(0, 10), breaks=seq(0, 10, 2)) +
  scale_y_continuous("Happiness (subjective rating)", limits=c(0, 5), breaks=0:5) +
  theme_classic() +
  theme(axis.ticks=element_blank(),
        panel.background=element_blank(),
        plot.background=element_blank())
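To check what the breaks arguments evaluate to, the two expressions can be run on their own in an R session. The variable names here are only for illustration.

```r
# seq(from, to, by): every second value from 0 up to and including 10.
x_breaks <- seq(0, 10, 2)
print(x_breaks)   # 0 2 4 6 8 10

# The colon operator gives consecutive integers.
y_breaks <- 0:5
print(y_breaks)   # 0 1 2 3 4 5
```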
Actually, the Y axis label is a bit hard to read. We can move that to the title of the plot instead, making it horizontal without robbing too much space.
ggplot(simple_data, aes(x=sleep, y=happiness)) +
  geom_point() +
  scale_x_continuous("Sleep (hours)", limits=c(0, 10), breaks=seq(0, 10, 2)) +
  scale_y_continuous("", limits=c(0, 5), breaks=0:5) +
  labs(title="Happiness (subjective rating 1–5) as a function of sleep") +
  theme_classic() +
  theme(axis.ticks=element_blank(),
        panel.background=element_blank(),
        plot.background=element_blank())
The graphic is square at the moment, which is often not ideal. Tufte talks about having a width about 1.2–1.8 times the height, but I’m going to pick something slightly more extreme for this example, primarily because my X axis is covering a larger range than my Y axis. Note that we cannot set the height inside the R code itself, so this is something we set in the Org #+HEADER: line. I’m now running with
#+HEADER: :R-dev-args bg="transparent" :width 7 :height 3.5
ggplot(simple_data, aes(x=sleep, y=happiness)) +
  geom_point() +
  scale_x_continuous("Sleep (hours)", limits=c(0, 10), breaks=seq(0, 10, 2)) +
  scale_y_continuous("", limits=c(0, 5), breaks=0:5) +
  labs(title="Happiness (subjective rating 1–5) as a function of sleep") +
  theme_classic() +
  theme(axis.ticks=element_blank(),
        panel.background=element_blank(),
        plot.background=element_blank())
Adding Annotations
If we’re curious about the potential relationship between sleep and happiness, it could be interesting to overlay a linear regression.
It’s also useful to be able to annotate the plot – but beware that this is not a place where ggplot2 shines. If you’re doing serious publishing, you probably want to annotate the plot in something that’s better for it, like Inkscape. But for simple notes, go ahead and embed it in the code!
ggplot(simple_data, aes(x=sleep, y=happiness)) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x, fullrange=TRUE) +
  scale_x_continuous("Sleep (hours)", limits=c(0, 10), breaks=seq(0, 10, 2)) +
  scale_y_continuous("", limits=c(0, 5), breaks=0:5) +
  labs(title="Happiness (subjective rating 1–5) as a function of sleep") +
  theme_classic() +
  theme(axis.ticks=element_blank(),
        panel.background=element_blank(),
        plot.background=element_blank()) +
  annotate("segment", x=7, xend=7.5, y=2, yend=1.5, colour="grey50") +
  annotate("text", label="Good sleep, bad day", x=7.2, y=1.3, hjust=0, colour="grey50")
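With method="lm" and formula=y~x, the smoothing layer fits an ordinary least-squares line. If you want the numbers behind that line, a sketch like the following fits the same model directly with lm() from base R; fit and coefs are just illustrative names.

```r
simple_data <- data.frame(
  sleep = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

# Same model as the smoothing layer: happiness as a linear function of sleep.
fit <- lm(happiness ~ sleep, data = simple_data)

# Named vector with the intercept and the slope for sleep.
coefs <- coef(fit)
print(coefs)
```

The sleep coefficient is the slope of the regression line, that is, the estimated change in happiness rating per extra hour of sleep.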
The End
I’m going to stop here, because I don’t think I have many more useful things to say in what little space remains on the page. I could write a lot more about R, and I could write a lot more about plotting well, but those things are better suited for a different article. I hope this is enough to get you started publishing with R, because it’s a great tool to have in your arsenal!