Why use KnitR for scientific publishing?

KnitR Logo

All the figures and text for this post were created in a KnitR document, as an example of how easy it is to use and interpret. The code is available via a link at the end. Also, check out KnitR’s homepage

In a paper, it is common for someone to show a bar chart indicating a difference between two means. A common problem is that the error bars are not fully explained (STD, SEM?) and the data distribution is not examined.

An example could be:

Barchart showing two variables where error bars indicate they may have different means
Figure 1 shows that there is an increase in biomarker activity when chemical Y is combined with chemical X (blue) as opposed to just chemical X (red)

 

They may then draw conclusions from there. As a reader, it would be nice to be able to investigate a) what the error bars actually mean (if only all papers went to a stats reviewer) and b) what the data really looked like, beyond these means and errors.

KnitR helps investigate what has been shown

First off, by having access to the code that produced the figures, even a non-programmer ought to be able to have a guess at simple things like whether the error bars are standard deviations or standard errors of the means. Let’s look at the standard R-code that is easily found in the KnitR ‘R Markdown’ file that was used to include the figure above:

means <- c(mean(a),mean(b))
mp <- barplot(means, axes=FALSE, axisnames=FALSE, ylim=c(0, 2),col=c('red', 'blue'), main="Biomarker activity", xlab="Treatment", ylab="Relative serum concentrations")
axis(1, labels=c("ChemX", "ChemX&ChemY"), at = mp)
axis(2, at=seq(0 , 2, by=0.1))
box()
# error bars
sems <- matrix(c(sqrt(var(a)/length(a)), sqrt(var(b)/length(b))), 2)
segments(mp, means - sems, mp, means + sems, lwd=2)
# 1. The lower bar
segments(mp - 0.1, means - sems, mp + 0.1, means - sems, lwd=2)
# 2. The upper bar
segments(mp - 0.1, means + sems, mp + 0.1, means + sems, lwd=2)

Straight away we can see that even though they’ve unhelpfully (it was me, I admit it) named their main data vectors ‘a’ and ‘b’, they have commented the section where they calculate the error bars and we can see that they are calculating a variable called ‘sems’ using an equation that involves ‘sqrt’ and ‘var’ rather than ‘sd’ so it’s probably fair to say they have used standard error of the mean for their error bars. Having this knowledge alone would revolutionise science publishing – until peer review improves.

OK, you’re not here for an R lesson (I hope not) but I’m hoping to show that the addition of a KnitR file would allow relatively easy investigation of undocumented methods by even non-programmers. Sure, not all code is commented, but the chunk of code can be found by finding it in relation to the publication’s text (see how this works in the code for this page, link at bottom of page) and at least the programming language’s functions should have useful names even if the author’s variable names don’t.

KnitR allows easy revisualisation

Having had a quick look at the code, we can see that the bar plot is made from vectors ‘a’ and ‘b’. This is enough for us to quickly revisualise the data using a plot that describes the distribution of the data more accurately. For example, we could use a boxplot to show the quartiles, range, outliers etc.

KnitR remembers variables in any chunk of code after they’ve initially been declared, so we could put the boxplot() command anywhere after ‘a’ and ‘b’ have been created – running KnitR on the document will then reproduce it with your new visualisation.

In the example below, I place the boxplot() command just before the original barplot() which I then comment out. Again, this is not an R tutorial, it’s just showing how few changes need to be made to re-run the creation of the original paper, but with better plots.

means <- c(mean(a),mean(b))
boxplot(data.frame(a,b))
#mp <- barplot(means, axes=FALSE, axisnames=FALSE, ylim=c(0, 2),col=c('red', 'blue'), main="Biomarker activity", xlab="Treatment", ylab="Relative serum concentrations")
#axis(1, labels=c("ChemX", "ChemX&ChemY"), at = mp)
#axis(2, at=seq(0 , 2, by=0.1))
#box()
# error bars
#sems <- matrix(c(sqrt(var(a)/length(a)), sqrt(var(b)/length(b))), 2)
#segments(mp, means - sems, mp, means + sems, lwd=2)
# 1. The lower bar
#segments(mp - 0.1, means - sems, mp + 0.1, means - sems, lwd=2)
# 2. The upper bar
#segments(mp - 0.1, means + sems, mp + 0.1, means + sems, lwd=2)
Box plot indicating the data is not normal or well described by the means
The revisualised data shows that the two groups are greatly overlapped and share similar means. It indicates a non-normal distribution too.

 

From revisualising the data we can see that the story is not so clear cut as the bar chart led us to believe. The second group (b) does have a higher upper range but the median is roughly the same as in the other group (a). Both the distributions don’t look normal at all, heavily skewed to the positive and possibly curtailed at 0, so any tests based on normality assumptions would have to be re-assessed. (In fact, this data was produced using two different Weibull distributions, a more realistic case than the normal distribution for individual metabolite concentrations or gene expressions)

A need to improve the literature

Sadly, this type of over-simplification of descriptive statistics probably goes on all the time in modern science literature. Certainly the number of cases where error bars are not accurately described is astounding, and it is quite rare to see the distribution of the data investigated prior to ‘standard’ tests that make assumptions about the distribution.

These could be simple over-sights, or the honest mistake of a junior researcher that isn’t so comfortable with statistics (we’ve all been there!). The important thing is to allow easy investigation of these issues by reviewers and by other researchers later – wrapping your publication in KnitR does just that.

KnitR is flexible, allowing almost any command line code to be incorporated into a document and a great variety of output formats – here I’ve used R Markdown to output to HTML (looking a lot like an IPython notebook, and KnitR can handle Python as input) but KnitR’s LaTeX options provide full scientific publication standard outputs.

Even if LaTeX composition of a full publication seems onerous, it would be highly beneficial to provide a simple R Markdown version, like this page, as a supplemental file to accompany the PDF paper.

Without the ability for easy scrutiny like this, we end up with a situation where an estimated 85% of research resources are being wasted!

Extra info:

The data distributions for this quick example were created using:

set.seed(9)
b <- rweibull(100,1,1.5)
a <- rweibull(100,1.5,1.5)

The full code for this page can be found here: markdown code for KnitR science publishing example

To recreate this page, you need to install the KnitR package and then call the library:

install.packages('knitr')
library('knitr')

then use one of the KnitR conversion tools to turn the R Markdown (with R code) into a document. This page used Knit2HTML:

knit2html('path/to/knitr_scipub_example.md')

One thought on “Why use KnitR for scientific publishing?

Leave a Reply

Your email address will not be published. Required fields are marked *