Overview

The R package ggforestplot allows to plot vertical forest plots, a.k.a. blobbograms, and it’s based on ggplot2.

In this tutorial we will go through its basic functionality, as well as how one can produce grouped plots, using demo data from Nightingale’s NMR platform.

Installation

You can install ggforestplot from github as shown below (unless already installed, you need to install devtools first):

If you want display package vignettes with utils::vignette(), install ggforestplot with devtools::install_github("NightingaleHealth/ggforestplot", build_vignettes = TRUE). However, installing with building the vignettes takes little bit longer. (Note: If dependencies are not installed automatically, try updating devtools.)

Usage

The main plotting function is ggforestplot::forestplot(). It’s main input is a data frame that contains the values and corresponding standard errors to be plotted in a forestplot layout.

Let’s get right to it and plot an example before with delve into the details of the input parameters.

These are the linear associations of BMI to 30 blood biomarkers - selected somewhat randomly - along with their 95% confidence intervals.

Now let’s take a closer look at the example dataset used above in order to understand what is the input that ggforestplot::forestplot() expects.

The dataset ggforestplot::df_linear_associations is a tibble() and contains linear associations of blood biomarkers to Body Mass Index (BMI), insulin resistance (log(HOMA-IR)) and fasting glucose as found in A. V. Ahola-Olli et al. (2019). Below are the first 10 rows.

The variables, as also stated in the documentation, are:

  • name: the names of the Nightingale serum biomarkers (Note: glucose is missing as the results are adjusted for this biomarker).
  • trait: the response variable of the regression model, BMI, log(HOMA-IR) or fasting glucose.
  • beta: regression coefficient \(\beta\).
  • se: standard deviation
  • pvalue: p-value

So the first input parameter for forestplot() must be a data frame with at least the variables name (character), estimate (numeric) and se (numeric), which of course may be named differently.

Adding information on statistical significance

Let’s now start improving the plot above.

First, we need a more descriptive x axis label and we may also add a title. We’d also like to visualize a Bonferroni correction to account for multiple testing. Here we have a comparison of 30 biomarkers. If we suppose the common significance threshold \(\alpha = 0.05\), the Bonferroni correction for each individual hypothesis, assuming the 30 tests are independent, is \(\alpha = 0.05 / 30 \approx 0.002\) (Note: in fact for biologically correlated measures this correction is too strict; here we could, for example, perform a principal component analysis and instead of the number of biomarkers we’d choose the number of principal components that explain a fair amount of variance in the metabolic data, e.g. 99%; see also vignette("nmr-data-analysis-tutorial").

In the forestplot function, we will input the significance threshold as the parameter psignif and we will define explicitly the variable name in df that contains the p-values of the linear regression, in this case the column pvalue.

Notice how the non-significant results are now displayed as hollow points.

Grouping the biomarkers

Finally, we would like to group the blood biomarkers by category to improve the readability of the plot.

The package includes also a data frame, ggforestplot::df_NG_biomarker_metadata, with metadata on the Nightingale blood biomarkers, such as different naming options, descriptions, and group/sugroup information. We will use the grouping information in this data frame to plot the 30 biomarkers above in a grouped layout.

We can achieve this by joining the biomarker metadata with the data frame df_compare_traits above, that contains the associations. We will then use ggforce::facet_col to facet by group.

Notice above, how the order of the biomarkers is now different than in the previous plots. This is the order of groups and biomarkers in df_grouping.

Another note is that we had to define manually a common axis for all the plots. For that we added the layer ggplot2::coord_cartesian(xlim = c(-0.3, 0.4)) in each forestplot() call. In order to know, a priori, what is the correct common x limits you are looking for, you need to estimate the confidence intervals from the standard errors. In practice however, you may simply run the above piece of code twice, once without setting x limits and, after visual inspection, rerun adding ggplot2::coord_cartesian(xlim = c(xmin, xmax)).

Finally, the implementation above removes x-axis texts and labels for each group. You may want to comment that out depending on how long is your column of forestplots but, at least, keep axis.text.x for only the last plot.

Plotting odds / hazard ratios

The package includes also a second demo dataset from the same paper, ggforestplot::df_logodds_associations, with log odds ratios of blood biomarkers with incident type 2 diabetes. The beta, se and pvalue variables in this set are the result of logistic regression and the additonal n variable reports the cohorts sample size.

We will use this dataset to demonstrate how to plot odds ratios (the same logic applies for hazard ratios).

The dataset includes log odds ratios with incident type 2 diabetes for a total of 4 cohorts plus the meta-analysis we saw above. Let’s plot the odds ratios for amino acids.

Notice that the additional parameter set here is logodds = TRUE. What happens, in this case, is that the beta values are exponentiated and plotted in a logarithmic scale. The confidence intervals for the odds ratios are estimated by default, using the standard errors of the log odds ratios, as follows \(\text{CI}_\text{low} = \exp(\beta - 1.96 * \text{SE})\) and \(\text{CI}_\text{high} = \exp(\beta + 1.96 * \text{SE})\). If you wish to use some confidence interval other than the 95%, you may specify this in the parameter ci of the function (see below).

Additionally, you may set a scale_shape_manual(). For example, a common convention is to plot meta-analysis results with a diamond shape.

Finally, the ggforestplot package provides a custom function that plots and prints all Nightingale blood biomarkers in a single pdf file, called plot_all_NG_biomarkers(). There are two layouts available for 2016 and 2020 versions of the biomarker platform. Alternatively, one can also input a custom layout. Its usage is straightforward.

You may view the output of this function here.