Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. The original algorithm is described in Subramanian, Tamayo, et al. (2005); Java implementations are available from the Broad Institute.

The liger package provides a lightweight R implementation of this enrichment test. Given a list of values named by gene, such as p-values or log-fold changes derived from differential expression analysis or other analyses comparing biological states, the package lets you test an a priori defined set of genes for enrichment. This helps interpret which pathways or functions the highly significant or high fold-change genes represent.
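
To make the test concrete, the running-sum statistic at the core of GSEA can be sketched in a few lines of R. This is a minimal illustration of the weighted, Kolmogorov-Smirnov-like score described by Subramanian, Tamayo, et al., and not liger's internal code. The function name enrichment_score and its power argument are hypothetical, and significance would still have to be assessed by permutation.

```r
# Hypothetical helper (not part of liger): running-sum enrichment score.
# values:  named numeric vector, one value per gene
# geneset: character vector of gene names
# power:   exponent applied to |value| when weighting hits (assumed default 1)
enrichment_score <- function(values, geneset, power = 1) {
  ranked <- sort(values, decreasing = TRUE)          # rank genes by value
  in_set <- names(ranked) %in% geneset               # TRUE where a gene is in the set
  hit_weights <- abs(unname(ranked))^power * in_set  # weight hits by |value|^power
  hit_step <- hit_weights / sum(hit_weights)         # increments at hits sum to 1
  miss_step <- (!in_set) / sum(!in_set)              # decrements at misses sum to 1
  running <- cumsum(hit_step - miss_step)            # running sum along the ranking
  running[which.max(abs(running))]                   # maximum deviation from zero
}

# Toy ranking: a set at the top scores positive, a set at the bottom negative.
toy <- c(a = 5, b = 4, c = 3, d = -1, e = -2, f = -3)
enrichment_score(toy, c("a", "b"))
enrichment_score(toy, c("e", "f"))
```

The sketch assumes the gene set overlaps the ranked list and that at least one hit has a nonzero value; otherwise the normalisation divides by zero.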

Examples

Consider a simulated example dataset.

library(liger)
# load gene set
data("org.Hs.GO2Symbol.list")  
# get universe
universe <- unique(unlist(org.Hs.GO2Symbol.list))
# get a gene set
gs <- org.Hs.GO2Symbol.list[[1]]
# simulated example in which every gene in the gene set is strongly enriched
vals <- rnorm(length(universe), 0, 10)
names(vals) <- universe
vals[gs] <- rnorm(length(gs), 100, 10)

head(vals)  # look at vals
##      AKT3   C10orf2      DNA2      LIG3     MEF2A     MGME1 
##  92.04368  91.18714 102.78193  81.08097  94.94418  87.07168

Here, vals can be seen as representing a list of log-fold changes derived from differential expression analysis on samples in two biological states. We want to interpret the set of differentially expressed genes with high positive fold changes using gene set enrichment analysis.
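
For real data, a vector like vals has to be built from expression measurements. The following is only a sketch of one simple way to do that, assuming log2-scale expression matrices with genes in rows; the gene names, the matrices state_a and state_b, and the use of a plain difference of means are all hypothetical stand-ins for a proper differential expression analysis.

```r
set.seed(1)
genes <- paste0("gene", 1:6)  # hypothetical gene names

# Hypothetical log2 expression matrices: 6 genes x 3 samples per state
state_a <- matrix(rnorm(18, mean = 5), nrow = 6, dimnames = list(genes, NULL))
state_b <- matrix(rnorm(18, mean = 5), nrow = 6, dimnames = list(genes, NULL))

# On the log2 scale, a difference of means is a log2 fold change
lfc <- rowMeans(state_b) - rowMeans(state_a)

# rowMeans keeps the row names, so lfc is already named by gene
head(lfc)
```

The result is a named numeric vector with the same shape as vals, so it could be passed as the values argument in the calls that follow.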

Testing individual gene sets

To test for enrichment of a particular gene set:

names(org.Hs.GO2Symbol.list)[[1]]
## [1] "GO:0000002"
gs  # look at gs
##  [1] "AKT3"     "C10orf2"  "DNA2"     "LIG3"     "MEF2A"    "MGME1"   
##  [7] "MPV17"    "OPA1"     "PID1"     "PRIMPOL"  "SLC25A33" "SLC25A36"
## [13] "SLC25A4"  "STOML2"   "TYMP"
gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=500)

## [1] 0.002

In this simulation, we created vals such that gs was obviously enriched, and indeed this gene set exhibits significant enrichment (p = 0.002).
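
Note that 0.002 appears to be the smallest p-value this run can report. The p-values printed by bulk.gsea later in this document (for example 0.001996008) are consistent with an empirical p-value of the form (k + 1) / (n.rand + 1), where k is the number of randomizations at least as extreme as the observed score. This formula is inferred from the printed output, not taken from liger's documentation, so treat the sketch below as an assumption.

```r
n_rand <- 500  # number of randomizations, as in the gsea() call above
k <- 0         # assumed: no randomization was as extreme as the observed score

# Smallest attainable empirical p-value under the assumed formula
p_floor <- (k + 1) / (n_rand + 1)
round(p_floor, 3)
```

Under this assumption, resolving smaller p-values requires a larger n.rand.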

Now to test for enrichment of another gene set:

gs.new <- org.Hs.GO2Symbol.list[[2]]
names(org.Hs.GO2Symbol.list)[[2]]
## [1] "GO:0000003"
head(gs.new)  # look at gs.new
## [1] "ACE"    "ACR"    "ADAM2"  "ADAM20" "ADAM21" "ADAM28"
gsea(values=vals, geneset=gs.new, mc.cores=1, n.rand=500)

## [1] 0.604

In this simulation, we created vals such that gs.new was obviously not enriched, and indeed this gene set does not exhibit significant enrichment (p = 0.604).

If we simulate a more ambiguous case:

# add noise: give 1000 randomly chosen genes values as high as the gene set's
vals[sample(seq_along(universe), 1000)] <- rnorm(1000, 100, 10)
# test previously perfectly enriched gene set again
gs <- org.Hs.GO2Symbol.list[[1]]
gsea(values=vals, geneset=gs, mc.cores=1, n.rand=500)

## [1] 0.044

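The enrichment is weaker, as expected: the p-value rises from 0.002 to 0.044, because 1000 randomly chosen genes now have values as high as those in the gene set.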
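
The dilution can be seen directly by checking how much of the top of the ranking belongs to the gene set before and after the noise is added. The sketch below is self-contained and does not use liger; the universe of 5000 genes named g1 to g5000 and the 15-gene set are hypothetical stand-ins for the GO data used above.

```r
set.seed(42)
universe <- paste0("g", 1:5000)  # hypothetical gene universe
gs <- universe[1:15]             # hypothetical 15-gene set

# Same construction as above: background noise, gene set shifted to 100
vals <- setNames(rnorm(length(universe), 0, 10), universe)
vals[gs] <- rnorm(length(gs), 100, 10)

# Fraction of the 15 top-ranked genes that belong to the gene set
top_before <- names(sort(vals, decreasing = TRUE))[1:15]
frac_before <- mean(top_before %in% gs)

# Add noise: 1000 random genes receive equally high values
vals[sample(seq_along(universe), 1000)] <- rnorm(1000, 100, 10)
top_after <- names(sort(vals, decreasing = TRUE))[1:15]
frac_after <- mean(top_after %in% gs)

c(before = frac_before, after = frac_after)
```

Before the noise is added the gene set occupies the whole top of the ranking; afterwards it competes with roughly a thousand equally high values, which is why the enrichment signal weakens.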

Testing multiple gene sets

We can also test multiple gene sets at once:

bulk.gsea(values=vals, set.list=org.Hs.GO2Symbol.list[1:5], mc.cores=1, n.rand=500)
##                  p.val     q.val     sscore      edge
## GO:0000002 0.001996008 0.0000000  1.9279433  81.08097
## GO:0000003 0.189620758 0.2626667 -0.8531194 105.20476
## GO:0000012 0.017964072 0.0260000 -1.3714705  12.45331
## GO:0000014 0.013972056 0.0260000 -1.3772795  10.36180
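
A table like this is usually filtered before interpretation. The sketch below rebuilds the printed result by hand as a data frame, so it runs without liger, and keeps gene sets with a small q-value and a positive score. It assumes the real return value can be subset like a data frame, which the printed output suggests but this document does not state; the 0.05 cutoff and the variable names are illustrative choices.

```r
# Printed bulk.gsea output from above, re-entered by hand for illustration
res <- data.frame(
  p.val  = c(0.001996008, 0.189620758, 0.017964072, 0.013972056),
  q.val  = c(0.0000000, 0.2626667, 0.0260000, 0.0260000),
  sscore = c(1.9279433, -0.8531194, -1.3714705, -1.3772795),
  edge   = c(81.08097, 105.20476, 12.45331, 10.36180),
  row.names = c("GO:0000002", "GO:0000003", "GO:0000012", "GO:0000014")
)

# Keep gene sets with q.val below 0.05 (illustrative cutoff) and a positive score
sig_pos <- res[res$q.val < 0.05 & res$sscore > 0, ]
rownames(sig_pos)
```

Only GO:0000002, the gene set that was simulated to be enriched, passes both filters.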

To save on computation time, we can also assess significance iteratively:

iterative.bulk.gsea(values=vals, set.list=org.Hs.GO2Symbol.list[1:5], mc.cores=1, n.rand=500)
## initial: [5e+02 - 3] done
##                  p.val       q.val     sscore      edge
## GO:0000002 0.001996008 0.007984032  1.9279433  81.08097
## GO:0000003 0.189620758 0.189620758 -0.8531194 105.20476
## GO:0000012 0.017964072 0.023952096 -1.3714705  12.45331
## GO:0000014 0.013972056 0.023952096 -1.3772795  10.36180
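
The q.val column in this output can be reproduced from the p.val column with a Benjamini-Hochberg adjustment in base R. This is an observation about the printed numbers only; whether iterative.bulk.gsea computes its q-values this way internally is an assumption, not something this document confirms.

```r
# p-values copied from the iterative.bulk.gsea output above
p_val <- c("GO:0000002" = 0.001996008,
           "GO:0000003" = 0.189620758,
           "GO:0000012" = 0.017964072,
           "GO:0000014" = 0.013972056)

# Benjamini-Hochberg adjustment from base R's stats package
q_val <- p.adjust(p_val, method = "BH")
q_val
```

The adjusted values agree with the printed q.val column to the displayed precision.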

R Session Info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] liger_1.1.2
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.0     magrittr_1.5       parallel_3.6.0     tools_3.6.0       
##  [5] Rcpp_1.0.4.6       stringi_1.4.6      highr_0.8          knitr_1.28        
##  [9] stringr_1.4.0      xfun_0.14          matrixStats_0.56.0 evaluate_0.14