adaptMT Package

The adaptMT package is an R implementation of Adaptive P-value Thresholding (Lei and Fithian 2016). The package has two main components: a generic interface that allows users to specify the working model and the algorithms to fit it, and a pool of easy-to-use end-to-end wrappers. The former is captured by the function adapt. The latter includes adapt_glm, adapt_gam and adapt_glmnet for GLMs, GAMs and L1-regularized GLMs, respectively.
# install the "adaptMT" package from github.
# will be submitted to CRAN very soon.
# devtools::install_github("lihualei71/adaptMT")
library("adaptMT")
AdaPT is a flexible iterative protocol that allows almost arbitrary exploration of the data while still guaranteeing finite-sample False Discovery Rate (FDR) control under standard assumptions in the literature. The pipeline of AdaPT is as follows: at each step the analyst proposes a rejection threshold, sees only the partially masked p-values, and computes a conservative estimate of the false discovery proportion (FDPhat); the threshold is updated until FDPhat falls below the target level α, at which point all p-values below the threshold are rejected.
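To fix ideas, here is a deliberately simplified, hypothetical sketch of the accounting that drives the protocol. It ignores the model-fitting step entirely and shrinks a constant threshold, whereas the real procedure updates a covariate-dependent threshold s_t(x) using local FDR estimates fitted on the partially masked p-values.

# Toy sketch of AdaPT's stopping rule (hypothetical helper, not part of adaptMT).
# It shrinks a constant threshold s until the conservatively estimated FDP
# drops below alpha; adaptMT instead updates a covariate-dependent threshold.
adapt_sketch <- function(pvals, alpha, s = 0.45, shrink = 0.99) {
  repeat {
    R <- sum(pvals <= s)            # candidate rejections
    A <- sum(pvals >= 1 - s)        # mirror count, proxy for false discoveries
    fdphat <- (1 + A) / max(R, 1)   # conservative FDP estimate
    if (fdphat <= alpha || s < 1e-6) break
    s <- s * shrink                 # placeholder update step
  }
  list(s = s, rejections = which(pvals <= s), fdphat = fdphat)
}
# e.g. adapt_sketch(runif(1000)^2, alpha = 0.1)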
Overview of the adaptMT Package

The adaptMT package provides three convenient wrappers, adapt_glm, adapt_gam and adapt_glmnet, which generate the results in Section 5 of (Lei and Fithian 2016). In later versions we will add adapt_gbm (generalized boosting machine), adapt_rf (random forest), etc. We recommend reading our paper first to become familiar with the basic concepts and then scanning the documentation of these functions; see ?adapt_glm, ?adapt_gam, ?adapt_glmnet.
adapt_glm

We illustrate one of the main functions, adapt_glm, which runs AdaPT with a logistic-Gamma GLM as the working model, on the estrogen dataset, a gene/drug response dataset from the NCBI Gene Expression Omnibus (GEO). The estrogen dataset consists of gene expression measurements for n = 22283 genes, in response to estrogen treatments in breast cancer cells for five groups of patients, with different dosage levels and 5 trials in each. The task is to identify the genes responding to a low dosage. The p-value p_i for gene i is obtained by a one-sided permutation test which evaluates evidence for a change in gene expression level between the control group (placebo) and the low-dose group. The p-values {p_i : i ∈ [n]} are then ordered according to permutation t-statistics comparing the control and low-dose data, pooled, against data from a higher dosage (with genes that appear to have a strong response at higher dosages placed earlier in the list). The code to compute the p-values and the ordering can be found on Rina Barber's website.
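For readers unfamiliar with permutation p-values, a minimal sketch of a one-sided permutation test for a single gene might look as follows; here control and low_dose are hypothetical numeric vectors of expression values, and the actual pre-processing used for estrogen is the one documented on Rina Barber's website.

# Minimal one-sided permutation p-value for a single gene (illustration only).
perm_pvalue <- function(control, low_dose, B = 10000) {
  obs <- mean(low_dose) - mean(control)          # observed shift in expression
  pooled <- c(control, low_dose)
  n1 <- length(control)
  perm_stats <- replicate(B, {
    idx <- sample(length(pooled), n1)            # random relabelling
    mean(pooled[-idx]) - mean(pooled[idx])
  })
  (1 + sum(perm_stats >= obs)) / (B + 1)         # one-sided p-value
}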
Here we take the top 5000 genes in this ordering for illustration. The results for the full dataset are given in Section 5.1 of (Lei and Fithian 2016).
# load the data.
data("estrogen")
# Take the first 5000 genes
if (!requireNamespace("dplyr")){
install.packages("dplyr")
}
#> Loading required namespace: dplyr
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
estrogen <- select(estrogen, pvals, ord_high) %>%
filter(ord_high <= 5000)
rownames(estrogen) <- NULL
head(estrogen, 5)
#> pvals ord_high
#> 1 0.05011062 366
#> 2 0.71404053 4772
#> 3 0.06675970 3562
#> 4 0.40392007 790
#> 5 0.40614415 2291
summary(estrogen)
#> pvals ord_high
#> Min. :0.000011 Min. : 1
#> 1st Qu.:0.076082 1st Qu.:1251
#> Median :0.238279 Median :2500
#> Mean :0.315094 Mean :2500
#> 3rd Qu.:0.501009 3rd Qu.:3750
#> Max. :0.999289 Max. :5000
Now we run adapt_glm on this dataset. adapt_glm takes a conditional logistic-Gamma GLM as the working model by default. Specifically, it models the p-values as

H_i | x_i ~ Bernoulli(π(x_i)),   logit(π(x_i)) = ϕ(x_i)^T β,
−log p_i | H_i, x_i ~ Exp(1) if H_i = 0,   Exp(μ(x_i)) if H_i = 1,   with 1/μ(x_i) = ϕ(x_i)^T γ,

where ϕ(x) is a featurization of x. In this example we use spline bases, given by the ns function from the splines package. For illustration, we choose our candidate models to be the above GLMs with ϕ(x) being natural spline bases with equally spaced knots and degrees of freedom ranging from 6 to 10. We use BIC to select the best model at the initial stage and then use the selected model for all subsequent model fitting.
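To make the working model concrete, the following toy simulation draws p-values from it with an arbitrary one-dimensional featurization and made-up coefficients; this is an illustration only and is unrelated to the estrogen fit.

# Simulate from the logistic-Gamma working model (arbitrary coefficients).
set.seed(1)
xx <- runif(1000)
pi_x <- plogis(-1 + 2 * xx)                        # P(H_i = 1 | x_i)
H <- rbinom(1000, 1, pi_x)
mu_x <- 1 + 2 * xx                                 # mean of -log(p_i) under the alternative
p_sim <- ifelse(H == 1,
                exp(-rexp(1000, rate = 1 / mu_x)), # non-null: -log p ~ Exp with mean mu(x)
                runif(1000))                       # null: p ~ Uniform(0, 1)
hist(p_sim, breaks = 50)                           # enrichment near zero from the non-nulls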
# prepare the inputs of AdaPT
# need "splines" package to construct the formula for glm
library("splines")
pvals <- as.numeric(estrogen$pvals)
x <- data.frame(x = as.numeric(estrogen$ord_high))
formulas <- paste0("ns(x, df = ", 6:10, ")")
formulas
#> [1] "ns(x, df = 6)" "ns(x, df = 7)" "ns(x, df = 8)" "ns(x, df = 9)"
#> [5] "ns(x, df = 10)"
The adapt_glm function provides several user-friendly tools to monitor progress. For model selection, a progress bar is shown in the console by default, indicating what proportion of the candidate models have been fitted; this can be turned off by setting verbose$ms = FALSE. Similarly, for model fitting a progress bar can be shown in the console, though not by default, by setting verbose$fit = TRUE. Also, by default, the progress of the main process is printed to the console, indicating (1) which target FDR level has been reached; (2) FDPhat at each target FDR level; (3) the number of rejections at each target FDR level.
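For instance, a call that disables the model-selection progress bar and enables the model-fitting one might look as follows. The element names fit and ms are taken from the description above, while the print element is an assumption; consult ?adapt_glm for the exact structure of the verbose argument.

# Hypothetical call illustrating the verbose options described above (not run);
# the `print` element is assumed here; see ?adapt_glm for the exact structure.
# res_glm <- adapt_glm(x = x, pvals = pvals,
#                      pi_formulas = formulas, mu_formulas = formulas,
#                      verbose = list(print = TRUE, fit = TRUE, ms = FALSE))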
# run AdaPT
res_glm <- adapt_glm(x = x, pvals = pvals, pi_formulas = formulas, mu_formulas = formulas)
#> Model selection starts!
#> Shrink the set of candidate models if it is too time-consuming.
#>   |                                                  |   0%
#>   |==========                                        |  20%
#>   |====================                              |  40%
#>   |==============================                    |  60%
#>   |========================================          |  80%
#>   |==================================================| 100%
#> alpha = 0.29: FDPhat 0.2899, Number of Rej. 3474
#> alpha = 0.28: FDPhat 0.28, Number of Rej. 3325
#> alpha = 0.27: FDPhat 0.2698, Number of Rej. 3084
#> alpha = 0.26: FDPhat 0.2598, Number of Rej. 2925
#> alpha = 0.25: FDPhat 0.2498, Number of Rej. 2882
#> alpha = 0.24: FDPhat 0.2399, Number of Rej. 2834
#> alpha = 0.23: FDPhat 0.23, Number of Rej. 2761
#> alpha = 0.22: FDPhat 0.2198, Number of Rej. 2703
#> alpha = 0.21: FDPhat 0.2096, Number of Rej. 2595
#> alpha = 0.2: FDPhat 0.1998, Number of Rej. 2547
#> alpha = 0.19: FDPhat 0.1897, Number of Rej. 2398
#> alpha = 0.18: FDPhat 0.1797, Number of Rej. 2287
#> alpha = 0.17: FDPhat 0.1699, Number of Rej. 2231
#> alpha = 0.16: FDPhat 0.1599, Number of Rej. 2139
#> alpha = 0.15: FDPhat 0.15, Number of Rej. 2060
#> alpha = 0.14: FDPhat 0.1399, Number of Rej. 1966
#> alpha = 0.13: FDPhat 0.1297, Number of Rej. 1889
#> alpha = 0.12: FDPhat 0.1195, Number of Rej. 1824
#> alpha = 0.11: FDPhat 0.1097, Number of Rej. 1723
#> alpha = 0.1: FDPhat 0.0999, Number of Rej. 1582
#> alpha = 0.09: FDPhat 0.0895, Number of Rej. 1430
#> alpha = 0.08: FDPhat 0.0798, Number of Rej. 1240
#> alpha = 0.07: FDPhat 0.0697, Number of Rej. 1105
#> alpha = 0.06: FDPhat 0.0594, Number of Rej. 960
#> alpha = 0.05: FDPhat 0.0495, Number of Rej. 909
#> alpha = 0.04: FDPhat 0.0389, Number of Rej. 796
#> alpha = 0.03: FDPhat 0.0297, Number of Rej. 539
#> alpha = 0.02: FDPhat 0.0179, Number of Rej. 224
plot_1d_thresh gives the plot of the rejection threshold as a function of x (which must be univariate without repeated values) for a given α. We display the plots for α ∈ {0.3, 0.25, 0.2, 0.15, 0.1, 0.05}.
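A sketch of such a call, mirroring the plot_1d_lfdr loop shown below; we assume plot_1d_thresh accepts the same arguments (see ?plot_1d_thresh).

par(mfrow = c(2, 3))
for (alpha in seq(0.3, 0.05, -0.05)){
  nrejs <- res_glm$nrejs[floor(alpha * 100)]
  title <- paste0("alpha = ", alpha, ", nrejs = ", nrejs)
  plot_1d_thresh(res_glm, x, pvals, alpha, title, disp_ymax = 0.25, xlab = "order")
}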
plot_1d_lfdr gives the plot of the estimated local FDR as a function of x (again, univariate without repeated values) for a given α. We display the plots for α ∈ {0.3, 0.25, 0.2, 0.15, 0.1, 0.05}. It is clear that the estimated local FDR remains almost unchanged across levels, indicating that the information loss caused by partial masking is small.
par(mfrow = c(2, 3))
for (alpha in seq(0.3, 0.05, -0.05)){
nrejs <- res_glm$nrejs[floor(alpha * 100)]
title <- paste0("alpha = ", alpha, ", nrejs = ", nrejs)
plot_1d_lfdr(res_glm, x, pvals, alpha, title, disp_ymax = 0.25, xlab = "order")
}
adapt_gam

adapt_gam has exactly the same setting as adapt_glm except that a generalized additive model (GAM) is fitted instead of a generalized linear model. We refer the readers to (Wood 2006) for more details on GAMs.

We illustrate adapt_gam using a simulated dataset, under settings similar to Section 5.2 of (Lei and Fithian 2016). In this case, x_i is a two-dimensional vector generated from an equi-spaced 20×20 grid in the area [−100, 100] × [−100, 100]. The p-values are generated from a one-sided normal test, i.e.

p_i = 1 − Φ(z_i),   z_i ∼ N(μ_i, 1),   μ_i = 0 for i ∈ H0 and μ_i = 2 for i ∉ H0.

The non-nulls are placed in a circle of radius 40 centered at the origin of the grid. The following code generates the p-values:
# Generate a 2-dim x
n <- 400
x1 <- x2 <- seq(-100, 100, length.out = 20)
x <- expand.grid(x1, x2)
colnames(x) <- c("x1", "x2")
# Generate p-values (one-sided z test)
# Note: H0 is TRUE inside the circle, i.e. it marks the non-nulls (mu = 2)
H0 <- apply(x, 1, function(coord){sum(coord^2) < 40^2})
mu <- ifelse(H0, 2, 0)
set.seed(0)
zvals <- rnorm(n) + mu
pvals <- 1 - pnorm(zvals)
Now we apply adapt_gam to this dataset. adapt_gam is built on the mgcv package, which provides an advanced implementation of the gam function. Users need to install the mgcv package first.
# install.packages("mgcv")
library("mgcv")
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
#> This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
By default, adapt_gam takes a logistic-Gamma GAM as the working model, and the only modelling input is the (list of) partial formula(s), specified as in adapt_glm. Here we take advantage of the automatic smoothness selection of the s function from the mgcv package, based on the generalized cross-validation criterion, to avoid manual model selection; see ?s for details. In particular, we consider a two-dimensional spline basis for x1 and x2.
formula <- "s(x1, x2)"
res_gam <- adapt_gam(x = x, pvals = pvals, pi_formulas = formula, mu_formulas = formula)
#> alpha = 0.83: FDPhat 0.8299, Number of Rej. 194
#> alpha = 0.82: FDPhat 0.8196, Number of Rej. 194
#> alpha = 0.81: FDPhat 0.8093, Number of Rej. 194
#> alpha = 0.8: FDPhat 0.7979, Number of Rej. 193
#> alpha = 0.79: FDPhat 0.7895, Number of Rej. 190
#> alpha = 0.78: FDPhat 0.7789, Number of Rej. 190
#> alpha = 0.77: FDPhat 0.7684, Number of Rej. 190
#> alpha = 0.76: FDPhat 0.7579, Number of Rej. 190
#> alpha = 0.75: FDPhat 0.75, Number of Rej. 172
#> alpha = 0.74: FDPhat 0.7349, Number of Rej. 166
#> alpha = 0.73: FDPhat 0.7273, Number of Rej. 165
#> alpha = 0.72: FDPhat 0.7195, Number of Rej. 164
#> alpha = 0.71: FDPhat 0.7055, Number of Rej. 163
#> alpha = 0.7: FDPhat 0.6987, Number of Rej. 156
#> alpha = 0.69: FDPhat 0.6842, Number of Rej. 152
#> alpha = 0.68: FDPhat 0.6776, Number of Rej. 152
#> alpha = 0.67: FDPhat 0.6667, Number of Rej. 150
#> alpha = 0.66: FDPhat 0.66, Number of Rej. 150
#> alpha = 0.65: FDPhat 0.6467, Number of Rej. 150
#> alpha = 0.64: FDPhat 0.64, Number of Rej. 150
#> alpha = 0.63: FDPhat 0.627, Number of Rej. 126
#> alpha = 0.62: FDPhat 0.6129, Number of Rej. 124
#> alpha = 0.61: FDPhat 0.6048, Number of Rej. 124
#> alpha = 0.6: FDPhat 0.5968, Number of Rej. 124
#> alpha = 0.59: FDPhat 0.5854, Number of Rej. 123
#> alpha = 0.58: FDPhat 0.5772, Number of Rej. 123
#> alpha = 0.57: FDPhat 0.5691, Number of Rej. 123
#> alpha = 0.56: FDPhat 0.5574, Number of Rej. 122
#> alpha = 0.55: FDPhat 0.5492, Number of Rej. 122
#> alpha = 0.54: FDPhat 0.5372, Number of Rej. 121
#> alpha = 0.53: FDPhat 0.5254, Number of Rej. 118
#> alpha = 0.52: FDPhat 0.5133, Number of Rej. 113
#> alpha = 0.51: FDPhat 0.5046, Number of Rej. 109
#> alpha = 0.5: FDPhat 0.4948, Number of Rej. 97
#> alpha = 0.49: FDPhat 0.4845, Number of Rej. 97
#> alpha = 0.48: FDPhat 0.4787, Number of Rej. 94
#> alpha = 0.47: FDPhat 0.4674, Number of Rej. 92
#> alpha = 0.46: FDPhat 0.4565, Number of Rej. 92
#> alpha = 0.45: FDPhat 0.4396, Number of Rej. 91
#> alpha = 0.44: FDPhat 0.4396, Number of Rej. 91
#> alpha = 0.43: FDPhat 0.4286, Number of Rej. 91
#> alpha = 0.42: FDPhat 0.4176, Number of Rej. 91
#> alpha = 0.41: FDPhat 0.4066, Number of Rej. 91
#> alpha = 0.4: FDPhat 0.3924, Number of Rej. 79
#> alpha = 0.39: FDPhat 0.3846, Number of Rej. 78
#> alpha = 0.38: FDPhat 0.3718, Number of Rej. 78
#> alpha = 0.37: FDPhat 0.3611, Number of Rej. 72
#> alpha = 0.36: FDPhat 0.3472, Number of Rej. 72
#> alpha = 0.35: FDPhat 0.3472, Number of Rej. 72
#> alpha = 0.34: FDPhat 0.338, Number of Rej. 71
#> alpha = 0.33: FDPhat 0.3286, Number of Rej. 70
#> alpha = 0.32: FDPhat 0.3188, Number of Rej. 69
#> alpha = 0.31: FDPhat 0.3043, Number of Rej. 69
#> alpha = 0.3: FDPhat 0.2899, Number of Rej. 69
#> alpha = 0.29: FDPhat 0.2899, Number of Rej. 69
#> alpha = 0.28: FDPhat 0.2754, Number of Rej. 69
#> alpha = 0.27: FDPhat 0.2647, Number of Rej. 68
#> alpha = 0.26: FDPhat 0.2576, Number of Rej. 66
#> alpha = 0.25: FDPhat 0.2462, Number of Rej. 65
#> alpha = 0.24: FDPhat 0.2381, Number of Rej. 63
#> alpha = 0.23: FDPhat 0.2241, Number of Rej. 58
#> alpha = 0.22: FDPhat 0.2069, Number of Rej. 58
#> alpha = 0.21: FDPhat 0.2069, Number of Rej. 58
#> alpha = 0.2: FDPhat 0.193, Number of Rej. 57
#> alpha = 0.19: FDPhat 0.1786, Number of Rej. 56
#> alpha = 0.18: FDPhat 0.1786, Number of Rej. 56
#> alpha = 0.17: FDPhat 0.1667, Number of Rej. 54
#> alpha = 0.16: FDPhat 0.1481, Number of Rej. 54
#> alpha = 0.15: FDPhat 0.1481, Number of Rej. 54
#> alpha = 0.14: FDPhat 0.1296, Number of Rej. 54
#> alpha = 0.13: FDPhat 0.1296, Number of Rej. 54
#> alpha = 0.12: FDPhat 0.1154, Number of Rej. 52
#> alpha = 0.11: FDPhat 0.098, Number of Rej. 51
#> alpha = 0.1: FDPhat 0.098, Number of Rej. 51
#> alpha = 0.09: FDPhat 0.0833, Number of Rej. 48
#> alpha = 0.08: FDPhat 0.0789, Number of Rej. 38
#> alpha = 0.07: FDPhat 0.0571, Number of Rej. 35
#> alpha = 0.06: FDPhat 0.0571, Number of Rej. 35
We visualize the results below. The left figure shows the truth, where the darker pixels correspond to the non-nulls, and the right figure shows the rejection set given by adapt_gam for α = 0.1.
par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
# Truth
plot(x[, 1], x[, 2], type = "p", xlab = "", ylab = "", col = ifelse(H0, "#000000", "#A9A9A9"), cex = 2, pch = 15)
# Rejection set at alpha = 0.1 (column 10 of the threshold matrix res_gam$s corresponds to alpha = 0.1)
rej <- pvals <= res_gam$s[, 10]
plot(x[, 1], x[, 2], type = "p", xlab = "", ylab = "", col = ifelse(rej, "#800000", "#FFB6C1"), cex = 2, pch = 15)
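Since the truth is known in this simulation, we can also check the rejection set numerically; recall that the variable H0 marks the non-nulls here.

# Realized false discovery proportion and power of the alpha = 0.1 rejection set
sum(rej & !H0) / max(sum(rej), 1)   # realized FDP
sum(rej & H0) / sum(H0)             # realized power (fraction of non-nulls rejected)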
Similar to the 1d case, plot_2d_thresh gives the plot of the rejection threshold as a function of x (which must be bivariate without repeated values) for a given α. In particular, we set α = 0.1.
plot_2d_thresh(res_gam, x, pvals, 0.1, "P-Value Threshold", xlab = "", ylab = "")
Unlike the 1d case, visualizing the level surfaces of the local FDR estimates is not straightforward due to the extra dimension. plot_2d_lfdr instead plots the local FDR estimates obtained when all p-values are set to a user-specified value targetp, for a given α.
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
for (targetp in c(0.001, 0.005, 0.01, 0.05)){
title <- paste0("Local FDR Estimates (p = ", targetp, ")")
plot_2d_lfdr(res_gam, x, pvals, 0.1, title, targetp, xlab = "", ylab = "")
}
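As with adapt_glm, the pi_formulas and mu_formulas arguments accept a vector of candidate formulas, so several GAM specifications can be compared at the model-selection stage. A hypothetical sketch (not run):

# Candidate GAM formulas, e.g. additive univariate smooths vs. a bivariate smooth
# formulas_gam <- c("s(x1) + s(x2)", "s(x1, x2)")
# res_gam2 <- adapt_gam(x = x, pvals = pvals,
#                       pi_formulas = formulas_gam, mu_formulas = formulas_gam)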
adapt_glmnet

adapt_glmnet is appropriate for cases where the covariates are high-dimensional. Instead of fitting GLMs or GAMs, adapt_glmnet fits a penalized GLM at each step. Unlike adapt_glm and adapt_gam, the covariate x should be of class matrix, instead of data.frame, and adapt_glmnet always takes the full matrix x as the covariates. As a result, adapt_glmnet has a simpler interface than adapt_glm and adapt_gam in that it does not need inputs like formula. We refer the readers to (Hastie and Qian 2014) for more details on glmnet.

We illustrate adapt_glmnet using a simulated dataset, under settings similar to Section 5.2 of (Lei and Fithian 2016). In this case, we generate n = 500 i.i.d. x_i's, each a 50-dimensional vector whose entries are drawn independently from a uniform distribution on [0, 1].
set.seed(0)
m <- 50
n <- 500
x <- matrix(runif(n * m), n, m)
We consider the p-values from a logistic-Gamma GLM model with

π1(x_i) = exp(−3 + 2x_i1 + 2x_i2) / (1 + exp(−3 + 2x_i1 + 2x_i2)),   μ(x_i) = max(1, 2x_i1 + 2x_i2).
inv_logit <- function(x) {exp(x) / (1 + exp(x))}
pi <- inv_logit(x[, 1] * 2 + x[, 2] * 2 - 3)
mu <- pmax(1, x[, 1] * 2 + x[, 2] * 2)
Now we generate the hypothesis indicators (H0 is TRUE for the non-nulls in this simulation) and the p-values as follows:
H0 <- as.logical(ifelse(runif(n) < pi, 1, 0))
y <- ifelse(H0, rexp(n, 1/mu), rexp(n, 1))
pvals <- exp(-y)
To save computation time, we start adapt_glmnet with s0 = (0.15, 0.15, ..., 0.15), instead of the default value s0 = (0.45, 0.45, ..., 0.45), and set nfits = 5 to reduce the number of model fits in the process (see ?adapt for more details).
res <- adapt_glmnet(x, pvals, s0 = rep(0.15, n), nfits = 5)
#> alpha = 0.59: FDPhat 0.5847, Number of Rej. 118
#> alpha = 0.58: FDPhat 0.5763, Number of Rej. 118
#> alpha = 0.57: FDPhat 0.5641, Number of Rej. 117
#> alpha = 0.56: FDPhat 0.5556, Number of Rej. 117
#> alpha = 0.55: FDPhat 0.547, Number of Rej. 117
#> alpha = 0.54: FDPhat 0.5345, Number of Rej. 116
#> alpha = 0.53: FDPhat 0.5263, Number of Rej. 114
#> alpha = 0.52: FDPhat 0.5179, Number of Rej. 112
#> alpha = 0.51: FDPhat 0.5, Number of Rej. 96
#> alpha = 0.5: FDPhat 0.5, Number of Rej. 96
#> alpha = 0.49: FDPhat 0.4896, Number of Rej. 96
#> alpha = 0.48: FDPhat 0.4737, Number of Rej. 95
#> alpha = 0.47: FDPhat 0.4632, Number of Rej. 95
#> alpha = 0.46: FDPhat 0.4574, Number of Rej. 94
#> alpha = 0.45: FDPhat 0.4457, Number of Rej. 92
#> alpha = 0.44: FDPhat 0.4396, Number of Rej. 91
#> alpha = 0.43: FDPhat 0.4286, Number of Rej. 91
#> alpha = 0.42: FDPhat 0.4167, Number of Rej. 84
#> alpha = 0.41: FDPhat 0.4074, Number of Rej. 81
#> alpha = 0.4: FDPhat 0.3947, Number of Rej. 76
#> alpha = 0.39: FDPhat 0.3889, Number of Rej. 54
#> alpha = 0.38: FDPhat 0.3774, Number of Rej. 53
#> alpha = 0.37: FDPhat 0.3684, Number of Rej. 38
#> alpha = 0.36: FDPhat 0.3514, Number of Rej. 37
#> alpha = 0.35: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.34: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.33: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.32: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.31: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.3: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.29: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.28: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.27: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.26: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.25: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.24: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.23: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.22: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.21: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.2: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.19: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.18: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.17: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.16: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.15: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.14: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.13: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.12: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.11: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.1: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.09: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.08: FDPhat 0.0769, Number of Rej. 13
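As in the GAM example, the rejection set at a particular target level can be recovered from the returned threshold matrix; for instance, for α = 0.1, assuming as above that column 10 of res$s corresponds to α = 0.1:

# Rejection set at alpha = 0.1, recovered from the threshold matrix
rej_glmnet <- which(pvals <= res$s[, 10])
length(rej_glmnet)   # should agree with the count printed above for alpha = 0.1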
EM Pipeline of the adaptMT Package

The data-adaptive / human-in-the-loop part of AdaPT is embedded in the EM sub-pipeline: at each model-fitting step, the partially masked p-values are treated as incomplete data and the working model is fitted by an EM-type algorithm, and it is at this step that users can plug in their own models and fitting algorithms.
References

Hastie, Trevor, and Junyang Qian. 2014. “Glmnet Vignette.” Technical report, Stanford.

Lei, Lihua, and William Fithian. 2016. “AdaPT: An Interactive Procedure for Multiple Testing with Side Information.” arXiv preprint arXiv:1609.06035.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.