| Title: | Penalized Linear Mixed Models for Correlated Data |
| Version: | 4.3.0 |
| Description: | Fits penalized linear mixed models that correct for unobserved confounding factors. 'plmmr' infers and corrects for the presence of unobserved confounding effects such as population stratification and environmental heterogeneity. It then fits a linear model via penalized maximum likelihood. Originally designed for the multivariate analysis of single nucleotide polymorphisms (SNPs) measured in a genome-wide association study (GWAS), 'plmmr' eliminates the need for subpopulation-specific analyses and post-analysis p-value adjustments. Functions for the appropriate processing of 'PLINK' files are also supplied. For examples, see the package homepage https://pbreheny.github.io/plmmr/. |
| License: | GPL-3 |
| URL: | https://pbreheny.github.io/plmmr/, https://github.com/pbreheny/plmmr/ |
| BugReports: | https://github.com/pbreheny/plmmr/issues/ |
| Depends: | bigalgebra, bigmemory, R (≥ 4.4.0) |
| Imports: | biglasso (≥ 1.6.0), data.table, glmnet, Matrix, ncvreg, parallel, utils |
| Suggests: | bigsnpr, bigstatsr, graphics, grDevices, knitr, MASS, rmarkdown, R.utils, tinytest, withr |
| LinkingTo: | BH, bigmemory, Rcpp, RcppArmadillo (≥ 0.8.600) |
| LazyData: | true |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | yes |
| Packaged: | 2026-05-19 13:27:29 UTC; pbreheny |
| Author: | Tabitha K. Peter |
| Maintainer: | Patrick J. Breheny <patrick-breheny@uiowa.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-11 07:20:02 UTC |
plmmr: Penalized Linear Mixed Models for Correlated Data
Description
Fits penalized linear mixed models that correct for unobserved confounding factors. 'plmmr' infers and corrects for the presence of unobserved confounding effects such as population stratification and environmental heterogeneity. It then fits a linear model via penalized maximum likelihood. Originally designed for the multivariate analysis of single nucleotide polymorphisms (SNPs) measured in a genome-wide association study (GWAS), 'plmmr' eliminates the need for subpopulation-specific analyses and post-analysis p-value adjustments. Functions for the appropriate processing of 'PLINK' files are also supplied. For examples, see the package homepage https://pbreheny.github.io/plmmr/.
Author(s)
Maintainer: Patrick J. Breheny patrick-breheny@uiowa.edu (ORCID)
Authors:
Patrick J. Breheny patrick-breheny@uiowa.edu (ORCID)
Tabitha K. Peter tabitha.peter15@gmail.com (ORCID)
Anna C. Reisetter anna-reisetter@uiowa.edu (ORCID)
Yujing Lu
Oscar A. Rysavy oscar-rysavy@uiowa.edu (ORCID)
See Also
Useful links:
Report bugs at https://github.com/pbreheny/plmmr/issues/
A helper function to add predictors to a filebacked matrix of data
Description
A helper function to add predictors to a filebacked matrix of data
Usage
add_predictors(obj, add_predictor, id_var, rds_dir, outfile, quiet)
Arguments
obj |
A |
add_predictor |
Optional: add additional covariates/predictors/features from an external file (i.e., not a PLINK file). |
id_var |
String specifying which column of the PLINK |
rds_dir |
The path to the directory in which you want to create the new |
outfile |
A string with the name of the filepath for the log file |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
A list of 2 components:
- design_matrix: a bigSNP object with an added element representing the matrix that includes the additional predictors as the first few columns
- unpen: an integer vector that ranges from 1 to the number of added predictors. Example: if 2 predictors are added, unpen = 1:2
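The relationship between the added predictors and the unpen indices can be illustrated with a small in-memory sketch. This is not the package's filebacked implementation; the toy matrices geno and extra are hypothetical, and base R's cbind() stands in for the filebacked step.

```r
# Hypothetical toy data: 5 samples, 3 SNP columns (filled column by column)
geno <- matrix(c(0, 1, 2, 1, 0,
                 2, 2, 0, 1, 1,
                 0, 0, 1, 2, 2),
               nrow = 5,
               dimnames = list(paste0("id", 1:5), c("snp1", "snp2", "snp3")))

# Hypothetical external covariates, already in the same row order as geno
extra <- cbind(sex = c(0, 1, 1, 0, 1), age = c(34, 51, 47, 29, 62))

# Added predictors go first, so their column indices are 1, ..., ncol(extra)
design_matrix <- cbind(extra, geno)
unpen <- seq_len(ncol(extra))

colnames(design_matrix)[unpen]
```

Because the added predictors occupy the leading columns, unpen is simply 1:2 here.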
Admix: Semi-simulated SNP data
Description
A dataset containing 100 SNPs, a demographic variable representing ancestry, and a simulated outcome.
Usage
admix
Format
A list with 3 components:
- X
SNP matrix (197 observations of 100 SNPs)
- y
197 x 1 matrix of simulated (continuous) outcomes
- ancestry
vector with ancestry categorization: 0 = African, 1 = African American, 2 = European, 3 = Japanese
Source
https://hastie.su.domains/CASI/
A helper function to support create_design_filebacked()
Description
A helper function to support create_design_filebacked()
Usage
align_ids(id_var, add_predictor, og_ids, outfile, quiet)
Arguments
id_var |
String specifying the variable name of the ID column |
add_predictor |
External data to include in design matrix. This is the |
og_ids |
Character vector with the PLINK ids (FID or IID) from the original data (i.e., the data before any subsetting from handling missing phenotypes) |
outfile |
A string with the name of the filepath for the log file |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
A matrix with the same dimensions as add_predictor
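As a rough illustration of what aligning external data to the PLINK ID order involves, the sketch below reorders a data frame with base R's match(). The data and the column name id are hypothetical, and the internal function may handle missing or unmatched IDs differently.

```r
# IDs in the order they appear in the genotype data (hypothetical values)
og_ids <- c("s1", "s2", "s3", "s4")

# External predictors supplied in a different row order (hypothetical values)
add_predictor <- data.frame(id = c("s3", "s1", "s4", "s2"),
                            age = c(40, 25, 61, 33),
                            stringsAsFactors = FALSE)

# For each genotype ID, find the row of add_predictor that carries it
idx <- match(og_ids, add_predictor$id)

# Reorder so that row i of the result corresponds to og_ids[i]
aligned <- add_predictor[idx, , drop = FALSE]
aligned
```

The result has the same dimensions as the input, with rows rearranged to follow og_ids.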
A version of cbind() for file-backed matrices
Description
A version of cbind() for file-backed matrices
Usage
big_cbind(A, B, C, quiet)
Arguments
A |
in-memory data |
B |
file-backed data |
C |
file-backed placeholder for combined data |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
C, filled in with all column values of A and B combined
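The "fill a pre-allocated placeholder" pattern described above can be sketched with ordinary in-memory matrices. This is only an analogy: the real function writes into a file-backed object, whereas everything below lives in memory and uses base R only.

```r
A <- matrix(1:4, nrow = 2)    # stands in for the in-memory data
B <- matrix(5:10, nrow = 2)   # stands in for the file-backed data

# Pre-allocate the placeholder with room for the columns of both inputs
C <- matrix(NA_integer_, nrow = nrow(A), ncol = ncol(A) + ncol(B))

# Fill the leading columns from A and the remaining columns from B
C[, seq_len(ncol(A))] <- A
C[, ncol(A) + seq_len(ncol(B))] <- B
C
```

Filling a pre-allocated target avoids building a second full copy of the combined data, which matters when B is too large to hold in memory.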
Coef method for cv_plmm class
Description
Coef method for cv_plmm class
Usage
## S3 method for class 'cv_plmm'
coef(object, lambda, which = object$min, ...)
Arguments
object |
An object of class |
lambda |
A numeric vector of lambda values. |
which |
Vector of lambda indices for which to return coefficients. Defaults to lambda index with minimum CVE. |
... |
Additional arguments (not used). |
Value
Returns a named numeric vector. Values are the coefficients of the
model at the specified value(s) of either lambda or which. Names are the
values of lambda.
Examples
cv_fit <- cv_plmm(admix$X, admix$y, return_fit = TRUE)
head(coef(cv_fit))
Coef method for plmm class
Description
Coef method for plmm class
Usage
## S3 method for class 'plmm'
coef(object, lambda, which = seq_along(object$lambda), drop = TRUE, ...)
Arguments
object |
An object of class |
lambda |
A numeric vector of lambda values. |
which |
Vector of lambda indices for which to return coefficients. |
drop |
Logical. Should returned object be coerced to a vector if possible? |
... |
Additional arguments. |
Value
Either a numeric matrix (if model was fit on data stored in memory)
or a sparse matrix (if model was fit on data stored filebacked). Rownames are
feature names, columns are values of lambda.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
coef(fit)[1:10, 41:45]
A function to compute the BLUP
Description
A function to compute the BLUP
Usage
compute_blup(fit, Xb, Sigma_21, idx)
Arguments
fit |
An object returned by |
Xb |
Linear predictor |
Sigma_21 |
Covariance matrix between the training and the testing data. Extracted from |
idx |
Vector of indices of the penalty parameter |
Value
A matrix of the linear predictors + the estimated random effects
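A minimal sketch of the idea, assuming the standard BLUP formula in which the predicted random effect for new observations is Sigma_21 times the inverse training covariance times the training residuals. The function blup_sketch() and the objects Sigma_11 and resid_train are hypothetical names; this is not the package's implementation.

```r
# Hypothetical helper: linear predictor plus predicted random effect
blup_sketch <- function(Xb_test, Sigma_21, Sigma_11, resid_train) {
  # Sigma_21 %*% Sigma_11^{-1} %*% residuals, without forming the inverse
  ranef <- Sigma_21 %*% solve(Sigma_11, resid_train)
  Xb_test + drop(ranef)
}

Sigma_11 <- diag(3)                                  # training covariance (toy)
Sigma_21 <- matrix(c(0.5, 0,   0,
                     0,   0.5, 0),
                   nrow = 2, byrow = TRUE)           # test-by-training covariance (toy)
resid_train <- c(1, -1, 0.5)                         # y minus linear predictor, training data
Xb_test <- c(1, 2)                                   # linear predictor, test data

blup_sketch(Xb_test, Sigma_21, Sigma_11, resid_train)
```

When Sigma_21 is all zeros (test observations unrelated to the training observations), the prediction reduces to the linear predictor alone.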
A function to construct the estimated variance matrix from a PLMM fit
Description
A function to construct the estimated variance matrix from a PLMM fit
Usage
construct_variance(fit, K = NULL, eta = NULL)
Arguments
fit |
An object returned by |
K |
An optional matrix |
eta |
An optional numeric value between 0 and 1; if |
Value
Sigma_hat, a matrix representing the estimated variance
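One common way to write the estimated variance in this kind of model is as a weighted combination of K and the identity matrix, with eta as the weight. The sketch below assumes that form; construct_variance_sketch() is a hypothetical name, and the actual function may scale or assemble the matrix differently.

```r
# Hypothetical helper, assuming Sigma_hat = eta * K + (1 - eta) * I
construct_variance_sketch <- function(K, eta) {
  stopifnot(is.matrix(K), nrow(K) == ncol(K), eta >= 0, eta <= 1)
  eta * K + (1 - eta) * diag(nrow(K))
}

K <- matrix(c(1, 0.5,
              0.5, 1), nrow = 2)   # toy similarity matrix
Sigma_hat <- construct_variance_sketch(K, eta = 0.4)
Sigma_hat
```

With eta = 0 this form gives the identity matrix (independent observations); with eta = 1 it gives K itself.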
A helper function to count constant features
Description
A helper function to count constant features
Usage
count_constant_features(fbm, outfile, quiet)
Arguments
fbm |
A filebacked |
outfile |
String specifying name of log file |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
A numeric vector with the indices of the non-singular columns of the matrix associated with fbm
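For an ordinary in-memory matrix, the same kind of index vector can be obtained by checking each column's standard deviation. This is a sketch only: the tolerance 1e-8 is an arbitrary assumption, and the filebacked version works on a different object type.

```r
# Toy matrix in which column "b" is constant
X <- cbind(a = c(0, 1, 2, 1),
           b = c(1, 1, 1, 1),
           c = c(2, 0, 0, 1))

# Columns whose standard deviation is (numerically) zero carry no information
col_sd <- apply(X, 2, sd)
ns <- which(col_sd > 1e-8)   # indices of the non-constant columns
ns
```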
A helper function to count the number of cores available on the current machine
Description
A helper function to count the number of cores available on the current machine
Usage
count_cores()
Value
The number of cores to use; if parallel is installed, this will be parallel::detectCores(). Otherwise, 1 is returned.
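The behavior described above can be sketched as follows. count_cores_sketch() is a hypothetical name, and the fallback to 1 when detectCores() returns NA is an added assumption rather than documented behavior.

```r
# Hypothetical re-implementation of the documented behavior
count_cores_sketch <- function() {
  if (requireNamespace("parallel", quietly = TRUE)) {
    n <- parallel::detectCores()
    # detectCores() can return NA on some platforms; fall back to 1 (assumption)
    if (is.na(n)) 1L else n
  } else {
    1L
  }
}

n_cores <- count_cores_sketch()
n_cores
```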
A function to create a design for PLMM modeling
Description
A function to create a design for PLMM modeling
Usage
create_design(data_file = NULL, rds_dir = NULL, X = NULL, y = NULL, ...)
Arguments
data_file |
For filebacked data (data from |
rds_dir |
For filebacked data, this is the filepath to the directory/folder where you want the design to be saved.
Note: do not include/append the name you want for the to-be-created file – the name is the argument |
X |
For in-memory data (data in a matrix or data frame), this is the design matrix. Defaults to NULL (this argument does not apply for filebacked data). |
y |
For in-memory data, this is the numeric vector representing the outcome. Defaults to NULL (this argument does not apply for filebacked data). Note: it is the responsibility of the user to ensure that the rows in X and the corresponding elements of y have the same row order, i.e., observations must be in the same order in both the design matrix and in the outcome vector. |
... |
Additional arguments to pass to |
Details
This function is a wrapper for the other create_design...() inner functions; all arguments
included here are passed along to the create_design...() inner function that
matches the type of the data being supplied. Note which arguments are optional
and which ones are not.
Additional arguments for all filebacked data:
- new_file: User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder.
- feature_id: Optional: A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant, etc.). No duplicates allowed.
  - for PLINK data: a string specifying an ID column of the PLINK .fam file. Options are "IID" (default) and "FID"
  - for all other filebacked data: a character vector of unique identifiers (IDs) for each row of the feature data (i.e., the data processed with process_delim())
  - if left NULL (default), X is assumed to have the same row-order as add_outcome. Note: if this assumption is made in error, calculations downstream will be incorrect. Pay close attention here.
- add_outcome: A data frame or matrix with two columns: an ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric.
- outcome_id: A string specifying the name of the ID column in add_outcome
- outcome_col: A string specifying the name of the phenotype column in add_outcome
- na_outcome_vals: Optional: a vector of numeric values used to code NA values in the outcome. Defaults to c(-9, NA_integer_) (the -9 matches PLINK conventions).
- overwrite: Optional: logical - should existing .rds files be overwritten? Defaults to FALSE.
- logfile: Optional: the name (character string) of the prefix of the logfile to be written in rds_dir. Defaults to NULL (no log file written). Note: do not append a .log to the filename; this is done automatically.
- quiet: Optional: logical - should console messages be silenced? Defaults to FALSE
Additional arguments specific to PLINK data:
- add_predictor: Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named.
- predictor_id: Optional (for PLINK data only): A string specifying the name of the column in add_predictor with sample IDs. Required if add_predictor is supplied. The names will be used to subset and align this external covariate(s) with the supplied PLINK data.
Additional arguments specific to delimited file data:
- unpen: Optional: a character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, your delimited file must have column names.
Additional arguments for in-memory data:
- unpen: Optional: a character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names.
Value
An object of class plmm_design, which is a named list with the design matrix,
outcome, penalty factor vector, and other details needed for fitting a model. For filebacked data, this list is stored as an .rds
file and a string with the path to that file is returned. For in-memory data,
the list itself is returned.
Examples
## Example 1: matrix data in-memory ##
admix_design <- create_design(X = admix$X, y = admix$y, unpen = "Snp1")
## Example 2: delimited data ##
# process delimited data
temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)
# prepare outcome data
colon_outcome <- read.delim(find_example_data(path = "colon2_outcome.txt"))
# create a design
colon_design <- create_design(data_file = colon_dat, rds_dir = temp_dir, new_file = "std_colon2",
add_outcome = colon_outcome, outcome_id = "ID", outcome_col = "y", unpen = "sex",
overwrite = TRUE, logfile = "test.log")
# look at the results
colon_rds <- readRDS(colon_design)
str(colon_rds)
## Example 3: PLINK data ##
# process PLINK data
temp_dir <- tempdir()
unzip_example_data(outdir = temp_dir)
plink_data <- process_plink(data_dir = temp_dir,
data_prefix = "penncath_lite",
rds_dir = temp_dir,
rds_prefix = "imputed_penncath_lite",
# imputing the mode to address missing values
impute_method = "mode",
# overwrite existing files in temp_dir
# (you can turn this feature off if you need to)
overwrite = TRUE,
# turning off parallelization - leaving this on causes problems knitting this vignette
parallel = FALSE)
# get outcome data
penncath_pheno <- read.csv(find_example_data(path = 'penncath_clinical.csv'))
outcome <- data.frame(FamID = as.character(penncath_pheno$FamID),
CAD = penncath_pheno$CAD)
unpen_predictors <- data.frame(FamID = as.character(penncath_pheno$FamID),
sex = penncath_pheno$sex,
age = penncath_pheno$age)
# create design where sex and age are always included in the model
pen_design <- create_design(data_file = plink_data,
feature_id = "FID",
rds_dir = temp_dir,
new_file = "std_penncath_lite",
add_outcome = outcome,
outcome_id = "FamID",
outcome_col = "CAD",
add_predictor = unpen_predictors,
predictor_id = "FamID",
logfile = "design",
# again, overwrite if needed; use with caution
overwrite = TRUE)
# examine the design - notice the components of this object
pen_design_rds <- readRDS(pen_design)
A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function
Description
A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function
Usage
create_design_filebacked(
obj,
rds_dir,
new_file,
add_outcome,
outcome_id,
outcome_col,
na_outcome_vals = c(-9, NA_integer_),
feature_id = NULL,
add_predictor = NULL,
predictor_id = NULL,
unpen = NULL,
logfile = NULL,
overwrite = FALSE,
quiet = FALSE
)
Arguments
obj |
The RDS object read in by |
rds_dir |
The path to the directory in which you want to create the new |
new_file |
User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder. |
add_outcome |
A data frame or matrix with two columns: an ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric. |
outcome_id |
A string specifying the name of the ID column in |
outcome_col |
A string specifying the name of the phenotype column in |
na_outcome_vals |
A vector of numeric values used to code NA values in the outcome. Defaults to |
feature_id |
A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed.
|
add_predictor |
Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named. |
predictor_id |
Optional (for PLINK data only): A string specifying the name of the column in |
unpen |
Optional (for delimited file data only): a character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names. |
logfile |
Optional: name of the |
overwrite |
Logical: should existing .rds files be overwritten? Defaults to FALSE. |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
A filepath to the created .rds file containing all the information
for model fitting, including a standardized X and model design information
A function to create a design with an in-memory X matrix
Description
A function to create a design with an in-memory X matrix
Usage
create_design_in_memory(X, y, unpen = NULL)
Arguments
X |
A numeric matrix in which rows correspond to observations (e.g., samples) and columns correspond to features. |
y |
A numeric vector representing the outcome for the model. Note: it is the responsibility of the user to ensure that y and X have the same row order! |
unpen |
An optional character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names. |
Value
A named list containing the standardized design matrix, outcome, penalty factor vector, and other details needed for fitting a model
Create the .log file
Description
Create the .log file
Usage
create_log(outfile)
Arguments
outfile |
String specifying the name of the to-be-created file, without extension |
Value
Nothing is returned; instead, a text file with the suffix .log is created.
If outfile is NULL, the path to the null device is returned.
Cross-validation for plmm
Description
Performs k-fold cross validation for lasso-, MCP-, or SCAD-penalized
linear mixed models over a grid of values for the regularization parameter lambda.
Usage
cv_plmm(
design,
y = NULL,
K = NULL,
eta = NULL,
penalty = "lasso",
type = NULL,
gamma,
alpha = 1,
lambda_min,
nlambda = 100,
lambda,
eps = 1e-04,
max_iter = 10000,
warn = TRUE,
init = NULL,
cluster,
nfolds = 5,
fold = NULL,
seed,
trace = FALSE,
save_rds = NULL,
return_fit = TRUE,
...
)
Arguments
design |
The first argument must be one of three things:
(1) |
y |
Optional: In the case where |
K |
Similarity matrix used to rotate the data. This should either be
(1) a known matrix that reflects the covariance of y,
(2) an estimate (Default is |
eta |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
penalty |
The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP". |
type |
A character argument indicating what should be returned from |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. |
lambda_min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
Length of the sequence of lambda. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a sequence of values of length |
eps |
Convergence threshold. The algorithm iterates until the RMSE for the change in linear predictors for each coefficient is less than |
max_iter |
Maximum number of iterations (total across entire path). Default is 10000. |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
cluster |
Option for in-memory data only: |
nfolds |
The number of cross-validation folds. Default is 5. |
fold |
Which fold each observation belongs to. By default, the observations are randomly assigned. |
seed |
You may set the seed of the random number generator in order to obtain reproducible results. |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each CV fold. Default is FALSE. |
save_rds |
Optional: if a filepath and name without the |
return_fit |
Optional: a logical value indicating whether the fitted model should be returned as a |
... |
Additional arguments to |
Value
A list that includes 14 items:
- type: The type of prediction used ('lp' or 'blup')
- cve: A numeric vector with the cross validation error (CVE) at each value of lambda
- cvse: A numeric vector with the estimated standard error associated with each value of cve
- fold: A numeric n-length vector of integers indicating the fold to which each observation was assigned
- lambda: A numeric vector of lambda values
- fit: The overall fit of the object, including all predictors; this is a list as returned by plmm()
- min: The index corresponding to the value of lambda that minimizes cve
- lambda_min: The lambda value at which cve is minimized
- min1se: The index corresponding to the value of lambda within 1 standard error of that which minimizes cve
- lambda1se: The largest value of lambda such that cve is within 1 standard error of the minimum
- null.dev: A numeric value representing the deviance for the intercept-only model. If you have supplied your own lambda sequence, this quantity may not be meaningful.
- Y: A matrix with the predicted outcome (\hat{y}) values at each value of lambda. Rows are observations, columns are values of lambda.
- loss: A matrix with the loss values at each value of lambda. Rows are observations, columns are values of lambda.
- estimated_Sigma: If type = 'blup', an n x n matrix representing the estimated covariance matrix.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
print(summary(cv_fit))
plot(cv_fit)
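The relationship among the returned components min, lambda_min, min1se, and lambda1se can be illustrated with made-up numbers. The sketch below assumes the "1 standard error" threshold is the minimum CVE plus the standard error at that minimum; the values of lambda, cve, and cvse are hypothetical and do not come from a fitted model.

```r
# Hypothetical cross-validation results over a decreasing lambda sequence
lambda <- c(1.0, 0.5, 0.25, 0.125, 0.0625)
cve    <- c(2.0, 1.4, 1.0, 0.9, 0.95)
cvse   <- c(0.2, 0.15, 0.12, 0.12, 0.13)

# Index and value of lambda that minimize the cross-validation error
min_idx    <- which.min(cve)
lambda_min <- lambda[min_idx]

# Largest lambda whose CVE is within one standard error of the minimum
threshold <- cve[min_idx] + cvse[min_idx]
within    <- which(cve <= threshold)
min1se    <- within[which.max(lambda[within])]
lambda1se <- lambda[min1se]

c(lambda_min = lambda_min, lambda1se = lambda1se)
```

Choosing lambda1se instead of lambda_min favors a sparser model whose estimated error is statistically similar to the best one.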
Cross-validation internal function for cv_plmm()
Description
Internal function for cv_plmm() which calls plmm() on a fold subset of the original data.
Usage
cvf(i, fold, type, cv_args, ...)
Arguments
i |
Fold number to be excluded from fit. |
fold |
n-length vector of fold-assignments. |
type |
A character argument indicating what should be returned from |
cv_args |
List of additional arguments to be passed to plmm. |
... |
Optional arguments to |
Value
A list with three elements:
- loss: a numeric vector with the loss at each value of lambda
- nl: a numeric value indicating the number of lambda values used
- yhat: a numeric value with the predicted outcome values at each lambda
A function to take the eigendecomposition of K
Description
Note: This is faster than taking SVD of X when p >> n
Usage
eigen_K(std_X)
Arguments
std_X |
The standardized design matrix. |
Value
A list with three elements:
- s: The non-zero eigenvalues of K
- U: The eigenvectors of K associated with s
- K: The fully computed K matrix
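A rough base-R sketch of the three returned pieces, assuming K is computed as the scaled cross-product of the standardized design matrix (K = XX'/p). That definition, the use of scale() for standardization, and the 1e-8 cutoff for "non-zero" are assumptions made for illustration; the data are simulated.

```r
set.seed(42)
n <- 5
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Column-standardize; the package may standardize with a different convention
std_X <- scale(X)

# n x n similarity matrix: cheap to decompose when p is much larger than n
K <- tcrossprod(std_X) / p

eig  <- eigen(K, symmetric = TRUE)
keep <- eig$values > 1e-8                 # drop numerically zero eigenvalues
s <- eig$values[keep]
U <- eig$vectors[, keep, drop = FALSE]

length(s)
```

Because the columns of std_X are centered, K has rank at most n - 1, so one eigenvalue is dropped here. Decomposing the n x n matrix K rather than the n x p matrix X is what makes this approach attractive when p >> n.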
Estimate eta (to be used in rotating the data)
Description
This function is called internally by plmm()
Usage
estimate_eta(n, s, U, y, incpt_flag)
Arguments
n |
The number of observations |
s |
The non-zero eigenvalues of K, the realized relationship matrix |
U |
The eigenvectors of K associated with s |
y |
Continuous outcome vector |
incpt_flag |
Logical: Does the model require fitting an intercept? |
Value
a numeric value with the estimated value of eta, the variance parameter
A function to help with accessing example PLINK files.
Description
A function to help with accessing example PLINK files.
Usage
find_example_data(path, parent = FALSE)
Arguments
path |
Argument (string) specifying a path (filename) for an external data file in |
parent |
If the user wants the name of the parent directory where the example data is located, set |
Value
If path = NULL, a character vector of file names is returned. If path is given, then a character string
with the full file path is returned.
Examples
find_example_data(parent = TRUE)
Read in processed data
Description
This function is intended to be called after either process_plink() or process_delim() has been called once.
Usage
get_data(path, returnX = FALSE, trace = TRUE)
Arguments
path |
The file path to the RDS object containing the processed data. Do not add the |
returnX |
Logical: should the design matrix be returned as a numeric matrix that will be stored in memory? Default is FALSE. |
trace |
Logical: should trace messages be shown? Default is TRUE. |
Value
A list with these components:
- std_X, the column-standardized design matrix as either (1) a numeric matrix or (2) a filebacked big.matrix object.
- (if PLINK data) fam, a data frame containing the pedigree information (like a .fam file in PLINK)
- (if PLINK data) map, a data frame containing the feature information (like a .bim file in PLINK)
- ns: A vector indicating which columns of X contain nonsingular features (i.e., features with variance != 0).
- center: A vector of values for centering each column in X
- scale: A vector of values for scaling each column in X
A function to impute SNP data
Description
A function to impute SNP data
Usage
impute_snp_data(
obj,
X,
chr,
impute,
impute_method,
parallel,
outfile,
quiet,
seed = as.numeric(Sys.Date()),
...
)
Arguments
obj |
A |
X |
A matrix of genotype data as returned by |
chr |
A numeric vector of chromosomal locations of the SNPs. |
impute |
Logical: should data be imputed? Defaults to TRUE. |
impute_method |
If
|
parallel |
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See |
outfile |
Optional: the name (character string) of the prefix of the logfile to be written. |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
seed |
Numeric value to be passed as the seed for |
... |
Optional: additional arguments to |
Value
Nothing is returned, but obj$genotypes is overwritten with the imputed version of the data
A function to align genotype and phenotype data
Description
A function to align genotype and phenotype data
Usage
index_samples(
obj,
rds_dir,
indiv_id,
add_outcome,
outcome_id,
outcome_col,
na_outcome_vals,
outfile,
quiet
)
Arguments
obj |
An object created by |
rds_dir |
The path to the directory in which you want to create the new |
indiv_id |
A character string indicating the ID column name in the 'fam' element of the genotype data list. Defaults to 'sample.ID', equivalent to 'IID' in PLINK. The other option is 'family.ID', equivalent to 'FID' in PLINK. |
add_outcome |
A data frame with at least two columns: an ID column and a phenotype column |
outcome_id |
A string specifying the name of the ID column in |
outcome_col |
A string specifying the name of the phenotype column in |
na_outcome_vals |
A vector of numeric values used to code NA values in the outcome. Defaults to |
outfile |
A string with the name of the filepath for the log file |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
a list with two items:
- complete_samples: a data.table with rows corresponding to the samples for which both genotype and phenotype are available.
- outcome_idx: a numeric vector with indices indicating which samples were 'complete' (i.e., which samples from add_outcome had corresponding data in the PLINK files)
Generate nicely formatted lambda vector
Description
Generate nicely formatted lambda vector
Usage
lam_names(l)
Arguments
l |
Vector of lambda values. |
Value
A character vector of formatted lambda value names
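A simple way to produce such names is fixed-point formatting with formatC(). The sketch below uses a fixed 4 decimal places, which is an assumption; lam_names_sketch() is a hypothetical name, and the real helper may choose the number of digits adaptively.

```r
# Hypothetical helper: format lambda values as fixed-width character labels
lam_names_sketch <- function(l, digits = 4) {
  formatC(l, format = "f", digits = digits)
}

lam_names_sketch(c(0.5, 0.25, 0.125))
```

Labels like these are what appear as column names when coefficients are returned across a path of lambda values.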
Evaluate the negative log-likelihood of an intercept-only Gaussian plmm model
Description
This function allows you to evaluate the negative log-likelihood of a linear mixed model under the assumption of a null model in order to estimate the variance parameter, eta.
Usage
log_lik(eta, n, s, U, y, incpt_flag)
Arguments
eta |
Estimated proportion of the variance in the outcome attributable to population/correlation structure |
n |
The number of observations |
s |
The non-zero eigenvalues of K, the realized relationship matrix |
U |
The eigenvectors of K associated with s |
y |
Continuous outcome vector |
incpt_flag |
Logical: Does the model require fitting an intercept? Passed from |
Value
the value of the negative log-likelihood of the PLMM, evaluated with the supplied parameters
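To make the role of eta concrete, here is a minimal sketch of a null-model negative log-likelihood and its minimization. It assumes a centered outcome with no intercept, a covariance proportional to eta * K + (1 - eta) * I, a profiled-out variance scale, and a full set of eigenvectors in U. The function neg_log_lik_sketch() is hypothetical and omits the n and incpt_flag arguments of the documented function.

```r
# Hypothetical helper: negative log-likelihood of y ~ N(0, sigma2 * (eta*K + (1-eta)*I)),
# with sigma2 replaced by its maximizer for the given eta
neg_log_lik_sketch <- function(eta, s, U, y) {
  n <- length(y)
  w <- eta * s + (1 - eta)          # eigenvalues of eta*K + (1-eta)*I
  rot_y <- crossprod(U, y)          # rotate the outcome: U'y
  sigma2 <- sum(rot_y^2 / w) / n    # profiled variance scale
  0.5 * (n * log(2 * pi) + sum(log(w)) + n * log(sigma2) + n)
}

# Simulated data, for illustration only
set.seed(123)
n <- 30
p <- 200
X <- scale(matrix(rnorm(n * p), nrow = n, ncol = p))
K <- tcrossprod(X) / p
eig <- eigen(K, symmetric = TRUE)
y <- rnorm(n)
y <- y - mean(y)

# One-dimensional minimization over eta, kept away from the boundaries
opt <- optimize(neg_log_lik_sketch, interval = c(0.01, 0.99),
                s = eig$values, U = eig$vectors, y = y)
eta_hat <- opt$minimum
eta_hat
```

Working in the rotated coordinates turns the determinant and quadratic form into sums over eigenvalues, so each evaluation costs O(n) once the eigendecomposition is available.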
A helper function to label and summarize the contents of a bigSNP
Description
A helper function to label and summarize the contents of a bigSNP
Usage
name_and_count_bigsnp(obj, id_var, outfile, quiet)
Arguments
obj |
a |
id_var |
String specifying which column of the PLINK |
outfile |
The string with the name of the |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
a list with 7 components:
- na_counts: vector of missing SNP counts in genotypes
- obj: a modified bigSNP list with additional components
- og_plink_ids: either the IID or FID column from .fam, determined by id_var
- chr: p-length vector containing the chromosomes for each SNP
- X: the obj$genotypes as its own FBM
- pos: vector of physical positions of the SNPs
- chr_range: vector containing the minimum and maximum values of chr. Character strings are treated as the maximum.
Fit a linear mixed model via penalized maximum likelihood.
Description
Fit a linear mixed model via penalized maximum likelihood.
Usage
plmm(
design,
y = NULL,
K = NULL,
eta = NULL,
penalty = "lasso",
init = NULL,
gamma,
alpha = 1,
lambda_min,
nlambda = 100,
lambda,
eps = 1e-04,
max_iter = 10000,
dfmax = NULL,
warn = TRUE,
trace = FALSE,
save_rds = NULL,
return_fit = TRUE,
...
)
Arguments
design |
The first argument must be one of three things:
(1) |
y |
Optional: In the case where |
K |
Similarity matrix used to rotate the data. This should either be:
(1) a known matrix that reflects the covariance of y,
(2) an estimate (Default is |
eta |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
penalty |
The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP". |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. |
lambda_min |
The smallest value for lambda, as a fraction of the maximum lambda. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
Length of the sequence of lambda. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a sequence of values of length |
eps |
Convergence threshold. The algorithm iterates until the RMSE for the change in linear predictors for each coefficient is less than |
max_iter |
Maximum number of iterations (total across entire path). Default is 10000. |
dfmax |
Maximum number of non-zero coefficients that may enter the model. Default is NULL (no maximum) |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE. |
save_rds |
Optional: if a filepath and name without the |
return_fit |
Optional: a logical value indicating whether the fitted model should be returned as a |
... |
Additional optional arguments to |
Value
A list which includes 18 items:
- beta_vals: The matrix of estimated coefficients. Rows are predictors (with the first row being the intercept), and columns are values of lambda.
- std_Xbeta: A matrix of the linear predictors on the scale of the standardized design matrix. Rows are predictors, columns are values of lambda. Note: std_Xbeta will not include rows for the intercept or for constant features.
- std_X_details: A list with 9 items:
  - center: The center values used to center the columns of the design matrix.
  - scale: The scaling values used to scale the columns of the design matrix.
  - ns: An integer vector of the nonsingular columns of the original data.
  - unpen: An integer vector of indices of the unpenalized features, if any were specified in the design.
  - unpen_colnames: A character vector of the column names of any unpenalized features.
  - X_colnames: A character vector with the column names of all features in the original design matrix.
  - X_rownames: A character vector with the row names of all features in the original design matrix; if none were provided, these are named 'row1', 'row2', etc.
  - std_X_colnames: A subset of X_colnames representing only nonsingular columns (i.e., the columns indexed by ns).
  - std_X_rownames: A subset of X_rownames representing rows that passed QC filtering and are represented in both the genotype and phenotype data sets (this only applies to PLINK data).
- std_X: If the design matrix is filebacked, the descriptor for the filebacked data is returned using bigmemory::describe(). If the data were stored in memory, nothing is returned (std_X is NULL).
- y: The outcome vector used in model fitting.
- p: The total number of columns in the design matrix (including any singular columns, excluding the intercept).
- plink_flag: Logical: did the data come from PLINK files?
- lambda: A numeric vector of the tuning parameter values used in model fitting.
- eta: A double between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure.
- penalty: A character string indicating the penalty with which the model was fit (e.g., 'MCP').
- gamma: A numeric value indicating the tuning parameter used for the SCAD or MCP penalties. Not relevant for lasso models.
- alpha: A numeric value indicating the elastic net tuning parameter.
- loss: A vector with the numeric values of the loss at each value of lambda (calculated on the rotated scale).
- penalty_factor: A vector of indicators corresponding to each predictor, where 1 = predictor was penalized.
- ns_idx: An integer vector with the indices of predictors which were non-singular features (i.e., features which had variation), where feature 1 is the intercept.
- iter: An integer vector with the number of iterations needed in model fitting for each value of lambda.
- converged: A vector of logical values indicating whether the model fitting converged at each value of lambda.
- K: A list with 2 elements, s and U:
  - s: A vector of the non-zero eigenvalues of the relatedness matrix K (note: K is the kinship matrix for genetic/genomic data; see the article on notation for details).
  - U: A matrix of the eigenvectors of K associated with s.
Examples
# using admix data
fit <- plmm(admix$X, admix$y)
s <- summary(fit, idx = 50)
print(s)
plot(fit)
A function to perform checks on passed objects before model fitting.
Description
A function to perform checks on passed objects before model fitting.
Usage
plmm_checks(
design,
K = NULL,
eta = NULL,
penalty = "lasso",
init = NULL,
gamma,
alpha = 1,
dfmax = NULL,
trace = FALSE,
save_rds = NULL,
return_fit = TRUE,
...
)
Arguments
design |
The design object, as created by |
K |
Similarity matrix used to rotate the data. This should either be
(1) a known matrix that reflects the covariance of y,
(2) an estimate (Default is |
eta |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
penalty |
The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP". |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. |
dfmax |
Maximum number of non-zero coefficients that may enter the model. Default is NULL (no maximum) |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE. |
save_rds |
Optional: if a filepath and name is specified (e.g., |
return_fit |
Optional: a logical value indicating whether the fitted model should be returned as a |
... |
Additional arguments to |
Value
A list which includes 17 items:
- std_X: The standardized design matrix. If the design matrix is filebacked, the descriptor for the filebacked data is returned using bigmemory::describe().
- std_X_details: Metadata for std_X.
- std_X_n: Number of rows in std_X.
- std_X_p: Number of columns in std_X.
- y: Original outcome vector.
- y_name: Variable name of y.
- centered_y: The centered outcome vector.
- K: The relationship matrix (as passed by plmm(); may be NULL).
- eta: Estimated proportion of the variance in the outcome attributable to population/correlation structure (as passed by plmm(); may be NULL).
- fbm_flag: Logical: is std_X filebacked?
- plink_flag: Logical: does std_X originate from PLINK files?
- penalty: A character string indicating the penalty type.
- gamma: Tuning parameter for the SCAD or MCP penalties.
- init: Initialized values for beta coefficients.
- dfmax: Maximum number of non-zero coefficients that may enter the model.
- n: Number of rows in the original design matrix prior to standardization procedures.
- p: Number of columns in the original design matrix prior to standardization procedures.
PLMM fit: A function that fits a PLMM using the values returned by plmm_prep()
Description
PLMM fit: A function that fits a PLMM using the values returned by plmm_prep()
Usage
plmm_fit(
prep,
y,
std_X_details,
fbm_flag,
penalty,
gamma = 3,
alpha = 1,
lambda_min,
nlambda = 100,
lambda,
eps = 1e-04,
max_iter = 10000,
init = NULL,
dfmax = NULL,
warn = TRUE,
...
)
Arguments
prep |
A list as returned from |
y |
The original (not centered) outcome vector. This is needed to estimate the intercept. |
std_X_details |
A list with components |
fbm_flag |
Logical: is std_X a filebacked |
penalty |
The penalty to be applied to the model. Either "lasso", "SCAD", or "MCP". This argument has no default in plmm_fit(). |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. |
lambda_min |
The smallest value for lambda, as a fraction of the maximum lambda. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
Length of the sequence of lambda. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a sequence of values of length |
eps |
Convergence threshold. The algorithm iterates until the RMSE for the change in linear predictors for each coefficient is less than |
max_iter |
Maximum number of iterations (total across entire path). Default is 10000. |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
dfmax |
Maximum number of non-zero coefficients that may enter the model. Default is NULL (no maximum). |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
... |
Additional arguments that can be passed to |
Value
A list which includes 21 items:
- y: The outcome vector used in model fitting.
- std_scale_beta: The matrix of estimated coefficients on the standardized scale. Rows are predictors (with the first row being the intercept), and columns are values of lambda.
- std_Xbeta: A matrix of the linear predictors on the scale of the standardized design matrix. Rows are predictors, columns are values of lambda. Note: std_Xbeta will not include rows for the intercept or for constant features.
- centered_y: The centered outcome vector.
- s: A vector of the non-zero eigenvalues of the relatedness matrix K (note: K is the kinship matrix for genetic/genomic data; see the article on notation for details).
- U: A matrix of the eigenvectors of K associated with s.
- lambda: A numeric vector of the tuning parameter values used in model fitting.
- penalty: A character string indicating the penalty with which the model was fit (e.g., 'MCP').
- penalty_factor: A vector of indicators corresponding to each predictor, where 1 = predictor was penalized.
- iter: An integer vector with the number of iterations needed in model fitting for each value of lambda.
- converged: A vector of logical values indicating whether the model fitting converged at each value of lambda.
- loss: A vector with the numeric values of the loss at each value of lambda (calculated on the rotated scale).
- eta: A double between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure.
- gamma: A numeric value indicating the tuning parameter used for the SCAD or MCP penalties. Not relevant for lasso models.
- alpha: A numeric value indicating the elastic net tuning parameter.
- nlambda: Length of the sequence of lambda.
- eps: Convergence threshold. The algorithm iterates until the RMSE for the change in linear predictors for each coefficient is less than eps.
- max_iter: Maximum number of iterations (total across entire path).
- warn: Return warning messages for failures to converge and model saturation?
- trace: If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process.
- std_X: If the design matrix is filebacked, the descriptor for the filebacked data is returned using bigmemory::describe().
PLMM format: a function to format the output of a model constructed with plmm_fit()
Description
PLMM format: a function to format the output of a model constructed with plmm_fit()
Usage
plmm_format(fit, p, std_X_details, fbm_flag, plink_flag)
Arguments
fit |
A list of parameters describing the output of a model constructed with |
p |
The number of features in the original data (including constant features) |
std_X_details |
A list with 3 items:
|
fbm_flag |
Logical: is the corresponding design matrix filebacked? Passed from |
plink_flag |
Logical: did these data come from PLINK files?
Note: This flag matters because of how non-genomic features
are handled for PLINK files – in data from PLINK files,
unpenalized columns are not counted in the |
Value
A list with 18 components:
- beta_vals: The matrix of estimated coefficients on the original scale. Rows are predictors, columns are values of lambda.
- std_Xbeta: A matrix of the linear predictors on the scale of the standardized design matrix. Rows are predictors, columns are values of lambda. Note: std_Xbeta will not include rows for the intercept or for constant features.
- std_X_details: A list with 9 items:
  - center: The center values used to center the columns of the design matrix.
  - scale: The scaling values used to scale the columns of the design matrix.
  - ns: An integer vector of the nonsingular columns of the original data.
  - unpen: An integer vector of indices of the unpenalized features, if any were specified in the design.
  - unpen_colnames: A character vector of the column names of any unpenalized features.
  - X_colnames: A character vector with the column names of all features in the original design matrix.
  - X_rownames: A character vector with the row names of all features in the original design matrix; if none were provided, these are named 'row1', 'row2', etc.
  - std_X_colnames: A subset of X_colnames representing only nonsingular columns (i.e., the columns indexed by ns).
  - std_X_rownames: A subset of X_rownames representing rows that passed QC filtering and are represented in both the genotype and phenotype data sets (this only applies to PLINK data).
- y: The original outcome vector.
- p: The total number of columns in the design matrix (including any singular columns, excluding the intercept).
- plink_flag: Logical: did the data come from PLINK files?
- lambda: A numeric vector of the tuning parameter values used in model fitting.
- eta: A number (double) between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure.
- penalty: A character string indicating the penalty with which the model was fit (e.g., 'MCP').
- gamma: A numeric value indicating the tuning parameter used for the SCAD or MCP penalties. Not relevant for lasso models.
- alpha: A numeric value indicating the elastic net tuning parameter.
- loss: A vector with the numeric values of the loss at each value of lambda (calculated on the rotated scale).
- penalty_factor: A vector of indicators corresponding to each predictor, where 1 = predictor was penalized.
- ns_idx: A vector with the indices of predictors which were nonsingular features (i.e., had variation).
- iter: A numeric vector with the number of iterations needed in model fitting for each value of lambda.
- converged: A vector of logical values indicating whether the model fitting converged at each value of lambda.
- K: A list with 2 elements, s and U:
  - s: A vector of the non-zero eigenvalues of the relatedness matrix K (note: K is the kinship matrix for genetic/genomic data; see the article on notation for details).
  - U: A matrix of the eigenvectors of K associated with s.
- std_X: If the design matrix is filebacked, the descriptor for the filebacked data is returned using bigmemory::describe().
Loss method for plmm class
Description
Loss method for plmm class
Usage
plmm_loss(y, yhat)
Arguments
y |
Observed outcomes (response) vector |
yhat |
Predicted outcomes (response) vector |
Value
A numeric vector of the squared-error loss values for the given observed and predicted outcomes
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
yhat <- predict(object = fit, newX = admix$X, type = 'lp', lambda = 0.05)
head(plmm_loss(yhat = yhat, y = admix$y))
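As a rough illustration of what a squared-error loss vector looks like, the base R sketch below computes one loss value per observation. The helper name `sq_loss_sketch` and the toy vectors are hypothetical, and the sketch only assumes the element-wise definition (y - yhat)^2; it is not taken from the package source.

```r
# Hypothetical stand-in for an element-wise squared-error loss (assumption,
# not the package's code): one loss value per observation.
sq_loss_sketch <- function(y, yhat) {
  stopifnot(length(y) == length(yhat))
  (y - yhat)^2
}

y    <- c(1.0, 2.5, -0.5)   # toy observed outcomes
yhat <- c(0.5, 2.5,  0.5)   # toy predicted outcomes

losses <- sq_loss_sketch(y, yhat)
losses        # per-observation squared errors
mean(losses)  # averaging gives a mean squared prediction error
```

Returning the per-observation values, rather than their mean, leaves the choice of aggregation (for example, averaging within cross-validation folds) to the caller.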
PLMM prep: a function to run checks, eigendecomposition, and rotation prior to fitting a PLMM model
Description
This is an internal function for plmm()
Usage
plmm_prep(
std_X,
std_X_n,
std_X_p,
centered_y,
penalty_factor,
K = NULL,
eta = NULL,
fbm_flag,
trace = NULL,
...
)
Arguments
std_X |
Column standardized design matrix. May include clinical covariates and other non-SNP data. |
std_X_n |
The number of observations in |
std_X_p |
The number of features in |
centered_y |
Continuous outcome vector, centered. |
penalty_factor |
A multiplicative factor for the penalty applied to each coefficient. |
K |
Similarity matrix used to rotate the data. This should either be:
(1) a known matrix that reflects the covariance of y,
(2) an estimate (Default is |
eta |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
fbm_flag |
Logical: is |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE. |
... |
Not used |
Value
List with these components:
- std_X: Standardized design matrix. If the design matrix is filebacked, the descriptor for the filebacked data is returned using bigmemory::describe().
- centered_y: Vector of centered outcomes.
- K: Similarity matrix.
- s: Vector of the non-zero eigenvalues of K.
- U: Matrix of eigenvectors of K associated with s (same as the left singular vectors of X).
- eta: The numeric value of the estimated eta parameter.
- penalty_factor: A multiplicative factor for the penalty applied to each coefficient.
- incpt_flag: Logical: does the model require fitting an intercept?
- trace: If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process.
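The eigendecomposition and rotation steps can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the covariance form eta * K + (1 - eta) * I, the weights `w`, the value of `eta`, and the simulated data are all assumptions made for the example.

```r
set.seed(1)
n <- 6; p <- 20
X <- scale(matrix(rnorm(n * p), n, p))   # toy column-standardized design
K <- tcrossprod(X) / p                   # n x n similarity matrix

# Eigendecomposition of K, keeping only the non-zero eigenvalues
eig  <- eigen(K, symmetric = TRUE)
keep <- eig$values > 1e-8
s <- eig$values[keep]
U <- eig$vectors[, keep, drop = FALSE]

eta <- 0.5                               # assumed value, for illustration only
w   <- (eta * s + (1 - eta))^(-1/2)      # assumed rotation weights
wUt <- w * t(U)                          # scales row i of t(U) by w[i]

# Under the assumed covariance, the rotated observations are uncorrelated
Sigma       <- eta * K + (1 - eta) * diag(n)
rotated_cov <- wUt %*% Sigma %*% t(wUt)
round(rotated_cov, 10)
```

The point of the sketch is the last line: if the outcome has covariance proportional to `Sigma`, multiplying by `wUt` yields data whose covariance is the identity, which is what allows a penalized linear model to be fit to the rotated data.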
Plot method for cv_plmm class
Description
Plot method for cv_plmm class
Usage
## S3 method for class 'cv_plmm'
plot(
x,
log.l = TRUE,
type = c("cve", "rsq", "scale", "snr", "all"),
selected = TRUE,
vertical.line = TRUE,
col = "red",
...
)
Arguments
x |
An object of class |
log.l |
Logical to indicate the plot should be returned on the natural log scale. Defaults to TRUE. |
type |
Type of plot to return. Options include:
|
selected |
Logical to indicate if the number of variables selected should be plotted on the top axis. Defaults to TRUE. |
vertical.line |
Logical to indicate whether a vertical line should be plotted at the minimum/maximum value. Defaults to TRUE. |
col |
Color for the points along the CV curve. Defaults to "red". |
... |
Additional arguments. |
Value
Nothing is returned; instead, a plot is drawn representing the relationship
between the tuning parameter lambda value (x-axis) and the cross validation error (y-axis).
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cvfit <- cv_plmm(design = admix_design)
plot(cvfit)
Plot method for plmm class
Description
Plot method for plmm class
Usage
## S3 method for class 'plmm'
plot(x, alpha = 1, log.l = FALSE, shade = TRUE, col, ...)
Arguments
x |
An object of class |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty.
|
log.l |
Logical to indicate the plot should be returned on the natural log scale. Defaults to FALSE. |
shade |
Logical to indicate whether a local nonconvex region should be shaded. Defaults to TRUE. |
col |
Vector of colors for coefficient lines. |
... |
Additional arguments. |
Value
Nothing is returned; instead, a plot of the coefficient paths is drawn at each value of lambda (one 'path' for each coefficient).
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
plot(fit)
plot(fit, log.l = TRUE)
Predict method for cv_plmm class
Description
Predict method for cv_plmm class
Usage
## S3 method for class 'cv_plmm'
predict(
object,
newX,
type = c("blup", "coefficients", "vars", "nvars", "lp"),
X,
lambda,
idx = object$min,
...
)
Arguments
object |
An object of class |
newX |
Matrix of values at which predictions are to be made (not used for |
type |
A character argument indicating what type of prediction should be returned. Options are "lp," "coefficients," "vars," "nvars," and "blup." See details. |
X |
Optional: if |
lambda |
A numeric vector of regularization parameter |
idx |
Vector of indices of regularization parameter |
... |
Additional optional arguments |
Details
Define beta-hat as the coefficients estimated at the value of lambda that minimizes cross-validation error (CVE). Then options for type are as follows:
- lp (linear predictor): uses the product of newX and the beta coefficients of object to predict new values of the outcome. This does not incorporate the correlation structure of the data.
- blup (acronym for Best Linear Unbiased Predictor): adds to the lp a value that represents the estimated random effect. This addition is a way of incorporating the estimated correlation structure of the data into our prediction of the outcome.
- coefficients: returns the estimated beta-hat.
- vars: returns the indices of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
- nvars: returns the number of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
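The difference between the lp and blup prediction types can be sketched with simulated data in base R. The sketch assumes the textbook BLUP form, in which the linear predictor is adjusted by Sigma_21 %*% solve(Sigma_11) %*% (training residuals), and assumes a covariance of the form eta * K + (1 - eta) * I. The function name `predict_sketch`, the coefficient values, and the data are hypothetical; the package's own computation may differ in detail.

```r
set.seed(42)
n_train <- 8; n_test <- 3; p <- 30
n_all <- n_train + n_test
X_all <- scale(matrix(rnorm(n_all * p), n_all, p))  # toy standardized features
train <- seq_len(n_train)
test  <- n_train + seq_len(n_test)
y_train <- rnorm(n_train)                           # toy training outcomes

# Hypothetical sparse coefficient vector: intercept followed by p slopes
beta <- c(1.5, rep(0, p))
beta[c(2, 5)] <- c(0.8, -0.5)

predict_sketch <- function(eta) {
  K_all <- tcrossprod(X_all) / p
  Sigma <- eta * K_all + (1 - eta) * diag(n_all)    # assumed covariance form
  Sigma_11 <- Sigma[train, train]                   # training covariance
  Sigma_21 <- Sigma[test, train, drop = FALSE]      # test-by-training covariance
  lp_train <- drop(cbind(1, X_all[train, ]) %*% beta)
  lp_test  <- drop(cbind(1, X_all[test, ]) %*% beta)
  # Estimated random effect for the test rows, from the training residuals
  ranef <- drop(Sigma_21 %*% solve(Sigma_11, y_train - lp_train))
  list(lp = lp_test, blup = lp_test + ranef)
}

pred <- predict_sketch(eta = 0.4)
cbind(lp = pred$lp, blup = pred$blup)
```

When eta is 0 in this sketch, the assumed covariance is the identity, the cross-covariance block is zero, and the two prediction types coincide.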
Value
Depends on the type - see Details
Examples
set.seed(123)
train_idx <- sample(1:nrow(admix$X), 100)
# Note: ^ shuffling is important here! Keeps test and train groups comparable.
train <- list(X = admix$X[train_idx,], y = admix$y[train_idx])
train_design <- create_design(X = train$X, y = train$y)
test <- list(X = admix$X[-train_idx,], y = admix$y[-train_idx])
fit <- cv_plmm(design = train_design)
pred1 <- predict(object = fit, newX = test$X, X = train$X) # Minimum CVE lambda
pred2 <- predict(object = fit, newX = test$X, X = train$X, idx = fit$min1se) # 1 SE lambda
Predict method for plmm class
Description
Predict method for plmm class
Usage
## S3 method for class 'plmm'
predict(
object,
newX,
type = c("blup", "coefficients", "vars", "nvars", "lp"),
X = NULL,
lambda,
idx = seq_along(object$lambda),
...
)
Arguments
object |
An object of class |
newX |
Matrix of values at which predictions are to be made (not used for |
type |
A character argument indicating what type of prediction should be returned. Options are "lp," "coefficients," "vars," "nvars," and "blup." See details. |
X |
Optional: if |
lambda |
A numeric vector of regularization parameter |
idx |
Vector of indices of regularization parameter |
... |
Additional optional arguments |
Details
The options for type are as follows:
- lp (linear predictor): uses the product of newX and the beta coefficients of object to predict new values of the outcome. This does not incorporate the correlation structure of the data.
- blup (default, acronym for Best Linear Unbiased Predictor): adds to the lp a value that represents the estimated random effect. This addition is a way of incorporating the estimated correlation structure of the data into our prediction of the outcome.
- coefficients: returns the estimated beta coefficients.
- vars: returns the indices of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
- nvars: returns the number of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
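To make the vars and nvars types concrete, the base R sketch below derives both from a small, made-up coefficient matrix laid out like beta_vals (first row is the intercept, one column per lambda value). The matrix and its names are hypothetical, and the sketch only mirrors the description above rather than the package's internal code.

```r
# Hypothetical coefficient matrix: rows are predictors, columns are lambda values
beta_vals <- rbind(
  "(Intercept)" = c(2.0, 2.1,  1.9),
  snp1          = c(0.0, 0.3,  0.5),
  snp2          = c(0.0, 0.0,  0.0),
  snp3          = c(0.0, 0.0, -0.2)
)
colnames(beta_vals) <- c("0.5", "0.1", "0.01")

# Drop the intercept row before looking for nonzero coefficients
no_intercept <- beta_vals[-1, , drop = FALSE]

# vars: indices of nonzero coefficients, one element per lambda value
vars <- lapply(seq_len(ncol(no_intercept)),
               function(j) which(no_intercept[, j] != 0))
names(vars) <- colnames(no_intercept)

# nvars: number of nonzero coefficients at each lambda value
nvars <- colSums(no_intercept != 0)

vars
nvars
```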
Value
Depends on the type - see Details
Examples
set.seed(123)
train_idx <- sample(1:nrow(admix$X), 100)
# Note: ^ shuffling is important here! Keeps test and train groups comparable.
train <- list(X = admix$X[train_idx,], y = admix$y[train_idx])
train_design <- create_design(X = train$X, y = train$y)
test <- list(X = admix$X[-train_idx,], y = admix$y[-train_idx])
fit <- plmm(design = train_design)
# make predictions for all lambda values
pred1 <- predict(object = fit, newX = test$X, type = "lp")
pred2 <- predict(object = fit, newX = test$X, type = "blup", X = train$X)
# look at mean squared prediction error
mspe <- apply(pred1, 2, function(c){crossprod(test$y - c)/length(c)})
min(mspe)
mspe_blup <- apply(pred2, 2, function(c){crossprod(test$y - c)/length(c)})
min(mspe_blup) # BLUP is better
# compare the MSPE of our model to a null model, for reference
# null model = intercept only -> y_hat is always mean(y)
crossprod(mean(test$y) - test$y)/length(test$y)
Predict method to use in cross-validation (within cvf())
Description
Predict method to use in cross-validation (within cvf())
Usage
predict_within_cv(fit, testX, type, fbm = FALSE, Sigma_21 = NULL)
Arguments
fit |
A list with the components returned by |
testX |
A design matrix used for computing predicted values (i.e., the test data). |
type |
A character argument indicating what type of prediction should be returned. Passed from |
fbm |
Logical: is |
Sigma_21 |
Covariance matrix between the training and the testing data. Required if |
Details
- lp (linear predictor): uses the product of testX and the beta coefficients of fit to predict new values of the outcome. This does not incorporate the correlation structure of the data.
- blup (acronym for Best Linear Unbiased Predictor): adds to the lp a value that represents the estimated random effect. This addition is a way of incorporating the estimated correlation structure of the data into our prediction of the outcome.
- coefficients: returns the estimated beta-hat.
- vars: returns the indices of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
- nvars: returns the number of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
Note: the main difference between this function and the predict.plmm() method is that
here in CV, the standardized testing data (std_test_X), Sigma_11, and Sigma_21 are calculated in cvf() instead of the function defined here.
Value
A numeric vector of predicted values
A function to format the time
Description
A function to format the time
Usage
pretty_time()
Value
A string with the formatted current date and time
Print method for summary.cv_plmm objects
Description
Print method for summary.cv_plmm objects
Usage
## S3 method for class 'summary.cv_plmm'
print(x, digits, ...)
Arguments
x |
An object of class |
digits |
The number of digits to use in formatting output |
... |
Not used |
Value
Nothing is returned; instead, a message is printed to the console summarizing the results of the cross-validated model fit.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
print(summary(cv_fit))
A function to print the summary of a plmm model
Description
A function to print the summary of a plmm model
Usage
## S3 method for class 'summary.plmm'
print(x, ...)
Arguments
x |
A |
... |
Not used |
Value
Nothing is returned; instead, a message is printed to the console summarizing the results of the model fit.
Examples
lam <- rev(seq(0.01, 1, length.out=20)) |> round(2) # for sake of example
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design, lambda = lam)
fit2 <- plmm(design = admix_design, penalty = "SCAD", lambda = lam)
print(summary(fit, idx = 18))
print(summary(fit2, idx = 18))
A function to read in large data files as a filebacked big.matrix
Description
A function to read in large data files as a filebacked big.matrix
Usage
process_delim(
data_dir,
data_file,
feature_id,
rds_dir = data_dir,
rds_prefix,
logfile = NULL,
overwrite = FALSE,
quiet = FALSE,
...
)
Arguments
data_dir |
The directory to the file. |
data_file |
The file to be read in, without the filepath. This should be a file of numeric values.
Example: use |
feature_id |
A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row, sample, or participant). No duplicates allowed. |
rds_dir |
The directory where the user wants to create the |
rds_prefix |
String specifying the user's preferred filename for the to-be-created |
logfile |
Optional: the name (character string) of the prefix of the logfile to be written in |
overwrite |
Logical: if existing .bk/.rds files exist for the specified directory/prefix, should these be overwritten? Defaults to FALSE. Set to TRUE if you want to change the imputation method you're using, etc. |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
... |
Optional: other arguments to be passed to |
Value
The file path to the newly created .rds file
Examples
temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)
colon2 <- readRDS(colon_dat)
str(colon2)
Preprocess PLINK files using the bigsnpr package
Description
Preprocess PLINK files using the bigsnpr package
Usage
process_plink(
data_dir,
data_prefix,
rds_dir = data_dir,
rds_prefix = NULL,
logfile = NULL,
impute = TRUE,
impute_method = "mode",
id_var = "IID",
parallel = TRUE,
quiet = FALSE,
overwrite = FALSE,
...
)
Arguments
data_dir |
The path to the bed/bim/fam data files, without a trailing "/" (e.g., use |
data_prefix |
The prefix (as a character string) of the bed/fam data files (e.g., |
rds_dir |
The path to the directory in which you want to create the new |
rds_prefix |
String specifying the user's preferred filename for the to-be-created .rds file (will be created inside |
logfile |
Optional: the name (character string) of the prefix of the logfile to be written in |
impute |
Logical: should data be imputed? Defaults to TRUE. |
impute_method |
If
|
id_var |
String specifying which column of the PLINK |
parallel |
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
overwrite |
Logical: if existing .bk/.rds files exist for the specified directory/prefix, should these be overwritten? Defaults to FALSE. Set to TRUE if you want to change the imputation method you're using, etc. |
... |
Optional: additional arguments to |
Details
Three files are created in the location specified by rds_dir:
- rds_prefix.rds: This is a list with three items:
  (1) X: the filebacked bigmemory::big.matrix object pointing to the imputed genotype data. This matrix has type double, which is important for downstream operations in create_design().
  (2) map: a data.frame with the PLINK bim data (i.e., the variant information).
  (3) fam: a data.frame with the PLINK fam data (i.e., the pedigree information).
- rds_prefix.bk: This is the backing file that stores the numeric data of the genotype matrix.
- rds_prefix.desc: This is the description file, needed to attach the genotype matrix to the R session.
Note that process_plink() need only be run once for a given set of PLINK
files; in subsequent data analysis/scripts, get_data() will access the .rds file.
For an example, see vignette on processing PLINK files.
Value
The filepath to the .rds object created; see details for explanation.
A function to read in a large file as a numeric file-backed matrix
Description
Note: this function is a wrapper for bigmemory::read.big.matrix()
Usage
read_data_files(
data_file,
data_dir,
rds_dir,
rds_prefix,
outfile,
overwrite,
quiet,
...
)
Arguments
data_file |
The name of the file to read, not including its directory. Directory should be specified in |
data_dir |
The path to the directory where |
rds_dir |
The path to the directory in which you want to create the new |
rds_prefix |
String specifying the user's preferred filename for the to-be-created .rds/.bk files (will be created inside |
outfile |
Optional: the name (character string) of the prefix of the logfile to be written. Defaults to NULL (no log file written). |
overwrite |
Logical: if existing |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
... |
Optional: other arguments to be passed to |
Value
.rds, .bk, and .desc files are created in rds_dir, and obj (a filebacked bigmemory big.matrix object) is returned. See bigmemory documentation for more info on the big.matrix class.
A function to read in PLINK files using bigsnpr methods
Description
A function to read in PLINK files using bigsnpr methods
Usage
read_plink_files(
data_dir,
data_prefix,
rds_dir,
rds_prefix,
outfile,
parallel,
overwrite,
quiet
)
Arguments
data_dir |
The path to the bed/bim/fam data files, without a trailing "/" (e.g., use |
data_prefix |
The prefix (as a character string) of the bed/fam data files (e.g., |
rds_dir |
The path to the directory in which you want to create the new |
rds_prefix |
String specifying the user's preferred filename for the to-be-created |
outfile |
Optional: the name (character string) of the prefix of the logfile to be written. Defaults to NULL (no log written). |
parallel |
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See |
overwrite |
Logical: if existing |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
.rds and .bk files are created in rds_dir, and obj (a bigSNP object) is returned. See bigsnpr documentation for more info on the bigSNP class.
Calculate a relatedness matrix
Description
Given a matrix of genotypes, this function estimates the genetic relatedness matrix (GRM,
also known as the RRM, see Hayes et al. 2009, doi:10.1017/S0016672308009981) among
the subjects: \frac{1}{p}(XX^T), where X is standardized.
Usage
relatedness_mat(X, std = TRUE)
Arguments
X |
An n x p numeric matrix of genotypes (from fully-imputed data). Can be a filebacked |
std |
Logical: should |
Value
An n x n numeric matrix capturing the genomic relatedness of the
samples represented in X. In our notation, we call this matrix K for 'kinship';
this is also known as the GRM or RRM.
Examples
RRM <- relatedness_mat(X = admix$X)
RRM[1:5, 1:5]
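The formula in the Description can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the toy genotype matrix, the removal of constant columns, and the use of the population standard deviation (dividing by n) for scaling are all assumptions made for the example.

```r
# Toy genotype matrix (assumed data): n subjects, p SNPs coded 0/1/2
set.seed(1)
n <- 6
p <- 40
X <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n, ncol = p)

# Drop constant columns, which cannot be standardized
keep <- apply(X, 2, function(x) max(x) > min(x))
X <- X[, keep, drop = FALSE]

# Column-standardize: center, then scale by the population SD (assumption)
center <- colMeans(X)
Xc <- sweep(X, 2, center, "-")
scale_vals <- sqrt(colMeans(Xc^2))
std_X <- sweep(Xc, 2, scale_vals, "/")

# K = (1/p) X X^T on the standardized scale
K <- tcrossprod(std_X) / ncol(std_X)
K[1:3, 1:3]
```

Under this scaling convention each standardized column has mean square 1, so the diagonal of K averages to 1.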
A function to rotate filebacked data
Description
A function to rotate filebacked data
Usage
rotate_filebacked(prep, tocenter = TRUE, ...)
Arguments
prep |
The object returned by |
tocenter |
Should the matrix be centered in addition to scaled? Defaults to TRUE |
... |
Not used |
Value
A list with 4 items:
- stdrot_X: X on the rotated and re-standardized scale
- rot_y: y on the rotated scale (a numeric vector)
- stdrot_X_center: numeric vector of values used to center rot_X
- stdrot_X_scale: numeric vector of values used to scale rot_X
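As a rough in-memory picture of what "rotated and re-standardized" can mean, the sketch below decorrelates toy data using the eigendecomposition of a relatedness matrix K. It assumes a covariance of the form eta * K + (1 - eta) * I, with eta a hypothetical variance proportion chosen for illustration; whether rotate_filebacked() uses exactly this construction is not stated in this section, so treat it as one plausible version of the idea.

```r
set.seed(2)
n <- 8
p <- 30
std_X <- scale(matrix(rnorm(n * p), n, p))  # toy standardized design (assumed)
y <- rnorm(n)                               # toy outcome (assumed)
K <- tcrossprod(std_X) / p                  # relatedness matrix

eta <- 0.6  # hypothetical share of variance attributed to K

# Eigendecomposition K = U diag(s) U^T
eig <- eigen(K, symmetric = TRUE)
U <- eig$vectors
s <- eig$values

# Rotation matrix: diag(w) U^T with w = (eta * s + 1 - eta)^(-1/2)
w <- (eta * s + (1 - eta))^(-1 / 2)
rot <- diag(w) %*% t(U)

rot_X <- rot %*% std_X
rot_y <- drop(rot %*% y)

# Re-standardize the rotated design, keeping the centering/scaling values
stdrot_X_center <- colMeans(rot_X)
centered <- sweep(rot_X, 2, stdrot_X_center, "-")
stdrot_X_scale <- sqrt(colMeans(centered^2))
stdrot_X <- sweep(centered, 2, stdrot_X_scale, "/")

# Check: the rotation whitens the assumed covariance
Sigma <- eta * K + (1 - eta) * diag(n)
whitened <- rot %*% Sigma %*% t(rot)
```

The final check shows why the rotation is useful: after rotating, the assumed covariance becomes the identity, so observations can be treated as uncorrelated.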
Compute sequence of lambda values for plmm models
Description
Compute sequence of lambda values for plmm models
Usage
setup_lambda(X, y, alpha, lambda_min, nlambda, penalty_factor)
Arguments
X |
Rotated and standardized design matrix which includes the intercept column if present. May include clinical covariates and other non-SNP data. This can be either a matrix or a filebacked |
y |
Continuous outcome vector. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge (L2) penalty. |
lambda_min |
The smallest value for lambda, as a fraction of the maximum lambda. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. A value of |
nlambda |
The desired number of lambda values in the sequence to be generated. |
penalty_factor |
A multiplicative factor for the penalty applied to each coefficient. If supplied, |
Value
a numeric vector of lambda values, equally spaced on the log scale
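A sequence that is "equally spaced on the log scale" can be built as below. The helper log_lambda_seq() is a hypothetical name, and the formula used for the maximum lambda (the lasso-style convention max |x_j' y| / (n * alpha) on standardized data) is an assumption for the example, not necessarily what setup_lambda() computes.

```r
# Hypothetical helper: nlambda values from lambda_max down to
# lambda_min * lambda_max, equally spaced on the log scale
log_lambda_seq <- function(lambda_max, lambda_min, nlambda) {
  exp(seq(log(lambda_max), log(lambda_min * lambda_max), length.out = nlambda))
}

set.seed(3)
n <- 20
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))  # toy standardized design (assumed)
y <- rnorm(n)                           # toy outcome (assumed)
alpha <- 1

# Assumed convention for the largest lambda worth considering
lambda_max <- max(abs(crossprod(X, y - mean(y)))) / (n * alpha)

lambda <- log_lambda_seq(lambda_max, lambda_min = 0.05, nlambda = 10)
lambda
```

Because lambda_min is a fraction of the maximum, the ratio of the last value to the first equals lambda_min, and consecutive values differ by a constant multiplicative factor.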
A helper function to standardize a filebacked matrix
Description
A helper function to standardize a filebacked matrix
Usage
standardize_filebacked(X, outfile, quiet, tocenter = TRUE)
Arguments
X |
A |
outfile |
Optional: the name (character string) of the logfile to be written. |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
tocenter |
Should the matrix be centered in addition to scaled? Defaults to TRUE. |
Value
A list with a component called std_X, a filebacked big.matrix with column-standardized data.
The list also includes several other indices and metadata describing the standardized matrix.
A helper function to standardize matrices
Description
A helper function to standardize matrices
Usage
standardize_in_memory(X, tocenter = TRUE)
Arguments
X |
A matrix |
tocenter |
Should the matrix be centered in addition to scaled? Defaults to TRUE. |
Details
This function is adapted from https://github.com/pbreheny/ncvreg/blob/master/R/std.R
NOTE: this function returns a matrix in memory. For standardizing filebacked
data, use standardize_filebacked() – see src/big_standardize.cpp
Value
A list containing the standardized X matrix and associated metadata
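The kind of list described above might look like the following sketch. The function name standardize_sketch() and the element names of its return value are hypothetical, and scaling by the population standard deviation is an assumption; the point is only to show centering, scaling, and the bookkeeping of non-constant columns.

```r
# Hypothetical in-memory standardization, returning the matrix plus metadata
standardize_sketch <- function(X, tocenter = TRUE) {
  # Indices of non-constant ("non-singular") columns
  ns <- which(apply(X, 2, function(x) max(x) > min(x)))
  X <- X[, ns, drop = FALSE]

  # Optionally center, then scale each column to mean square 1
  center <- if (tocenter) colMeans(X) else rep(0, ncol(X))
  Xc <- sweep(X, 2, center, "-")
  scale_vals <- sqrt(colMeans(Xc^2))
  std_X <- sweep(Xc, 2, scale_vals, "/")

  list(std_X = std_X, center = center, scale = scale_vals, ns = ns)
}

# Toy matrix (assumed data); column b is constant and is dropped
X <- cbind(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5), c = c(0, 1, 0, 1))
res <- standardize_sketch(X)
res$std_X
res$ns
```

Keeping center, scale, and ns alongside the standardized matrix is what makes it possible to map coefficients back to the original scale later.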
A helper function to subset big.matrix objects
Description
A helper function to subset big.matrix objects
Usage
subset_filebacked(X, new_file, complete_samples, ns, rds_dir, outfile, quiet)
Arguments
X |
A filebacked |
new_file |
Optional user-specified new file for the to-be-created .rds/.bk files. |
complete_samples |
Numeric vector with indices marking the rows of the original data which have a non-missing entry in the 6th column of the |
ns |
Numeric vector with the indices of the non-singular columns |
rds_dir |
The path to the directory in which you want to create the new |
outfile |
Optional: the name (character string) of the logfile to be written. |
quiet |
Logical: should console messages be silenced? Defaults to FALSE |
Value
A list with two components. First, a big.matrix object, subset_X, representing a design matrix wherein:
- rows are subset to those with complete phenotype information
- columns are subset so that no constant features remain; this is important for standardization downstream
The list also includes the integer vector ns, which marks which columns of the original matrix were 'non-singular' (i.e., not constant features).
The ns index plays an important role in plmm_format() and untransform().
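The two subsetting steps can be illustrated on an ordinary in-memory matrix. The toy data and the choice to check for constant columns after removing incomplete samples are assumptions for this sketch; the actual function operates on a filebacked big.matrix.

```r
# Toy genotype matrix (assumed data): 4 samples, 4 SNPs
X <- matrix(c(0, 1, 2, 1,
              1, 1, 1, 1,
              2, 0, 1, 2,
              0, 0, 0, 0),
            nrow = 4,
            dimnames = list(paste0("id", 1:4), paste0("snp", 1:4)))
pheno <- c(1.2, NA, 0.7, 2.1)  # sample 2 has a missing phenotype

# Rows: keep samples with a non-missing phenotype
complete_samples <- which(!is.na(pheno))
X_rows <- X[complete_samples, , drop = FALSE]

# Columns: keep features that are not constant among the retained samples
ns <- which(apply(X_rows, 2, function(x) max(x) > min(x)))
subset_X <- X_rows[, ns, drop = FALSE]

subset_X
ns
```

Here snp2 and snp4 are constant and are removed, so ns records that columns 1 and 3 of the original matrix survive.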
A summary function for cv_plmm objects
Description
A summary function for cv_plmm objects
Usage
## S3 method for class 'cv_plmm'
summary(object, lambda = "min", ...)
Arguments
object |
A |
lambda |
The regularization parameter value at which inference should be reported. Can choose a numeric value, 'min', or '1se'. Defaults to 'min'. |
... |
Not used |
Value
The return value is an object with S3 class summary.cv_plmm. The class has its own print method and contains the following list elements:
- lambda_min: The lambda value at the minimum cross validation error
- lambda.1se: The maximum lambda value within 1 standard error of the minimum cross validation error
- penalty: The penalty applied to the fitted model
- nvars: The number of non-zero coefficients at the selected lambda value
- cve: The cross validation error at all folds
- min: The minimum cross validation error
- fit: The plmm fit used in the cross validation
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
summary(cv_fit)
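The 'min' and '1se' choices of lambda follow the definitions given in the Value section. The sketch below applies those definitions to made-up cross validation results; the vectors lambda, cve, and cvse are toy values, not output from cv_plmm().

```r
# Toy cross validation results (assumed values), ordered from large to small lambda
lambda <- c(1, 0.5, 0.25, 0.125, 0.0625)
cve    <- c(3.0, 2.2, 1.9, 1.8, 1.85)   # mean CV error at each lambda
cvse   <- c(0.3, 0.25, 0.2, 0.15, 0.2)  # standard error of the CV error

# 'min': the lambda with the smallest cross validation error
min_idx <- which.min(cve)
lambda_min <- lambda[min_idx]

# '1se': the largest lambda whose error is within one standard error of the minimum
threshold <- cve[min_idx] + cvse[min_idx]
lambda_1se <- max(lambda[cve <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

The 1se choice trades a small amount of estimated prediction error for a larger penalty and therefore a sparser model.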
A summary method for plmm objects
Description
A summary method for plmm objects
Usage
## S3 method for class 'plmm'
summary(object, lambda, idx, eps = 1e-05, ...)
Arguments
object |
An object of class |
lambda |
The regularization parameter value at which inference should be reported. |
idx |
Alternatively, |
eps |
If lambda is given, |
... |
Not used |
Value
The return value is an object with S3 class summary.plmm. The class has its own print method and contains the following list elements:
- penalty: The penalty used by plmm (e.g. SCAD, MCP, lasso)
- n: Number of instances/observations
- std_X_n: the number of observations in the standardized data; the only time this would differ from n is if data are from PLINK and the external data does not include all the same samples
- p: Number of regression coefficients (not including the intercept)
- converged: Logical indicator for whether the model converged
- lambda: The lambda value at which inference is being reported
- lambda_char: A formatted character string indicating the lambda value
- nvars: The number of nonzero coefficients (again, not including the intercept) at that value of lambda
- nonzero: The column names indicating the nonzero coefficients in the model at the specified value of lambda
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
summary(fit, idx = 97)
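To make the nvars and nonzero elements concrete, the sketch below derives them from a small coefficient matrix with one column per lambda value. The matrix beta and its dimnames are toy values standing in for a fitted model's coefficient path; this is not how summary() is implemented, only what the two quantities mean.

```r
# Toy coefficient path (assumed values): rows are coefficients, columns are lambda values
beta <- matrix(c(0.8, 0,   0,   0,
                 0.5, 0,   1.2, -0.3,
                 0.4, 0.1, 1.5, -0.6),
               nrow = 4,
               dimnames = list(c("(Intercept)", "snp1", "snp2", "snp3"),
                               c("0.9", "0.3", "0.1")))

idx <- 2  # report results at the second lambda value

# Exclude the intercept, then find the nonzero coefficients
coefs <- beta[-1, idx]
nonzero <- names(coefs)[coefs != 0]
nvars <- length(nonzero)

nvars
nonzero
```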
Untransform coefficient values back to the original scale
Description
This function unwinds the initial standardization of the data to obtain
coefficient values on their original scale. It is called by plmm_format().
Usage
untransform(
std_scale_beta,
p,
std_X_details,
fbm_flag,
plink_flag,
use_names = TRUE
)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements |
fbm_flag |
Logical: is the corresponding design matrix filebacked? |
plink_flag |
Logical: did these data come from PLINK files?
Note: This flag matters because of how non-genomic features
are handled for PLINK files – in data from PLINK files,
unpenalized columns are not counted in the |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coefficients, untransformed_beta, that is on the scale of the original data.
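The algebra behind undoing a column standardization can be checked with ordinary least squares, where the answer is known. The sketch below is an illustration under assumptions: it uses toy data, an unpenalized lm() fit in place of a penalized model, and centering/scaling values computed in the example itself. Each slope is divided by its column's scale, and the intercept is shifted by the centered contribution.

```r
set.seed(42)
n <- 50
X <- cbind(a = rnorm(n, 5, 2), b = rnorm(n, -1, 0.5), c = rnorm(n, 0, 3))  # toy design
y <- drop(1 + X %*% c(2, 0, -1) + rnorm(n))                               # toy outcome

# Standardize columns, keeping the values needed to undo the transformation
center <- colMeans(X)
Xc <- sweep(X, 2, center, "-")
scale_vals <- sqrt(colMeans(Xc^2))
std_X <- sweep(Xc, 2, scale_vals, "/")

# Coefficients on the standardized scale (OLS used here for simplicity)
std_scale_beta <- coef(lm(y ~ std_X))

# Undo the standardization
untransformed_beta <- std_scale_beta
untransformed_beta[-1] <- std_scale_beta[-1] / scale_vals
untransformed_beta[1] <- std_scale_beta[1] - sum(center * untransformed_beta[-1])

# Reference: fit directly on the original scale
direct <- coef(lm(y ~ X))
cbind(untransformed = unname(untransformed_beta), direct = unname(direct))
```

The two columns agree, which confirms the back-transformation; columns dropped as constant before standardization would additionally need to be reinserted with coefficient zero.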
Untransform coefficient values back to the original scale for file-backed data
Description
This function unwinds the initial standardization of the data to obtain
coefficient values on their original scale. It is called by plmm_format().
Usage
untransform_delim(std_scale_beta, p, std_X_details, use_names = TRUE)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coefficients, untransformed_beta, that is on the scale of the original data.
Untransform coefficient values back to the original scale in memory
Description
This function unwinds the initial standardization of the data to obtain
coefficient values on their original scale. It is called by plmm_format().
Usage
untransform_in_memory(std_scale_beta, p, std_X_details, use_names = TRUE)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coefficients, untransformed_beta, that is on the scale of the original data.
Untransform coefficient values back to the original scale for file-backed data
Description
This function unwinds the initial standardization of the data to obtain
coefficient values on their original scale. It is called by plmm_format().
Usage
untransform_plink(std_scale_beta, p, std_X_details, use_names = TRUE)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coefficients, untransformed_beta, that is on the scale of the original data.
Companion function to unzip the .gz files that ship with the plmmr package.
Description
Companion function to unzip the .gz files that ship with the plmmr package.
Usage
unzip_example_data(outdir)
Arguments
outdir |
The file path to the directory to which the |
Details
For an example of this function, look at vignette('plink_files', package = "plmmr").
Value
Nothing is returned; the PLINK files that ship with the plmmr package are stored in the directory specified by outdir.