Processing math: 29%

Example 2: Using SEAGLE with Simulated Data

This tutorial demonstrates how to use the SEAGLE package when the user inputs a matrix G. We’ll begin by loading the SEAGLE package.

library(SEAGLE)

As an example, we’ll generate some synthetic data for usage in this tutorial. Let’s consider a dataset with n=5000 individuals and L=100 loci, where the first 40 are causal.

The makeSimData function generates a covariate matrix ˜XRn×3, where the first column is the all ones vector for the intercept and the second and third columns are XN(0,1) and EN(0,1), respectively. The last two columns are scaled to have 0 mean and unit variance.

The makeSimData function additionally generates the genetic marker matrix G with synthetic haplotype data from the COSI software. Detailed procedures for generating G can be found in the accompanying journal manuscript. Finally, the makeSimData function also generates a continuous phenotype y according to the following fixed effects model {\bf y} = \tilde{\bf X} \boldsymbol{\gamma}_{\widetilde{\bf X}} + {\bf G}\boldsymbol{\gamma}_{G} + \text{diag}(E){\bf G}\boldsymbol{\gamma}_{GE} + {\bf e}. Here, \boldsymbol{\gamma}_{\tilde{\bf X}} is the all ones vector of length P=3, \boldsymbol{\gamma}_{G} \in \mathbb{R}^{L}, \boldsymbol{\gamma}_{GE}\in \mathbb{R}^{L}, and {\bf e} \sim \text{N}({\bf 0}, \sigma\, {\bf I}_{n}). The entries of \boldsymbol{\gamma}_{G} and \boldsymbol{\gamma}_{GE} pertaining to causal loci are set to be \gamma_{G} = gammaG and \gamma_{GE} = gammaGE, respectively. The remaining entries of \boldsymbol{\gamma}_{G} and \boldsymbol{\gamma}_{GE} pertaining to non-causal loci are set to 0.

dat <- makeSimData(H=cosihap, n=5000, L=100, gammaG=1, gammaGE=0, causal=40, seed=1)

Now that we have our data, we can prepare it for use in the SEAGLE algorithm. We will input our {\bf y}, {\bf X}, {\bf E}, and {\bf G} into the prep.SEAGLE function. The intercept = 1 parameter indicates that the first column of {\bf X} is the all ones vector for the intercept.

This preparation procedure formats the input data for the SEAGLE function by checking the dimensions of the input data. It also pre-computes a QR decomposition for \widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n} & {\bf X} & {\bf E} \end{pmatrix}, where {\bf 1}_{n} denotes the all ones vector of length n.

objSEAGLE <- prep.SEAGLE(y=dat$y, X=dat$X, intercept=1, E=dat$E, G=dat$G)

Finally, we’ll input the prepared data into the SEAGLE function to compute the score-like test statistic T and its corresponding p-value. The init.tau and init.sigma parameters are the initial values for \tau and \sigma employed in the REML EM algorithm.

res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
res$T
#> [1] 246.1886
res$pv
#> [1] 0.8441451

The score-like test statistic T for the G\timesE effect and its corresponding p-value can be found in res$T and res$pv, respectively.