To install this package, start R (version “4.3”) and enter:
library(plotly)
library(ggplot2)
library(foreach)
library(doParallel)
# library(devtools)
# install_github('koyelucd/idiffomix',force = TRUE)
# library(idiffomix)
Gene expression and DNA methylation are two interconnected biological processes and understanding their relationship is important in advancing understanding in diverse areas, including disease pathogenesis, environmental adaptation, developmental biology, and therapeutic responses. Differential analysis, including the identification of differentially methylated cytosine-guanine dinucleotide (CpG) sites (DMCs) and differentially expressed genes (DEGs) between two conditions, such as healthy and affected samples, can aid understanding of biological processes and disease progression. Typically, gene expression and DNA methylation data are analysed independently to identify DMCs and DEGs which are further analysed to explore relationships between them, or methylation data are aggregated to gene level for analysis. Such approaches ignore the inherent dependencies and biological structure within these related data.
The package idiffomix contains a joint mixture model is proposed that integrates information from the two data types at the modelling stage to capture their inherent dependency structure, enabling simultaneous identification of DMCs and DEGs. The model leverages a joint likelihood function that accounts for the nested structure in the data, with parameter estimation performed using an expectation-maximisation algorithm.
This document gives a quick tour of the functionalities in
idiffomix. See help(package="idiffomix")
for further details and references provided by
citation("idiffomix")
.
Before starting the idiffomix walk through, the user should have a working R software environment installed on their machine. The idiffomix package has the following dependencies which, if not already installed on the machine will automatically be installed along with the package: foreach, doParallel, parallel, mclust, stats, utils, edgeR, magrittr, ggplot2, scales, tidyr, dplyr, reshape2, gridExtra, grid, tidyselect, cowplot.
Assuming that the user has the idiffomix package installed, the user first needs to load the package:
library(idiffomix)
The idiffomix package provides simulated gene expression data, methylation data and gene chromosome data. The data are simulated based on gene expression and methylation data of matched healthy and breast invasive carcinoma tumour samples from The Cancer Genome Atlas (TCGA) that are are publicly available in the Genomic Data Commons (GDC) data portal. The complete data are also available in the GitHub. Due to computational complexities of the package which needs parallel processing and package testing limitations, simulated data are used for this walk-through. The simulated data are assumed to be representing gene expression and methylation data from two biological conditions (benign and tumour) collected from N=4 patients. The genes and CpG sites are assumed to be located on 2 chromosomes (chromosome 1 and 2). Each gene has a number of CpG sites associated with it.
data(gene_expression_data)
data(methylation_data)
data(gene_chromosome_data)
The RNA-Seq data consisted of raw counts depicting the gene
expression levels. To ensure data quality, only genes whose sum of
expression counts across both biological conditions \(> 5\) were retained. The data were
normalized to account for differences in library sizes. The normalized
count data were used to obtain CPM values which were further
log-transformed to obtain log-CPM values. Given the paired design of the
motivating setting, the log-fold changes between the tumour and benign
samples were calculated for each gene in every patient and used in the
subsequent analyses. For the methylation array data, the beta values at
the CpG sites were logit transformed to M-values. Similar to the RNA-Seq
data, given the paired design, the difference in M-values between tumour
and benign samples were calculated for each CpG site in every patient
and used in the subsequent analyses. The function
data_transformation()
filters and transforms the data and
returns a list with two dataframes. The transformed gene expression data
and the transformed methylation data are returned as dataframes.
<- 4
N = data_transformation(seq_data=gene_expression_data,
data_transformed meth_data=methylation_data,
gene_chr=gene_chromosome_data,
N=N)
A joint mixture model approach, termed idiffomix*,
is proposed for the integrated differential analysis of gene expression
data and methylation array data that accounts for their nested
dependency structure. A key dependency matrix parameter is used in the
joint mixture model to allow the methylation state of a CpG site to
depend on the expression level status of the gene to which it is
associated. The model parameters are estimated via an EM algorithm. In
the E-step, as the expected values of the latent variables are
intractable, approximate but tractable estimates are employed. Due to
independence of chromosomes and to ease the computational burden, the
model is fitted to each chromosome independently. When applying to real
dataset the parallel_process argument should be set to TRUE for
efficient execution. Gene expression levels are modeled to undergo three
state changes between benign and tumor conditions: downregulated (E-,
negative log-fold change), upregulated (E+, positive log-fold change),
or non-differentially expressed (E0, log-fold change near zero).
Similarly, methylation levels at CpG sites are classified as
hypomethylated (M-, negative M-value difference), hypermethylated (M+,
positive M-value difference), or non-differentially methylated (M0,
M-value difference near zero). Under idiffomix, these
changes are modeled using and component mixture models for gene
expression and methylation, respectively.
The model parameters are: The function idiffomix()
takes
the transformed data as input and applies the joint mixture model to
estimate the model parameters and the DMCs and DEGs.
<- 4
N = idiffomix(seq_data=data_transformed$seq_transformed,
data_output meth_data=data_transformed$meth_transformed,
gene_chr=gene_chromosome_data,
N=N, K=3, L=3, probs=c(0.25,0.75),
parallel_process = FALSE)
print(data_output$tau[[1]])
#> Cluster1 Cluster2 Cluster3
#> 0.4 0.2 0.4
head(data_output$seq_classification)
#> Gene Patient1 Patient2 Patient3 Patient4 Final_Cluster
#> gene_1 gene_1 2.340567 1.3592557 0.2499695 2.0096230 3
#> gene_10 gene_10 -1.516838 -1.7266992 -2.0006156 -0.9142228 1
#> gene_2 gene_2 -1.810118 -2.4486927 -3.7578701 -3.6756716 1
#> gene_3 gene_3 -2.931101 -2.0544608 -0.6025684 -1.5649753 1
#> gene_4 gene_4 2.608504 1.9117714 0.7904971 3.3069637 3
#> gene_5 gene_5 -1.948577 -0.4773091 -2.1640730 -1.0853623 1
The conditional probabilities estimated when the joint mixture model is applied to data from a chromosome can be visualized. Panel A displays the probability of a gene in the chromosome belonging to each of the K clusters. Panel B details the estimated matrix pi of conditional probabilities of a CpG site’s cluster membership, given its gene’s cluster. Panel C details the conditional probabilities of a gene belonging to cluster \(k\) given a single CpG site associated with the gene belongs to cluster \(l\), computed using Bayes’ theorem, given tau and pi.
plot(data_output,what="chromosome",CHR=1, Gene=NULL,K=3,L=3,
gene_cluster_name=c( "E-","E0","E+"),
cpg_cluster_name=c( "M-","M0","M+"))
#> TableGrob (2 x 1) "arrange": 2 grobs
#> z cells name grob
#> 1 1 (1-1,1-1) arrange gtable[arrange]
#> 2 2 (2-2,1-1) arrange gtable[guide-box]
The log-fold changes and differences in M-values and the estimated posterior probability of the gene belonging to the K clusters can be visualized. Panel A shows the log-fold change and difference in M-values for the given gene and its associated CpG sites while Panel B shows the posterior probabilities of cluster membership for the gene under idiffomix.
plot(data_output,what="gene",CHR=1, Gene="gene_1",K=3,L=3,
gene_cluster_name=c( "E-","E0","E+"),
cpg_cluster_name=c( "M-","M0","M+"))
#> TableGrob (2 x 1) "arrange": 2 grobs
#> z cells name grob
#> 1 1 (1-1,1-1) arrange gtable[arrange]
#> 2 2 (2-2,1-1) arrange gtable[arrange]
Majumdar, K., Jaffrézic, F., Rau, A., Gormley, I. C., & Murphy, T. B. (2024). Integrated differential analysis of multi-omics data using a joint mixture model: idiffomix. arXiv preprint arXiv:2412.17511.