ghcm is an R package used to perform conditional independence tests for densely observed functional data.
This vignette gives a brief overview of the usage of the ghcm package. We first present the idea behind the GHCM and the conditions under which the test is valid. Subsequently, we provide several examples of the usage of the ghcm package by analysing a simulated dataset.
In this section we briefly describe the idea behind the GHCM. For the full technical details and theoretical results, see [1].
Let \(X\), \(Y\) and \(Z\) be random variables of which we are given \(n\) i.i.d. observations \((X_1, Y_1, Z_1), \dots, (X_n, Y_n, Z_n)\) and where \(X\), \(Y\) and \(Z\) can be either scalar or functional. Existing methods, such as the GCM [2] implemented in the GeneralisedCovarianceMeasure package [3], can deal with most cases where both \(X\) and \(Y\) are scalar; hence our primary interest is in the cases where at least one of \(X\) and \(Y\) is functional. For the moment, we think of all functional observations as being fully observed.
The GHCM estimates the expected conditional covariance of \(X\) and \(Y\) given \(Z\), \(\mathscr{K}\), and rejects the hypothesis \(X \mbox{${}\perp\mkern-11mu\perp{}$}Y \,|\,Z\) if the Hilbert-Schmidt norm of \(\mathscr{K}\) is large. To describe the algorithm, we utilise outer products \(x \otimes y\), which can be thought of as a possibly infinite-dimensional generalisation of the matrix outer product \(xy^T\) (for precise definitions, we refer to [1]).
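In brief, and following [1], the test regresses \(X\) on \(Z\) and \(Y\) on \(Z\), forms the residuals \(\hat{\varepsilon}_i = X_i - \hat{\mathbb{E}}(X_i \,|\,Z_i)\) and \(\hat{\xi}_i = Y_i - \hat{\mathbb{E}}(Y_i \,|\,Z_i)\), and estimates \(\mathscr{K}\) by an average of the outer products \(\hat{\varepsilon}_i \otimes \hat{\xi}_i\). The hypothesis is rejected when the (suitably scaled) Hilbert-Schmidt norm of this estimate is large compared to an estimate of its asymptotic null distribution; see [1] for the exact scaling and the form of the limiting distribution.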
Assuming that the regression methods perform sufficiently well, the GHCM has uniformly distributed \(p\)-values when the null is true. It should be noted that there are situations where \(X \mbox{${}\not\!\perp\mkern-11mu\perp{}$}Y \,|\,Z\) but the GHCM is unable to detect this dependence for any sample size, since \(\mathscr{K}\) can be zero in this case.
In practice, we do not observe the functional data fully but rather at a discrete set of values. To deal with this, we utilise functional principal components analysis (FPCA) to express \(\hat{\varepsilon}\) and \(\hat{\xi}\) (in the case that these are functional) as vectors.
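As an illustration of this kind of dimension reduction (a sketch only; not necessarily the exact internal procedure of ghcm_test), FPCA scores for curves observed on a common grid can be computed with, for instance, the fpca.sc function from the refund package:
library(refund)
# Placeholder residual curves: 500 curves observed on an equidistant grid of 101 points.
resid_curves <- matrix(rnorm(500 * 101), nrow = 500)
fpca_fit <- fpca.sc(Y = resid_curves, argvals = seq(0, 1, length.out = 101), pve = 0.99)
scores <- fpca_fit$scores  # one row of FPCA scores per curve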
To give concrete examples of the usage of the package, we perform conditional independence tests on a simulated dataset consisting of both functional and scalar variables. The functional variables are observed on a common equidistant grid of \(101\) points on \([0, 1]\).
library(ghcm)
set.seed(111)
data(ghcm_sim_data)
grid <- seq(0, 1, length.out=101)
colnames(ghcm_sim_data)
#> [1] "Y_1" "Y_2" "X" "Z" "W"
ghcm_sim_data consists of 500 observations of the scalar variables \(Y_1\) and \(Y_2\) and the functional variables \(X\), \(Z\) and \(W\). The curves and the estimated mean curve for each functional variable can be seen in Figures 1, 2 and 3.
Figure 1: Plot of \(X\) with the estimated mean curve in red.
Figure 2: Plot of \(Z\) with the estimated mean curve in red.
Figure 3: Plot of \(W\) with the estimated mean curve in red.
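For reference, a plot like Figure 1 can be produced with base graphics, assuming (as in the regressions below) that each functional variable is stored as a matrix column of ghcm_sim_data with one curve per row:
X_mat <- ghcm_sim_data$X
matplot(grid, t(X_mat), type = "l", lty = 1, col = "grey", xlab = "t", ylab = "X")
lines(grid, colMeans(X_mat), col = "red", lwd = 2)  # estimated mean curve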
In all of the upcoming examples we will use functions from the refund R package [4] to perform regressions. We will not attempt to justify the validity of the regression models we use here, as the upcoming tests are only included for illustrative purposes. In actual applications of the GHCM, however, it is critical that the regression methods employed estimate the conditional expectations \(\mathbb{E}(X \,|\,Z)\) and \(\mathbb{E}(Y \,|\,Z)\) sufficiently well for the \(p\)-values to be valid. Any use of the GHCM should be prefaced by an analysis of the performance of the regression methods in use.
We first test whether \(Y_1\) and \(Y_2\) are conditionally independent given the functional variables. This is relevant if, say, we’re trying to predict \(Y_1\) and we want to know whether including \(Y_2\) as a predictor would be helpful. A naive correlation-based approach would suggest that \(Y_2\) could be relevant, since the two variables are marginally correlated.
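Such a check could, for instance, be carried out with a simple marginal correlation test (a sketch; the exact estimate and \(p\)-value depend on the simulated data):
cor.test(ghcm_sim_data$Y_1, ghcm_sim_data$Y_2)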
To perform the conditional independence test, we need a scalar-on-function regression method and we will use the pfr function from the refund package [4] with lf-terms. We run the test in the following code:
library(refund)
m_1 <- pfr(Y_1 ~ lf(X) + lf(Z) + lf(W), data=ghcm_sim_data)
m_2 <- pfr(Y_2 ~ lf(X) + lf(Z) + lf(W), data=ghcm_sim_data)
test <- ghcm_test(resid(m_1), resid(m_2), X_grid = NA, Y_grid = NA)
print(test)
#> H0: X _||_ Y | Z, p: 0.959904
#> Not rejected at 5 % level
#> Test statistic: 3.025583e-06
Since both \(Y_1\) and \(Y_2\) are scalar, we set X_grid and Y_grid to be NA. This tells the ghcm_test function to treat the variables as real-valued observations rather than functional observations. We get a \(p\)-value of 0.96 and an estimate of the test statistic of 3.03e-06. It should be noted that since the asymptotic distribution of the test statistic depends on the underlying distribution, there is no way to know the \(p\)-value from the test statistic alone. However, we can get an idea of how extreme the test statistic is by plotting the asymptotic distribution and the test statistic together. This can be done by simply calling plot on the ghcm object, in this case plot(test), which results in the plot seen in Figure 4.
Figure 4: Plot of the estimated asymptotic test distribution under the null with the red line indicating the observed value.
We now test whether \(Y_1 \mbox{${}\perp\mkern-11mu\perp{}$}X \,|\,Z\). This is relevant if we’re interested in modelling \(Y_1\) and want to determine whether \(X\) should be included in a model that already includes \(Z\). We can plot \(X\) and \(Z\) and color the curves based on the value of \(Y_1\) as can be seen in Figures 5 and 6 below.
Figure 5: Plot of \(Z\) with colors based on the value of \(Y_1\).
Figure 6: Plot of \(X\) with colors based on the value of \(Y_1\).
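As a rough indication of how Figure 5 might be produced (again assuming that \(Z\) is stored as a matrix column with one curve per row), the curves can be coloured according to the value of \(Y_1\):
Z_mat <- ghcm_sim_data$Z
cols <- hcl.colors(100)[as.integer(cut(ghcm_sim_data$Y_1, 100))]  # map Y_1 to a colour scale
matplot(grid, t(Z_mat), type = "l", lty = 1, col = cols, xlab = "t", ylab = "Z")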
It appears that both of the functional variables contain information about \(Y_1\). To use the GHCM for this test, in addition to the scalar-on-function regression employed in the previous section, we will need to be able to perform function-on-function regressions. This is done using the pffr function in the refund package [4] with ff terms. We run the test in the following code:
m_1 <- pfr(Y_1 ~ lf(Z), data = ghcm_sim_data)
m_X <- pffr(X ~ ff(Z), data = ghcm_sim_data, chunk.size = 31000)
test <- ghcm_test(resid(m_X), resid(m_1), X_grid = grid, Y_grid = NA)
print(test)
#> H0: X _||_ Y | Z, p: 0.8811119
#> Not rejected at 5 % level
#> Test statistic: 0.0009254735
The Y_1 variable is scalar, hence we set Y_grid = NA; however, since X is functional and observed on the grid grid, we set X_grid = grid. This tells the ghcm_test function to treat the first set of variables as functional observations (and hence perform FPCA on these) and the second set of variables as scalars. We get a \(p\)-value of 0.88 and an estimate of the test statistic of 9.25e-04. As before, we call plot(test) to plot the estimated null distribution of the test statistic, which can be seen in Figure 7.
Figure 7: Plot of the estimated asymptotic test distribution under the null with the red line indicating the observed value.
Finally, we test whether \(X \mbox{${}\perp\mkern-11mu\perp{}$}W \,|\,Z\), which could be relevant in creating prediction models for \(X\) or \(W\) or in simply ascertaining the relationships between the functional variables. We run the test in the following code:
m_X <- pffr(X ~ ff(Z), data=ghcm_sim_data, chunk.size=31000)
m_W <- pffr(W ~ ff(Z), data=ghcm_sim_data, chunk.size=31000)
test <- ghcm_test(resid(m_X), resid(m_W), X_grid = grid, Y_grid = grid)
print(test)
#> H0: X _||_ Y | Z, p: 0.6377362
#> Not rejected at 5 % level
#> Test statistic: 0.0006764566
Both variables are functional, hence we set both X_grid and Y_grid to be grid. We get a \(p\)-value of \(0.638\) and an estimate of the test statistic of 6.76e-04. To get an idea of how extreme the observed value of the test statistic is, we plot the asymptotic distribution as before, which can be seen in Figure 8.
Figure 8: Plot of the estimated asymptotic test distribution under the null with the red line indicating the observed value.
[1] A. R. Lundborg, R. D. Shah, and J. Peters, “Conditional independence testing in Hilbert spaces with applications to functional data analysis,” 2021. [Online]. Available: https://arxiv.org/abs/2101.07108.
[2] R. D. Shah and J. Peters, “The hardness of conditional independence testing and the generalised covariance measure,” Annals of Statistics, vol. 48, no. 3, pp. 1514–1538, 2020.
[3] J. Peters and R. D. Shah, GeneralisedCovarianceMeasure: Test for conditional independence based on the generalized covariance measure (GCM). 2019.
[4] J. Goldsmith et al., refund: Regression with functional data. 2020.