% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dgp.R
\name{dgp}
\alias{dgp}
\title{Deep Gaussian process emulator construction}
\usage{
dgp(
  X,
  Y,
  depth = 2,
  node = ncol(X),
  name = "sexp",
  lengthscale = 1,
  bounds = NULL,
  prior = "ga",
  share = TRUE,
  nugget_est = FALSE,
  nugget = NULL,
  scale_est = TRUE,
  scale = 1,
  connect = TRUE,
  likelihood = NULL,
  training = TRUE,
  verb = TRUE,
  check_rep = TRUE,
  vecchia = FALSE,
  M = 25,
  ord = NULL,
  N = ifelse(vecchia, 200, 500),
  cores = 1,
  blocked_gibbs = TRUE,
  ess_burn = 10,
  burnin = NULL,
  B = 10,
  internal_input_idx = NULL,
  linked_idx = NULL,
  id = NULL
)
}
\arguments{
\item{X}{a matrix where each row is an input training data point and each column represents an input dimension.}

\item{Y}{a matrix containing observed training output data. Each row of the matrix is an output data point and each column represents
an output dimension. When \code{likelihood} (see below) is not \code{NULL}, \code{Y} must be a matrix with a single column.}

\item{depth}{number of layers (including the likelihood layer) for a DGP structure. \code{depth} must be at least \code{2}.
Defaults to \code{2}.}

\item{node}{number of GP nodes in each layer (except for the final layer or the layer feeding the likelihood node) of the DGP. Defaults to
\code{ncol(X)}.}

\item{name}{a character or a vector of characters that indicates the kernel functions (either \code{"sexp"} for squared exponential kernel or
\code{"matern2.5"} for Matérn-2.5 kernel) used in the DGP emulator:
\enumerate{
\item if a single character is supplied, the corresponding kernel function will be used for all GP nodes in the DGP hierarchy.
\item if a vector of characters is supplied, each character of the vector specifies the kernel function that will be applied to all GP nodes in the corresponding layer.
}

Defaults to \code{"sexp"}.}

\item{lengthscale}{initial lengthscales for GP nodes in the DGP emulator. It can be a single numeric value or a vector:
\enumerate{
\item if it is a single numeric value, the value will be applied as the initial lengthscales for all GP nodes in the DGP hierarchy.
\item if it is a vector, each element of the vector specifies the initial lengthscales that will be applied to all GP nodes in the corresponding layer.
The vector should have a length of \code{depth} if \code{likelihood = NULL} or a length of \code{depth - 1} if \code{likelihood} is not \code{NULL}.
}

Defaults to a numeric value of \code{1.0}.}

\item{bounds}{the lower and upper bounds of lengthscales in GP nodes. It can be a vector or a matrix:
\enumerate{
\item if it is a vector, the lower bound (the first element of the vector) and upper bound (the second element of the vector) will be applied to
lengthscales for all GP nodes in the DGP hierarchy.
\item if it is a matrix, each row of the matrix specifies the lower and upper bounds of lengthscales for all GP nodes in the corresponding layer.
The number of rows of the matrix should equal \code{depth} if \code{likelihood = NULL} or \code{depth - 1} if \code{likelihood} is not \code{NULL}.
}

Defaults to \code{NULL} where no bounds are specified for the lengthscales.}

\item{prior}{prior to be used for MAP estimation of lengthscales and nuggets of all GP nodes in the DGP hierarchy:
\itemize{
\item gamma prior (\code{"ga"}),
\item inverse gamma prior (\code{"inv_ga"}), or
\item jointly robust prior (\code{"ref"}).
}

Defaults to \code{"ga"}.}

\item{share}{a bool indicating if all input dimensions of a GP node share a common lengthscale. Defaults to \code{TRUE}.}

\item{nugget_est}{a bool or a bool vector that indicates if the nuggets of GP nodes (if any) in the final layer are to be estimated. If a single bool is
provided, it will be applied to all GP nodes (if any) in the final layer. If a bool vector (which must have a length of \code{ncol(Y)}) is provided, each
bool element in the vector will be applied to the corresponding GP node (if any) in the final layer. The value of a bool has the following effects:
\itemize{
\item \code{FALSE}: the nugget of the corresponding GP in the final layer is fixed to the corresponding value defined in \code{nugget} (see below).
\item \code{TRUE}: the nugget of the corresponding GP in the final layer will be estimated with the initial value given by the corresponding value in \code{nugget} (see below).
}

Defaults to \code{FALSE}.}

\item{nugget}{the initial nugget value(s) of GP nodes (if any) in each layer:
\enumerate{
\item if it is a single numeric value, the value will be applied as the initial nugget for all GP nodes in the DGP hierarchy.
\item if it is a vector, each element of the vector specifies the initial nugget that will be applied to all GP nodes in the corresponding layer.
The vector should have a length of \code{depth} if \code{likelihood = NULL} or a length of \code{depth - 1} if \code{likelihood} is not \code{NULL}.
}

Set \code{nugget} to a small value and the bools in \code{nugget_est} to \code{FALSE} for deterministic emulation, where the emulator
interpolates the training data points. Set \code{nugget} to a larger value and the bools in \code{nugget_est} to \code{TRUE} for stochastic emulation where
the computer model outputs are assumed to follow a homogeneous Gaussian distribution. Defaults to \code{1e-6} if \code{nugget_est = FALSE} and
\code{0.01} if \code{nugget_est = TRUE}. If \code{likelihood} is not \code{NULL} and \code{nugget_est = FALSE}, the nuggets of GPs that feed into the likelihood layer default to
\code{1e-4}.}

\item{scale_est}{a bool or a bool vector that indicates if the variances of GP nodes (if any) in the final layer are to be estimated. If a single bool is
provided, it will be applied to all GP nodes (if any) in the final layer. If a bool vector (which must have a length of \code{ncol(Y)}) is provided, each
bool element in the vector will be applied to the corresponding GP node (if any) in the final layer. The value of a bool has the following effects:
\itemize{
\item \code{FALSE}: the variance of the corresponding GP in the final layer is fixed to the corresponding value defined in \code{scale} (see below).
\item \code{TRUE}: the variance of the corresponding GP in the final layer will be estimated with the initial value given by the corresponding value in \code{scale} (see below).
}

Defaults to \code{TRUE}.}

\item{scale}{the initial variance value(s) of GP nodes (if any) in the final layer. If it is a single numeric value, it will be applied to all GP nodes (if any)
in the final layer. If it is a vector (which must have a length of \code{ncol(Y)}), each numeric in the vector will be applied to the corresponding GP node
(if any) in the final layer. Defaults to \code{1}.}

\item{connect}{a bool indicating whether to implement global input connection to the DGP structure. Setting it to \code{FALSE} may produce a better emulator in some cases at
the cost of slower training. Defaults to \code{TRUE}.}

\item{likelihood}{the likelihood type of a DGP emulator:
\enumerate{
\item \code{NULL}: no likelihood layer is included in the emulator.
\item \code{"Hetero"}: a heteroskedastic Gaussian likelihood layer is added for stochastic emulation where the computer model outputs are assumed to follow a heteroskedastic Gaussian distribution
(i.e., the computer model outputs have input-dependent noise).
\item \code{"Poisson"}: a Poisson likelihood layer is added for emulation where the computer model outputs are counts and a Poisson distribution is used to model them.
\item \code{"NegBin"}: a negative Binomial likelihood layer is added for emulation where the computer model outputs are counts and a negative Binomial distribution is used to capture dispersion variability in input space.
\item \ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#new}{\figure{lifecycle-new.svg}{options: alt='[New]'}}}{\strong{[New]}} \code{"Categorical"}: a categorical likelihood layer is added for emulation (classification), where the computer model output is categorical.
}

When \code{likelihood} is not \code{NULL}, the value of \code{nugget_est} is overridden by \code{FALSE}. Defaults to \code{NULL}.}

\item{training}{a bool indicating if the initialized DGP emulator will be trained.
When set to \code{FALSE}, \code{\link[=dgp]{dgp()}} returns an untrained DGP emulator, to which one can apply \code{\link[=summary]{summary()}} to inspect its specifications
or apply \code{\link[=predict]{predict()}} to check its emulation performance before training. Defaults to \code{TRUE}.}

\item{verb}{a bool indicating if the trace information on DGP emulator construction and training will be printed during the function execution.
Defaults to \code{TRUE}.}

\item{check_rep}{a bool indicating whether to check for repetitions in the dataset, i.e., if one input
position has multiple outputs. Defaults to \code{TRUE}.}

\item{vecchia}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#new}{\figure{lifecycle-new.svg}{options: alt='[New]'}}}{\strong{[New]}} a bool indicating whether to use Vecchia approximation for large-scale DGP emulator construction and prediction. Defaults to \code{FALSE}.}

\item{M}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#new}{\figure{lifecycle-new.svg}{options: alt='[New]'}}}{\strong{[New]}} the size of the conditioning set for the Vecchia approximation in the DGP emulator training. Defaults to \code{25}.}

\item{ord}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#new}{\figure{lifecycle-new.svg}{options: alt='[New]'}}}{\strong{[New]}} an R function that returns the ordering of the input to each GP node contained in the DGP emulator for the Vecchia approximation. The
function must satisfy the following basic rules:
\itemize{
\item the first argument represents the input to a GP node scaled by its lengthscales.
\item the output of the function is a vector of indices that gives the ordering of the input to the GP node.
}

If \code{ord = NULL}, the default random ordering is used. Defaults to \code{NULL}.}

\item{N}{number of iterations for the training. Defaults to \code{500} if \code{vecchia = FALSE} and \code{200} if \code{vecchia = TRUE}. This argument is only used when \code{training = TRUE}.}

\item{cores}{the number of processes to be used to optimize GP components (in the same layer) at each M-step of the training. If set to \code{NULL},
the number of processes is set to \verb{(max physical cores available - 1)} if \code{vecchia = FALSE} and \verb{max physical cores available \%/\% 2} if \code{vecchia = TRUE}.
Only use multiple processes when there is a large number of GP components in different layers and optimization of GP components is computationally expensive. Defaults to \code{1}.}

\item{blocked_gibbs}{a bool indicating if the latent variables are imputed layer-wise using ESS-within-Blocked-Gibbs. ESS-within-Blocked-Gibbs is generally faster and
more efficient than ESS-within-Gibbs, which imputes latent variables node-wise, because it reduces the number of components to be sampled during Gibbs steps,
especially when there is a large number of GP nodes in layers due to higher input dimensions. Defaults to \code{TRUE}.}

\item{ess_burn}{number of burn-in steps for ESS-within-Gibbs
at each I-step of the training. Defaults to \code{10}. This argument is only used when \code{training = TRUE}.}

\item{burnin}{the number of training iterations to be discarded for
point estimates of model parameters. Must be smaller than the number of training iterations \code{N}. If this is not specified, only the last 25\% of iterations
are used. Defaults to \code{NULL}. This argument is only used when \code{training = TRUE}.}

\item{B}{the number of imputations used to produce predictions. Increase the value to refine the representation of imputation uncertainty.
Defaults to \code{10}.}

\item{internal_input_idx}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} The argument will be removed in the next release. To set up connections of emulators for linked emulations,
please use the updated \code{\link[=lgp]{lgp()}} function instead.

Column indices of \code{X} that are generated by the linked emulators in the preceding layers.
Set \code{internal_input_idx = NULL} if the DGP emulator is in the first layer of a system or all columns in \code{X} are
generated by the linked emulators in the preceding layers. Defaults to \code{NULL}.}

\item{linked_idx}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} The argument will be removed in the next release. To set up connections of emulators for linked emulation,
please use the updated \code{\link[=lgp]{lgp()}} function instead.

Either a vector or a list of vectors:
\itemize{
\item If \code{linked_idx} is a vector, it gives indices of columns in the pooled output matrix (formed by column-combined outputs of all
emulators in the feeding layer) that feed into the DGP emulator. The length of the vector must equal the length of \code{internal_input_idx}
when \code{internal_input_idx} is not \code{NULL}. If the DGP emulator is in the first layer of a linked emulator system, the vector gives the column indices of the global
input (formed by column-combining all input matrices of emulators in the first layer) that the DGP emulator will use. If the DGP emulator is to be used in both the first
and subsequent layers, one should initially set \code{linked_idx} to the appropriate values for the situation where the emulator is not in the first layer. Then, use the
function \code{\link[=set_linked_idx]{set_linked_idx()}} to reset the linking information when the emulator is in the first layer.
\item When the DGP emulator is not in the first layer of a linked emulator system, \code{linked_idx} can be a list that gives the information on connections
between the DGP emulator and emulators in all preceding layers. The length of the list should equal the number of layers before
the DGP emulator. Each element of the list is a vector that gives indices of columns in the pooled output matrix (formed by column-combined outputs
of all emulators) in the corresponding layer that feed into the DGP emulator. If the DGP emulator has no connections to any emulator in a certain layer,
set \code{NULL} in the corresponding position of the list. The order of input dimensions in \code{X[,internal_input_idx]} should be consistent with \code{linked_idx}.
For example, a DGP emulator in the 4th layer that is fed by output dimensions 2 and 4 of emulators in layer 2 and output dimensions 1 to 3 of
emulators in layer 3 should have \code{linked_idx = list( NULL, c(2,4), c(1,2,3) )}. In addition, the first and second columns of \code{X[,internal_input_idx]}
should correspond to the output dimensions 2 and 4 from layer 2, and the third to fifth columns of \code{X[,internal_input_idx]} should
correspond to the output dimensions 1 to 3 from layer 3.
}

Set \code{linked_idx = NULL} if the DGP emulator will not be used for linked emulations. However, if this is no longer the case, one can use \code{\link[=set_linked_idx]{set_linked_idx()}}
to add linking information to the DGP emulator. Defaults to \code{NULL}.}

\item{id}{an ID to be assigned to the DGP emulator. If an ID is not provided (i.e., \code{id = NULL}), a UUID (Universally Unique Identifier) will be automatically generated
and assigned to the emulator. Defaults to \code{NULL}.}
}
\value{
An S3 class named \code{dgp} that contains six slots:
\itemize{
\item \code{id}: a number or character string assigned through the \code{id} argument.
\item \code{data}: a list that contains two elements: \code{X} and \code{Y} which are the training input and output data respectively.
\item \code{specs}: a list that contains
\enumerate{
\item \emph{L} (i.e., the number of layers in the DGP hierarchy) sub-lists named \verb{layer1, layer2,..., layerL}. Each sub-list contains \emph{D}
(i.e., the number of GP/likelihood nodes in the corresponding layer) sub-lists named \verb{node1, node2,..., nodeD}. If a sub-list
corresponds to a likelihood node, it contains one element called \code{type} that gives the name (\code{Hetero}, \code{Poisson}, \code{NegBin}, or \code{Categorical}) of the likelihood node.
If a sub-list corresponds to a GP node, it contains four elements:
\itemize{
\item \code{kernel}: the type of the kernel function used for the GP node.
\item \code{lengthscales}: a vector of lengthscales in the kernel function.
\item \code{scale}: the variance value in the kernel function.
\item \code{nugget}: the nugget value in the kernel function.
}
\item \ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} \code{internal_dims}: the column indices of \code{X} that correspond to the linked emulators in the preceding layers of a linked system.
\strong{The slot will be removed in the next release}.
\item \ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} \code{external_dims}: the column indices of \code{X} that correspond to global inputs to the linked system of emulators. It is shown
as \code{FALSE} if \code{internal_input_idx = NULL}. \strong{The slot will be removed in the next release}.
\item \ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} \code{linked_idx}: the value passed to argument \code{linked_idx}. It is shown as \code{FALSE} if the argument \code{linked_idx} is \code{NULL}.
\strong{The slot will be removed in the next release}.
\item \code{seed}: the random seed generated to produce imputations. This information is stored for reproducibility when the DGP emulator (that was saved by \code{\link[=write]{write()}}
with the light option \code{light = TRUE}) is loaded back to R by \code{\link[=read]{read()}}.
\item \code{B}: the number of imputations used to generate the emulator.
\item \ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#new}{\figure{lifecycle-new.svg}{options: alt='[New]'}}}{\strong{[New]}} \code{vecchia}: whether the Vecchia approximation is used for the DGP emulator training.
\item \ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#new}{\figure{lifecycle-new.svg}{options: alt='[New]'}}}{\strong{[New]}} \code{M}: the size of the conditioning set for the Vecchia approximation in the DGP emulator training. \code{M} is generated only when \code{vecchia = TRUE}.
}
\item \code{constructor_obj}: a 'python' object that stores the information of the constructed DGP emulator.
\item \code{container_obj}: a 'python' object that stores the information for the linked emulation.
\item \code{emulator_obj}: a 'python' object that stores the information for the predictions from the DGP emulator.
}

The returned \code{dgp} object can be used by
\itemize{
\item \code{\link[=predict]{predict()}} for DGP predictions.
\item \code{\link[=continue]{continue()}} for additional DGP training iterations.
\item \code{\link[=validate]{validate()}} for LOO and OOS validations.
\item \code{\link[=plot]{plot()}} for validation plots.
\item \code{\link[=lgp]{lgp()}} for linked (D)GP emulator constructions.
\item \code{\link[=window]{window()}} for model parameter trimming.
\item \code{\link[=summary]{summary()}} to summarize the trained DGP emulator.
\item \code{\link[=write]{write()}} to save the DGP emulator to a \code{.pkl} file.
\item \code{\link[=set_imp]{set_imp()}} to change the number of imputations.
\item \code{\link[=design]{design()}} for sequential design.
\item \code{\link[=update]{update()}} to update the DGP emulator with new inputs and outputs.
\item \code{\link[=alm]{alm()}}, \code{\link[=mice]{mice()}}, and \code{\link[=vigf]{vigf()}} to locate next design points.
}
}
\description{
This function builds and trains a DGP emulator.
}
\details{
See further examples and tutorials at \url{https://mingdeyu.github.io/dgpsi-R/}.
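
The layer-wise arguments described above can be illustrated with a minimal, untested sketch. It assumes a three-layer DGP without a likelihood layer, so that \code{lengthscale} has length \code{depth} and \code{bounds} has \code{depth} rows. The toy data and all numeric values are hypothetical choices made only for illustration:
\preformatted{
# hypothetical one-dimensional training data
X <- matrix(seq(0, 1, length = 10), ncol = 1)
Y <- matrix(sin(2 * pi * X[, 1]), ncol = 1)

# one initial lengthscale per layer, and one (lower, upper) row per layer
m <- dgp(X, Y, depth = 3,
         lengthscale = c(1, 0.5, 0.5),
         bounds = rbind(c(0.01, 10), c(0.01, 10), c(0.01, 10)))
}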
}
\note{
Any R vector detected in \code{X} and \code{Y} will be treated as a column vector and automatically converted into a single-column
R matrix. Thus, if \code{X} is a single data point with multiple dimensions, it must be given as a matrix.
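
As a base-R illustration of this rule (the object names are hypothetical), a single three-dimensional input point can be wrapped as a one-row matrix, whereas the bare vector would be read as three one-dimensional points:
\preformatted{
x_vec <- c(0.1, 0.2, 0.3)

# how a bare vector is interpreted: three points, one dimension
dim(matrix(x_vec, ncol = 1))   # 3 1

# one point with three dimensions must be given as a one-row matrix
x_point <- matrix(x_vec, nrow = 1)
dim(x_point)                   # 1 3
}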
}
\examples{
\dontrun{

# load the package and the Python env
library(dgpsi)

# construct a step function
f <- function(x) {
  if (x < 0.5) -1 else 1
}

# generate training data
X <- seq(0, 1, length = 10)
Y <- sapply(X, f)

# set a random seed
set_seed(999)

# training a DGP emulator
m <- dgp(X, Y)
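
# alternatively (a sketch based on the 'training' argument documented above),
# initialise the emulator without training and inspect its specifications;
# 'm0' is a hypothetical name used only in this illustration
m0 <- dgp(X, Y, training = FALSE)
summary(m0)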

# continue for further training iterations
m <- continue(m)

# summarizing
summary(m)
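
# the stored specifications can also be read from the returned object;
# this sketch assumes the slot layout described in the Value section
m$specs$layer1$node1$kernel
m$specs$layer1$node1$lengthscales
m$specs$B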

# trace plot
trace_plot(m)

# trim the traces of model parameters
m <- window(m, 800)

# LOO cross validation
m <- validate(m)
plot(m)

# prediction
test_x <- seq(0, 1, length = 200)
m <- predict(m, x = test_x)

# OOS validation
validate_x <- sample(test_x, 10)
validate_y <- sapply(validate_x, f)
plot(m, validate_x, validate_y)

# write and read the constructed emulator
write(m, 'step_dgp')
m <- read('step_dgp')
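
# a sketch of stochastic emulation with a heteroskedastic Gaussian likelihood;
# the noisy toy simulator and the names 'X_het', 'Y_het', and 'm_het' are
# hypothetical and used only for illustration
X_het <- matrix(runif(50), ncol = 1)
# Y must be a single-column matrix when 'likelihood' is not NULL
Y_het <- matrix(sin(2 * pi * X_het[, 1]) +
                  rnorm(50, sd = 0.05 + 0.2 * X_het[, 1]), ncol = 1)
m_het <- dgp(X_het, Y_het, likelihood = "Hetero")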
}
}
