\name{Cross-Validation for gOMP}
\alias{cv.gomp}

\title{
Cross-Validation for gOMP
}

\description{
Performs k-fold cross-validation to identify the best tolerance value for gOMP.
}

\usage{
cv.gomp(target, dataset, kfolds = 10, folds = NULL, tol = seq(4, 9, by = 1), 
task = "C", metric = NULL, metricbbc = NULL, modeler = NULL, test = NULL, 
method = "ar2", B = 1)
}

\arguments{
\item{target}{
The target or class variable, as in SES and MMPC. Unlike in those functions, it cannot be a single integer indicating the column of the target in the dataset.
}
\item{dataset}{
The dataset object as in SES and MMPC.
}
\item{kfolds}{
The number of folds in the k-fold cross-validation (an integer).
}
\item{folds}{
The folds of the data to use (a list generated by the function generateCVRuns of the TunePareto package). If NULL, the folds are created internally with the same function.
}
\item{tol}{
A vector of tolerance values. 
}
\item{task}{
A character, one of "C", "R" or "S": "C" for classification (logistic, multinomial or ordinal regression), "R" for regression (robust and non-robust linear regression, median regression, 
(zero-inflated) Poisson and negative binomial regression, beta regression) and "S" for survival regression (Cox, Weibull or exponential regression).
}
\item{metric}{
A metric function provided by the user. If NULL, the following functions are used: auc.mxm, mse.mxm and ci.mxm for classification, regression and 
survival analysis tasks, respectively. See details for more. If you know which metric you want, supply it here so that the function does not choose a different one. 
\bold{Note} that the function name is given as is, without quotation marks.
}
\item{metricbbc}{
The same as the argument "metric", except that the name must be given as a character string, i.e. within quotation marks. For example, if metric = auc.mxm, then metricbbc = "auc.mxm". The two arguments must refer to the same metric. This argument is used by the function \code{\link{bbc}}, which performs bootstrap bias correction of the estimated performance (Tsamardinos, Greasidou and Borboudakis, 2018). It is taken into account only if the last argument (B) is greater than 1.
}
\item{modeler}{
A modeling function provided by the user. If NULL, the following functions are used: glm.mxm, lm.mxm and coxph.mxm for classification, regression and survival analysis tasks, respectively. See details for more.
If you know which modeler you want, supply it here so that the function does not choose a different one. \bold{Note} that the function name is given as is, without quotation marks.
}
\item{test}{
A function object that defines the conditional independence test used in the SES function (see also the SES help page). If NULL, "testIndLogistic", "testIndFisher" and "censIndCR" are used for classification, regression and survival analysis tasks, respectively. If you know which test you want, supply it here so that the function does not choose a different one. Not all tests can be used here: "testIndClogit", "testIndMVreg", "testIndIG", "testIndGamma", "testIndZIP" and "testIndTobit" are not available at the moment.
}
\item{method}{
This applies only to "testIndFisher". You can specify either "ar2" for the adjusted R-square or "sse" for the sum of squares of errors. In both cases the tolerance value must be a number between 0 and 1, denoting a percentage: if the percentage increase or decrease is less than this number, the algorithm stops. An alternative is "BIC", in which case the tolerance values are as in all other regression models.
}
\item{B}{
The number of bootstrap re-samples to draw. This argument is used by the function \code{\link{bbc}}, which performs bootstrap bias correction of the estimated performance (Tsamardinos, Greasidou and Borboudakis, 2018). If you have thousands of samples (observations), this might not be necessary, as there is no optimistic bias to be corrected. The sample size above which the correction becomes unnecessary cannot, however, be determined beforehand. SES and MMPC were designed for low sample size cases, hence bootstrap bias correction is probably necessary.
}
}

\details{
For more details see also \code{\link{cv.ses}}.
}

\value{
A list including:
\item{cv_results_all}{
A list with predictions, performances and selected variables for each fold and each tolerance value. The elements are called
"preds", "performances" and "selectedVars".
}
\item{best_performance}{
A numeric value that represents the best average performance.
}
\item{best_configuration}{
A numeric value that represents the best tolerance value.
}
\item{bbc_best_performance}{
The bootstrap bias corrected best performance if B was greater than 1, otherwise this is NULL.
}
\item{runtime}{
The runtime of the cross-validation procedure.
}

The elements can be extracted with the $ operator, i.e. the output is an S3 class object.
}

\references{
Tsamardinos I., Greasidou E. and Borboudakis G. (2018).  
Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. 
Machine Learning 107(12): 1895-1922.  
\url{https://link.springer.com/article/10.1007/s10994-018-5714-4}

Tsagris M., Papadovasilakis Z., Lakiotaki K. and Tsamardinos I. (2020). 
A generalised OMP algorithm for feature selection with application to gene expression data. arXiv preprint.
\url{https://arxiv.org/pdf/2004.00281.pdf}
}

\author{
R implementation and documentation: Michail Tsagris \email{mtsagris@uoc.gr}.
}


\seealso{
\code{ \link{cv.mmpc}, \link{gomp.path}, \link{bbc} }
}

\examples{
set.seed(1234)

# simulate a dataset with continuous data
dataset <- matrix( rnorm(200 * 50), ncol = 50 )
# the target feature is the last column of the dataset as a vector
target <- dataset[, 50]
dataset <- dataset[, -50]

# run a 5-fold CV for the regression task
best_model <- cv.gomp(target, dataset, kfolds = 5, task = "R", 
tol = seq(0.001, 0.01, by = 0.001), method = "ar2")
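
# A minimal sketch of inspecting the result, assuming the returned list
# carries the elements described in the Value section above.
# tolerance value selected by the cross-validation
best_model$best_configuration
# best average cross-validated performance
best_model$best_performance
# bootstrap bias corrected performance; expected to be NULL here,
# since B was left at its default value of 1
best_model$bbc_best_performance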
}

